Can You Trust AI Image Analysis? A Zapier and ChatGPT Study

Assessing the reliability of LLM Vision in practical applications

LLM’s Vision capability allows us to analyse and interpret images.

Automating the analysis of images can be useful. Images may be sent as an attachment as part of a production workflow. It may be useful to analyse these.

How reliable and accurate is this analysis? We set out to test this, and here are our findings.

Is LLM image analysis accurate?

This Article’s Use Case

For whatever reason, we need to analyze images. Extract the sentiment and theme of the image, as well as the details within the image. This could include text, promotional details, and product items.

This analysis is part of an automated workflow. As such, the accuracy of the image analysis is paramount.

Our Experiment Setup

We set up our experiment in the following manner:

  1. Analysed two image types;
    1. A banner image;
    2. A complex image of multiple promotional items;
  2. We analysed the images using:
    1. The ChatGPT web interface;
      1. Using ChatGPT-4o model;
      2. Using ChatGPT-4 (paid version);
    2. A Zapier automation triggered by a new email with attachments;
      1. Using the Vision event;
      2. Using the conversation with ChatGPT event;
  3. As a bonus, we compared the image analysis with the analysis of a PDF that is the same as the catalogue image. For this test, we use ChatGPT-4o and ChatGPT-4.
  4. The prompts are engineered for image analysis. They are slightly nuanced for each image type. We go into more detail on the prompts below.

Images we analysed

We analysed the following images:

Image
Banner image of month end products, image analysis
A retail chain's web banner for month end essentials
Image
Complex image of promotional items for image analysis
A low-resolution image of a retail chain's promotional catalogue.

The Prompts

We apply prompt engineering techniques to our prompts.

In both cases, the role we assign is that of a software agent with expert skills in analysing images. A specific skill is that of interpreting and extracting text from the image.

In addition, the tasks we assign are nuance to the image type. The task for the complex catalogue image is such that it leads the LLM to understand that the image it is analysing is a complex catalogue with multiple product promotions. Adding this level of specificity to the prompts ensures a more accurate result.

The Tests - Simple Banner Image

Test 1 - Web Interface - Simple Banner - ChatGPT-4o

### Overview
The image is an advertisement banner featuring various grocery products displayed on a kitchen counter. The theme is focused on promoting "Month-end essentials," encouraging customers to save on pantry items. The design uses a bright yellow background on the right side, with bold black text to catch attention, and a clean, modern kitchen setting on the left to showcase the products.

### Text Elements
1. **Month-end essentials**
2. **Save on all your pantry essentials.**
3. **Shop Now**
4. **FreshPak**
5. **Ariel**
6. **Oros**
7. **Jacobs**
8. **Koo**
9. **Canola Oil**

### Product and Text Elements:

ProductText ElementsConfidence Value
FreshPak Rooibos TeaFreshPak5
Ariel DetergentAriel5
Oros Orange SquashOros5
Jacobs CoffeeJacobs5
Koo Baked BeansKoo5
Canola OilCanola Oil5
Product Unrecognised100% Recycled Paper Towel (Roller)4

### Confidence Value
My confidence in the accuracy of this analysis is **5**.

Test 2 - Web Interface - Simple Banner - ChatGPT-4

### Overview:
The image is an advertisement for "Month-end essentials" featuring various pantry products. The background is a bright, clean kitchen setting, while the right side of the image has a vibrant yellow section with the promotional text. The theme focuses on essential household and grocery items available for purchase.

### Text Elements:
- "Month-end essentials"
- "Save on all your pantry essentials."
- "Shop Now"
- "NEW COLOUR PROTECT"
- "100% ROOIBOS TEA"
- "PURE ROOIBOS"
- "CANOLA OIL"
- "GOLD"

### Identified Products:

ProductText ElementsConfidence Value
Oros Orange SquashOROS5
Freshpak Rooibos TeaFreshPak, 100% ROOIBOS TEA, PURE ROOIBOS5
Canola OilCANOLA OIL, GOLD5
Jacobs CoffeeJACOBS5
Ariel Laundry DetergentARIEL, NEW COLOUR PROTECT5
Koo Green BeansKOO5
Roller Towel (Brand Unclear)1003

### Confidence Evaluation:
- Confidence Value: 5 for most products as they are clearly recognizable and matched with their respective text elements.
- Confidence Value: 3 for the Roller Towel as the brand is not clearly identifiable from the image.

My Observations
  1. The analysis is 'generally accurate'. Perhaps 80% of the detail is reflected;
  2. The overview and theme analysis is good;
  3. Running the same test within the same model, multiple times, produced slight variations;
  4. ChatGPT-4 produced more depth of information.

Test 3 - The Above Compared with Zapier Automation Vision Event

We sent the same banner image as an attachment to a Zapier automation.

Image
Banner image of month end products, image analysis
The Zap
Image
The Zapier Image Analysis Zap
The Zapier Image Analysis Zap

Step 4 of the Zap sends a report via email. Below is the report.

Image
Zapier image analysis email report
Email report on image analysis using Zapier and Vision event
My Observations

The analysis from the Zapier automation using Vision is very good. Better than the analysis using the web interface.

Why is the output better?

Within Zapier, we can adjust and specify the tokens and temperature.

  • Increasing the tokens ensures that the analysis does not get truncated;
  • Lowering the temperature ensures a more literal analysis. A higher temperature allows more creativity, variance and 'interpretation' in the LLM response.
Image
Adjusting tokens and temperature in Zapier image analysis automation
Adjusting tokens and temperature in Zapier image analysis automation.

The Tests - Complex Catalogue Image - Zapier Automation

We created two Zaps that automated the image analysis.

The Zap Flow

  1. Trigger — New Email Matching Search;
    1. Event: Search phrase for the trigger:
      1. Vision: subject:catalogue-analysis-vision;
      2. ChatGPT-4o: subject:vision-4-o;
  2. Google Drive;
    1. Event: Save file to Google Drive;
  3. ChatGPT;
    1. Event 1: Analyse image with Vision;
    2. Event 2: Conversation with ChatGPT;
  4. Gmail;
    1. Event: New Email.

Test 4 - Zapier Automation - Complex Catalogue - OpenAI Vision

We sent the complex catalogue image as an attachment to a Zapier automation.

Image
Zapier automation of complex catalogue image analysis - the report
The email report from the Vision event

View the text version of the report.

Test 5 - Zapier Automation - Complex Catalogue - OpenAI Conversation ChatGPT-4o

We sent the complex catalogue image as an attachment to a Zapier automation and used the ChatGPT Conversation event.

Image
Image analysis using Zapier automation and ChatGPT-4o model
Image analysis report using ChatGPT-4o and Zapier Automation

View the text version of the report.

My Observations

  1. The analysis of both tests was good;
  2. Both models were not 100% accurate in terms of the detail;
  3. Test 5 - using the Conversation Event produced a result with more depth.

The Marked Differences Between the Vision and Conversation Events

Using the Conversation event allowed for much more flexibility. Using the Conversation event, we are able to:

  1. Choose the model;
  2. Set the Tokens, Temperature, and Top p;
  3. Send a User (System) message as well as Assistant instructions.

The Vision event's options are limited:

  1. We can send instructions as a single prompt only;
  2. We can only adjust the tokens.

Test 6 - Analyse the Catalogue as a PDF Using ChatGPT-4o and ChatGPT-4

As a final test, we analysed the same complex catalogue image, but as a PDF.

We used Gpt-4o and GPT-4

PDF Analysis Using ChatGPT-4o

View the output results.

PDF Analysis Using ChatGPT-4

View the output results.

My Observations

While there is the impression of more depth, there is definitely more 'detail', the result is the least reliable.

  1. The Brand was given as Steelhill, not Makro;
  2. ChatGPT-4o and ChatGPT-4 produced almost identical results;
  3. While the product type was generally correct, the Brand Name is 'unrecognised' or incorrect;
  4. The relationship between the product and promotional details, in some cases, is confused;
    1. The details of the Typek and Alva Heater are confused.
  5. Brand names, which are graphical elements in the PDF, are not recognised.

Conclusion

This process gives me a good contextual understanding of the capacity and reliability of ChatGPT Vision. It is important to understand these tools on this level.

  1. Analysing general context, theme, colours, and composition, Vision is good;
  2. Vision is 'generally good' at extracting text detail from images. This is not 100% accurate nor 100% complete;
    1. Perhaps this could be improved with a more optimised prompt and iterations within the same test;
  3. As such, it would not be useful if accuracy and completeness is a requirement.

TL;DR

This article looks into the accuracy of LLMs in image analysis.

We analyse two image types, a simple web banner and a complex promotional catalogue for a retail chain.

We compare two environments with a mix of different models:

  1. ChatGPT Web Interface;
    1. ChatGPT-4o;
    2. ChatGPT-4 (paid);
  2. Zapier Automation using Gmail and ChatGPT;
    1. Event a: Analyse image with Vision;
    2. Event b: Conversation using GPT4-o.

Our observations are:

  1. Analysis of the theme, context, colours and composition is good;
  2. Analysis of the detail and detailed text elements is generally good yet:
    1. Not 100% accurate or complete;
    2. Inconsistent and varied in the inconsistency;

Article Resources

Download the prompts used in this tutorial.

The Video Tutorial

This tutorial is available on YouTube.

Subscribe

Contact Me

I can help you with your:

  1. Zapier Automations;
  2. AI strategy;
  3. Prompt engineering;
  4. Content creation;
  5. Custom GPTs.

I am available for remote freelance work. Please contact me.

Recent Articles

The following articles are of interest:

Natural Language Automation | Build a Knowledge Base from Email | Zapier Central

Beginning to Navigate the "Almost Zone": Exploring Authenticity in AI-Generated Content

Add new comment

Restricted HTML

  • Allowed HTML tags: <a href hreflang> <em> <strong> <cite> <blockquote cite> <code> <ul type> <ol start type> <li> <dl> <dt> <dd> <h2 id> <h3 id> <h4 id> <h5 id> <h6 id>
  • Lines and paragraphs break automatically.
  • Web page addresses and email addresses turn into links automatically.