ChatGPT 4 and Images

ChatGPT can do some interesting things with images, and can't do some things you'd expect.

Nov 01, 2023

You can send images to ChatGPT as part of your prompt now, and I had a trial I wanted to run. As part of identifying potential customers I’d like to talk to, I’ve found myself taking pictures of bookshelves, and would rather not spend a couple hours writing titles down.

Takeaways

Image prompting is interesting, but really early.
Text extraction works but has a surprisingly high hallucination rate.
Orientation matters - right-side-up text works better than sideways, upside-down text performs the worst.
You can get meaningfully better performance with prompt engineering.
I couldn’t get spacial reasoning (identify the bottom shelf) to work really at all.

First Try: Title Identification

I took this picture at a potential customer’s office1, and wanted to see what ChatGPT could do.

Unfortunately, OpenAI doesn’t support linking conversations with images yet, so there’s going to be a lot of copy paste and tedious numbered lists below.

Can you identify the books in this image?

Of course! Here are the titles of the books I can identify from the image:
"Expecting Better" by Emily Oster
"Cribsheet" by Emily Oster
"Bringing Up Bébé" by Pamela Druckerman
"Hunt, Gather, Parent" by Michaeleen Doucleff, PhD
"Steal Like an Artist" by Austin Kleon
"The Wonder Weeks"
"The Gentle Sleep Book"
"Precious Little Sleep"
"The Simplest Baby Book"
"Baby 411"
There are several other books in the image, but their titles are either partially obscured or not visible enough for me to accurately identify them. If you have any specific questions about any of these books or others in the image, let me know!

10 is kind of a suspicious number, and I can make out way more titles than that pretty easily.

Are you sure there aren't any more you can make out? I can see "Your Baby" quite clearly

I apologize for the oversight. Let me give it another try:
"Expecting Better" by Emily Oster
"Cribsheet" by Emily Oster
"Bringing Up Bébé" by Pamela Druckerman
"Hunt, Gather, Parent" by Michaeleen Doucleff, PhD
"Steal Like an Artist" by Austin Kleon
"The Wonder Weeks"
"The Gentle Sleep Book"
"Precious Little Sleep"
"The Simplest Baby Book"
"Baby 411"
"Your Baby Week By Week"
Thank you for pointing it out. Let me know if there's anything else you'd like to know.

That didn’t do what I wanted at all (and needing to provide a title to encourage it doesn’t really scale). When I was generating conversation prompts, I had to do it 10 at a time or so because there were so many characters in each line item, but that shouldn’t be the case here. Let’s see if it’s paginating:

Are there ten more that you can see in that image?

Certainly! Let's see:
"No Bad Kids: Toddler Discipline Without Shame" by Janet Lansbury
"How to Raise a Healthy Child in Spite of Your Doctor" by Robert S. Mendelsohn M.D.
"How to Talk So Little Kids Will Listen"
"The Baby Sleep Solution"
"The Science of Mom"
"The Sh!t No One Tells You About Pregnancy"
"The Whole-Brain Child"
"Moms on Call Basic Baby Care"
"The Happiest Baby on the Block"
"The New Basics: A-to-Z Baby & Child Care for the Modern Parent"
Please note that while I've tried to identify as many titles as possible, some titles may be abbreviated or based on what's visible in the image. Let me know if you have any other questions!

This response is weird. It’s paginating, but it reset the numbering (something it didn’t do with conversation prompts). Also, most of these books aren’t on the shelf - I can only see 3 out of these 10 books in the image. Looking at the first one, 9 of the 10 books are there but one is a real book that isn’t present.

Improving Responses

Simplifying the Problem

There are bits of low-hanging fruit that I see - the first is not having all these partial books. I cropped the image down to just the complete shelves.

LLMs typically work better with step by step instructions. If it can count the number of books first, then maybe I can ask it for that many titles. I think this step helped, but the number of books returned isn’t consistent:

Using the LLM’s book count to prompt for a number of titles got me the right number of titles back, but way too many of them are were made up.

Asking for Confidence

With some trial and error, I found that I could get better results if I asked it to categorize its responses by confidence:

How many books do you see in this image?

I see approximately 34 books in the image.2

Please respond with the title of each of the 34 books.  If you can read part of it but aren't sure you've got the whole thing, please return your best guess prefaced with the string "Maybe: ".  If you can't read it at all, describe the book or the book closest to it and preface it with "Description: ".

Sure, here's a list of the books based on what I can identify from the image:
"HUNT, GATHER, PARENT"
Maybe: "Write Useful Books"
"STEAL LIKE AN ARTIST"
"YOUR BABY WEEK BY WEEK"
"WHY LOVE MATTERS"
"THE WONDER WEEKS"
Maybe: "A Mind At Home"
Maybe: "RAISING..."
"The GENTLE SLEEP BOOK"
Maybe: "Danish Way of Parenting"
Maybe: "How Toddlers Thrive"
"The Nurtured Heart Approach"
"GRIEF CLUB"
Maybe: "Parenting... from the Inside Out"
"PRECIOUS LITTLE SLEEP"
Maybe: "The Harvard Business Review"
"The SIMPLEST BABY BOOK"
Maybe: "The BABY... 41"
Maybe: "I'll Show Myself Out"
Maybe: "The Newborn Baby Manual"
"BEENY'S RESTORATION NIGHTMARE"
Maybe: "WHAT TO EXPECT"
"THE FIRST 6 WEEKS"
Description: A book with a teal spine next to "THE FIRST 6 WEEKS".
Description: A white book below "THE FIRST 6 WEEKS".
Description: A book with a pink and white spine next to "BEENY'S RESTORATION NIGHTMARE".
Description: A book with a light purple spine next to the above-described book.
Description: A book with a white spine and an image of a child.
Description: A pink book below "The GENTLE SLEEP BOOK".
Description: A book with a white spine and baby icons next to "The SIMPLEST BABY BOOK".
Description: A book with a white spine and green lettering below the above-described book.
Description: A book with a blue spine below "PRECIOUS LITTLE SLEEP".
Description: A book with a white spine and a drawing of a brown object (possibly an insect).
Description: A book with a light green spine below the above-described book.
Note: The "Maybe:" entries indicate that I've made a best guess based on partially visible titles. The "Description:" entries are for those books where the title is not visible at all.

There are 12 titles, 11 maybes, and 11 descriptions. Titles were correct 9/12 times, maybes were correct 6/11 times. Not great, but getting there.

Orientation

Maybe it doesn’t like reading sideways. I tried the same strategy with a rotated image:

This one counted 28 books on the first try, 19/20 of the titles were correct, 5/6 maybes were correct. Much better.

For completeness’ sake, I tried upside-down too:

Counted 27 books, 2 titles, 17 maybes, 8 descriptions. Titles 2/2, maybes 7/17. I’m comfortable saying that orientation matters for text reading.

Cropping & Shelf Identification

So cropping and rotating seem helpful, but I’ve got a bunch of pictures I’d like to process. It’d be great if it could do the cropping for me:

Are you able to provide images in the response? Can you crop the image just to the second shelf?

I'm sorry, but I cannot create or modify images. However, I can guide you on how to crop the image using various software or applications if you'd like. Let me know if you need any assistance with that!

That’s too bad. But if I could get the coordinates, I bet I could pass those coordinates to something else, and could create a Rube-Goldberg machine that does this for me:

Can you provide the bounding coordinates for cropping the second shelf in this image?

Certainly. To determine the bounding coordinates, I'll analyze the provided image to identify the region corresponding to the second shelf.
Here are the bounding coordinates for cropping the second shelf:
Top-left corner: (x1, y1)
Bottom-right corner: (x2, y2)
Analyzing the image...
The bounding coordinates for cropping the second shelf in the image are approximately:
Top-left corner: (x1: 0, y1: 210)
Bottom-right corner: (x2: 1024, y2: 385)
Using these coordinates, you can crop the image to isolate the second shelf. If you're using an image editing software, you can input these coordinates to get the desired crop.

Woo! Now to manually find out what I’ve got.

I tried a few more times with pretty similar results. Finally I became concerned that maybe it can’t see the shelf. Note: this response is edited, and my comments are in [brackets]:

How many book titles can you read on the bottom shelf?

Sure, here's a list of the titles I can discern from the bottom shelf:
"PRECIOUS LITTLE SLEEP" [bottom shelf]
"SIMPLE BABY FOOD RECIPES" [hallucination]
"BABY 411" [bottom shelf]
"HAPPIEST BABY ON THE BLOCK" [second shelf]
"BRINGING UP BÉBÉ" [top shelf, Bebe Day by Day is on the bottom]
"PARENTING FROM THE INSIDE OUT" [hallucination]
Note: Some titles might be inferred from the visible text and context, while others are based on their description or color.

I can read 13 on the bottom. It got 3 from the bottom, 2 from other shelves, and completely made up 2 other books. I’m not completely confident that it can’t see the shelf, but it definitely struggles.

The hallucinations are really interesting - it says it’s guessing because of the “description or color” of the book, but I wonder if it isn’t just tossing in a few more books that might appear on a shelf like this.

I tried a few more times with similar results before trying to give it some help:

Can you provide the bounding coordinates for cropping the second shelf from the bottom in this image?  All of the book titles should be included in the image, starting from the red book "trauma-proofing your kids" to the blue book "gentle sleep", and the white book on top "hunt, gather, parent".

Based on the image you provided, I've determined the approximate bounding coordinates for cropping to capture the second shelf from the bottom. The coordinates will be in the format (x1, y1, x2, y2), where (x1, y1) represents the top-left corner and (x2, y2) represents the bottom-right corner of the desired area.
Here are the bounding coordinates:
(x1, y1) = (0, 480)
(x2, y2) = (900, 925)
These coordinates are approximate and based on the resolution of the image you provided. You can use these coordinates as a starting point and adjust as necessary in an image editing software to achieve the desired crop.

Better, but really not really good enough to use:

The formatting for this seems workable (ChatGPT could provide coordinates, and you’ll be able to get those into JSON). For my purposes, this isn’t a terrible first pass (when it makes stuff up it returns real, related books), but right now the accuracy doesn’t seem good enough for most use cases.

With permission!

34 wasn’t even on my histogram!

Ben Dennis

Nov 1, 2023

1. Re: hallucinations: Interesting how much it struggles to admit when it doesn't know something (Maybe it learned that from us humans too 😭 😭)

2. If I see a flyer on the street for an event I'm interested in or a friend sends me a picture of a virtual flyer, I often want to add these things to my calendar. But, it's tedious to copy down all the details into a calendar event. I wonder if I can take a picture of the flyer and ask ChatGPT to make a link that creates a calendar event? I'll test it out!

Expand full comment

3 replies by Rob and others

3 more comments...

Robbobobbo

Discussion about this post