💡 TLDR: GPT-4 with Vision (GPT-4V) is now rolling out to many ChatGPT Plus users in the US and some other regions! You can instruct GPT-4 to analyze image inputs: GPT-4V incorporates additional modalities, such as images, into large language models (LLMs). Multimodal LLMs will expand the reach of AI from mainly language-based applications to a broad range of brand-new application categories beyond language user interfaces (UIs).
👆 GPT-4V could explain why a picture was funny by describing different parts of the image and their connections. The meme in the picture contains text, which GPT-4V read to inform its answer. However, it made an error: it wrongly claimed the fried chicken in the image was labeled “NVIDIA BURGER” instead of “GPU”.
Still impressive! 🤯 OpenAI’s GPT-4 with Vision (GPT-4V) represents a significant advancement in artificial intelligence, enabling the analysis of image inputs alongside text.
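For coders who want to try this programmatically: image inputs are sent to the Chat Completions API as part of the message content, either as an https URL or as a base64 data URL. Here's a minimal sketch of building such a request payload in Python (the helper function name and the fake image bytes are my own illustration; check OpenAI's docs for the current model name and exact schema):

```python
import base64


def build_vision_message(prompt_text, image_bytes, mime="image/png"):
    """Build a Chat Completions message pairing text with an image.

    The image is embedded as a base64 data URL, one of the formats
    the vision API accepts alongside plain https image URLs.
    """
    b64 = base64.b64encode(image_bytes).decode("utf-8")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt_text},
            {
                "type": "image_url",
                "image_url": {"url": f"data:{mime};base64,{b64}"},
            },
        ],
    }


# Example: ask about a (placeholder) coin photo
msg = build_vision_message("How much money do I have?", b"\x89PNG...fake bytes")
print(msg["content"][0]["text"])
```

You'd then pass `messages=[msg]` to the client's chat completion call with a vision-capable model such as `gpt-4-vision-preview` (the model available at the time of writing).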
Let’s dive into some additional examples I and others encountered:
Prompting GPT-4V with "How much money do I have?" and a photo of some foreign coins:
GPT-4V was even able to identify that these are Polish złoty coins, a task most humans would struggle with:
It can also identify locations from photos and give you information about plants you photograph. In this way, it’s similar to Google Lens, but more interactive and with a deeper level of image understanding.
It can do optical character recognition (OCR) almost flawlessly:
Now here’s why many teachers and professors will lose sleep over GPT-4V: it can even solve math problems from photos (source):
GPT-4V can do object detection, a crucial field in AI and ML: one model to rule them all!
GPT-4V can even help you play poker ♠️♥️
A Twitter/X user gave it a screenshot of a day planner and asked it to code a digital UI for it. The Python code worked!
Speaking of coding, here’s a fun example by another creative developer, Matt Shumer:
"The first GPT-4V-powered frontend engineer agent. Just upload a picture of a design, and the agent autonomously codes it up, looks at a render for mistakes, improves the code accordingly, repeat. Utterly insane." (source)
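The agent Matt describes is essentially a generate-render-critique loop. Here's a hedged Python sketch of that control flow; `generate_code`, `render`, and `find_mistakes` are hypothetical stand-ins you'd implement with GPT-4V calls and a headless browser, not real library functions:

```python
def design_to_code_agent(design_image, generate_code, render, find_mistakes,
                         max_rounds=5):
    """Iteratively turn a design screenshot into frontend code.

    generate_code(image, feedback) -> str    # GPT-4V call (stand-in)
    render(code) -> image                    # e.g. headless-browser screenshot
    find_mistakes(design, rendered) -> list  # GPT-4V visual diff (stand-in)
    """
    feedback = []
    code = generate_code(design_image, feedback)
    for _ in range(max_rounds):
        rendered = render(code)
        feedback = find_mistakes(design_image, rendered)
        if not feedback:  # the render matches the design: we're done
            break
        code = generate_code(design_image, feedback)  # revise with feedback
    return code
```

The `max_rounds` cap matters: vision-based critique won't always converge, so a real agent needs a budget to avoid looping (and billing) forever.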
I’ve even seen GPT-4V analyzing financial data like Bitcoin indicators:
I could go on forever. Here are 20 more ideas of how to use GPT-4V that I found extremely interesting, fun, and even visionary:
- Visual Assistance for the Blind: GPT-4V can describe the surroundings or read out text from images to assist visually impaired individuals.
- Educational Tutor: It can analyze diagrams and provide detailed explanations, helping students understand complex concepts.
- Medical Imaging: Assist doctors by providing preliminary observations from medical images (though not for making diagnoses).
- Recipe Suggestions: Users can show ingredients they have, and GPT-4V can suggest possible recipes.
- Fashion Advice: Offer fashion tips by analyzing pictures of outfits.
- Plant or Animal Identification: Identify and provide information about plants or animals in photos.
- Travel Assistance: Analyze photos of landmarks to provide historical and cultural information.
- Language Translation: Read and translate text in images from one language to another.
- Home Decor Planning: Provide suggestions for home decor based on pictures of users’ living spaces.
- Art Creation: Offer guidance and suggestions for creating art by analyzing images of ongoing artwork.
- Fitness Coaching: Analyze workout or yoga postures and offer corrections or enhancements.
- Event Planning: Assist in planning events by visualizing and organizing space, decorations, and layouts.
- Shopping Assistance: Help users in making purchasing decisions by analyzing product images and providing information.
- Gardening Advice: Provide gardening tips based on pictures of plants and their surroundings.
- DIY Project Guidance: Offer step-by-step guidance for DIY projects by analyzing images of the project at various stages.
- Safety Training: Analyze images of workplace environments to offer safety recommendations.
- Historical Analysis: Provide historical context and information for images of historical events or figures.
- Real Estate Assistance: Analyze images of properties to provide insights and information for buyers or sellers.
- Wildlife Research: Assist researchers by analyzing images of wildlife and their habitats.
- Meme Creation: Help users create memes by suggesting text or edits based on the image provided.
These are truly mind-boggling times. Most of those ideas are million-dollar startup ideas. Some ideas (like the real estate assistance app #18) could become billion-dollar businesses that are mostly built on GPT-4V’s functionality and are easy to implement for coders like you and me.
If you’re interested, feel free to read my other article on the Finxter blog:
📈 Recommended: Startup.ai – Eight Steps to Start an AI Subscription Biz
What About Safety?
GPT-4V is a multimodal large language model that incorporates image inputs, expanding the impact of language-only systems by solving new tasks and providing novel experiences for users. It builds upon the work done for GPT-4, employing a similar training process and reinforcement learning from human feedback (RLHF) to produce outputs preferred by human trainers.
Why RLHF? Mainly to avoid jailbreaking 😢😅 like so:
You can see that the “refusal rate” went up significantly:
From the perspective of an everyday user who isn’t trying to harm anyone, the "Sorry, I cannot do X" reply will remain one of the more annoying parts of LLM tech, unfortunately.
However, the race is on! People have still reported jailbroken queries like this: 😂
I hope you had fun reading this compilation of GPT-4V ideas. Thanks for reading! ♥️ If you’re not already subscribed, feel free to join our popular Finxter Academy with dozens of state-of-the-art LLM prompt engineering courses for next-level exponential coders. It’s an inexpensive, all-you-can-learn way to remain on the right side of change.
For example, this is one of our recent courses:
Prompt Engineering with Llama 2
💡 The Llama 2 Prompt Engineering course helps you stay on the right side of change. Our course is meticulously designed to provide you with hands-on experience through genuine projects.
You’ll delve into practical applications such as book PDF querying, payroll auditing, and hotel review analytics. These aren’t just theoretical exercises; they’re real-world challenges that businesses face daily.
By studying these projects, you’ll gain a deeper comprehension of how to harness the power of Llama 2 using 🐍 Python, 🔗🦜 Langchain, 🌲 Pinecone, and a whole stack of highly ⚒️🛠️ practical tools of exponential coders in a post-ChatGPT world.
Jean is a tech enthusiast with a love for AI and machine learning innovations, particularly LLMs. Beyond contributing insightful articles to our blog, Jean has worked as a Python, Rust, and Go coder for one of the leading tech firms in the world.