When we think of AI models, we often think of classic chatbot assistants, capable of understanding natural language and generating increasingly human-like responses. However, the most advanced models have greatly expanded their scope, thanks to the ability to process a wide range of multimedia formats, such as images, audio, and video. Among the most fascinating skills, computer vision occupies a prominent place, allowing AI models to analyze and understand the visual content of images.
One of the most striking examples of these advanced capabilities is OpenAI’s GPT model. Originally known for its ability to understand and generate text, GPT has evolved its skills to include image analysis. This means that, in addition to answering textual questions, GPT can now examine the visual content of images, identify objects, describe scenes, and even interpret emotions and contexts.
This ability to understand the content of images opens up new frontiers for numerous practical applications. For example, it can be used to improve visual assistance systems for the blind, automate content moderation in social platforms, or enhance visual search capabilities in search engines and e-commerce catalogs. In addition, the combination of text and visual analysis allows GPT to provide highly relevant and contextualized answers, significantly increasing the effectiveness and precision of interactions.
The integration of these multimedia capabilities into AI models is not only a technological breakthrough, but also redefines the way we interact with machines, making user experiences more intuitive, complete and satisfying. As we will see in this article, OpenAI’s vision APIs offer powerful tools to exploit these innovations, allowing developers to create increasingly intelligent and versatile applications.
Tutorial prerequisites
Before starting, the following prerequisites must be met:
- Have an OpenAI account and a valid API key (define the OPENAI_API_KEY environment variable);
- Install Python 3.7 or higher;
- Save an image to analyze locally (in our case we used the cover image of this article, which we saved as image.webp).
Python implementation
Image conversion to Base64
To send an image to the OpenAI API, we must first convert it to Base64 format. This format allows you to represent binary data as text strings, making it easier to send via HTTP requests.
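To get a feel for what Base64 encoding does, here is a tiny, self-contained example (independent from the tutorial code) that encodes a few arbitrary bytes and prints the resulting text string:

import base64

raw = b"\x89PNG"                  # a few arbitrary bytes
encoded = base64.b64encode(raw)   # Base64-encoded bytes, e.g. b'iVBORw=='
text = encoded.decode("utf-8")    # plain string, safe to embed in a JSON payload
print(text)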
For this purpose we can use this very simple Python class:
import base64

class ImgToBase64Converter:
    """Reads an image file and stores its Base64-encoded representation."""

    def __init__(self, image_path: str):
        # Open the file in binary mode and encode its bytes as a Base64 text string
        with open(image_path, "rb") as image_file:
            self.base64: str = base64.b64encode(image_file.read()).decode('utf-8')

    def get_base64(self) -> str:
        return self.base64
Here are some comments to understand how it works:
- First of all, you need to read the file containing the image using the with open(image_path, "rb") as image_file statement, which opens the file in binary reading mode ("rb"). Using the with construct ensures that the file is closed properly after being read.
- The cornerstone of the program is the base64.b64encode function, which encodes the file's raw bytes in Base64; calling decode('utf-8') on the result turns those bytes into a plain text string that can be embedded in a JSON payload.
This class has been saved in a converter.py module that we will use in the next step.
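Before moving on, a quick standalone check (assuming image.webp is in the current working directory) can confirm that the module behaves as expected:

from converter import ImgToBase64Converter

# Encode the local image and inspect the result: we expect a (very long) string
encoded = ImgToBase64Converter("image.webp").get_base64()
print(type(encoded), len(encoded))
print(encoded[:60] + "...")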
Invoking the vision API
At this point, let’s see how to invoke the vision API using the GPT-4o-mini model:
import os, requests
from converter import ImgToBase64Converter

# Encode the local image as a Base64 string
base64_image = ImgToBase64Converter("image.webp").get_base64()

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"
}

payload = {
    "model": "gpt-4o-mini",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Can you describe the content of the image?"
                },
                {
                    "type": "image_url",
                    "image_url": {
                        # The MIME type should match the file we encoded (a WebP image)
                        "url": f"data:image/webp;base64,{base64_image}"
                    }
                }
            ]
        }
    ],
    "max_tokens": 300
}

res = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=payload)
print(res.json()['choices'][0]['message']['content'])
- To authenticate our request to the OpenAI APIs, we need to include the API key in the request headers. We have therefore used the os.environ mapping to retrieve the value of the related environment variable, and we have specified the content type as application/json.
- The request payload contains the model to use, the messages to send, and other parameters such as the maximum number of tokens to generate. In this case, we ask the model to describe the content of the image.
- We have defined the model to use (gpt-4o-mini) and we have sent two types of content: a text that asks to describe the image and the image itself in Base64 format.
- Using the requests library it is possible to send a POST request to the OpenAI APIs. The response contains the text generated by the model, which describes the content of the image; a minimal sketch of how to check the response for errors before parsing it is shown right after this list.
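As a complement to the snippet above, here is a minimal, defensive sketch of the final call: it reuses the headers and payload defined earlier and simply checks the HTTP status code before parsing the JSON (the API reports failures such as invalid keys or rate limits in the response body):

import requests

# Sketch: same request as above, with a timeout and a basic status check.
# Assumes `headers` and `payload` are defined exactly as in the previous snippet.
res = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers=headers,
    json=payload,
    timeout=60,  # avoid hanging indefinitely on network problems
)

if res.status_code != 200:
    # On failure the API returns a JSON body describing the error
    raise RuntimeError(f"OpenAI API error {res.status_code}: {res.text}")

print(res.json()['choices'][0]['message']['content'])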
Running the code, we can verify the excellent level of understanding that the model is able to achieve. For example, in our case we received the following response:
The image depicts a famous painting, likely the Mona Lisa, displayed in an art gallery. In front of the painting, there are three small, round robots, seemingly observing the artwork. The gallery walls are adorned with other framed artworks, creating an artistic and cultural ambiance. The scene combines traditional art with futuristic technology, illustrating a unique interaction between robots and classic art.
In this tutorial, we saw how to convert an image to Base64, set up a request to the OpenAI Vision API, create the payload with the model and messages, and finally send the request and handle the response. By following these steps, you can easily integrate OpenAI’s computer vision capabilities into your Python projects, enabling advanced image analysis and enhancing the functionality of your applications.