LLAVA (Large Language and Vision Assistant) is an LLM that combines advanced natural language processing with visual understanding. In this article we will see how to use LLAVA with Ollama to turn images into text, and then explore the potential applications of this model.
What is LLAVA?
The history of LLAVA begins with the development of large language models (LLMs), which have demonstrated an astonishing ability to generate coherent, contextually relevant text. However, the need to combine language understanding with visual perception led to the creation of models such as LLAVA. The ability to draw information from multimedia content as well as text allows the model to provide more complete, context-aware answers.
To give some simple examples of what LLAVA can do, imagine that we want to describe the content of an image or answer questions about what a photograph shows.
Use LLAVA with Ollama
We have already seen how to use Ollama to run LLMs locally. The procedure for LLAVA is the same: first you need to download the model, which weighs approximately 4.7 GB, with the following command:
ollama pull llava
Once LLAVA is downloaded, you can run it with:
ollama run llava
At this point we are ready to run our test. For this purpose we will use the JPEG image below, which shows a mouse eating cheese.
To test the model we will give it a very simple prompt: we ask it to describe the content of the image and provide the filesystem path where we saved the file:
Can you tell me what the following image depicts? <path>/mouse eating cheese.jpg
As soon as the prompt is sent, Ollama will load the image and have it processed by the model. The response is textual and contains a description like the following:
The image shows a cute little mouse with big eyes, standing on its hind legs and holding a wedge of cheese with its front paws. The mouse is next to another wedge of cheese, which it appears to be eating or examining. There's a block of wood behind the mouse, possibly serving as a surface for the cheese. The overall scene looks like a playful or whimsical setup, likely staged for humorous effect.
Use LLAVA via the Ollama REST API
As we have already seen on our blog, Ollama also exposes a REST API that allows you to integrate LLMs with external applications. To pass multimedia files as input, you need to populate the "images" parameter with an array of images encoded in base64:
curl --location 'http://localhost:11434/api/chat' \
--header 'Content-Type: application/json' \
--data '{
  "model": "llava",
  "messages": [
    {
      "role": "user",
      "content": "Can you tell me what the following image depicts?",
      "images": ["<base64>"]
    }
  ],
  "stream": false
}'
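If you do not already have the base64 string, you can generate it directly from the shell and embed it in the request. The following is a minimal sketch that assumes GNU coreutils base64 (the -w0 flag disables line wrapping; on macOS you would use base64 -i <file> instead) and an Ollama server running on the default port:

# Encode the image as a single-line base64 string
IMG_B64=$(base64 -w0 "mouse eating cheese.jpg")

# Send it to the /api/chat endpoint together with the question
curl --location 'http://localhost:11434/api/chat' \
--header 'Content-Type: application/json' \
--data "{
  \"model\": \"llava\",
  \"messages\": [
    {
      \"role\": \"user\",
      \"content\": \"Can you tell me what the following image depicts?\",
      \"images\": [\"$IMG_B64\"]
    }
  ],
  \"stream\": false
}"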
How can the visual capabilities of LLMs be used?
The visual capabilities of LLMs open up a huge number of potential applications. Some simple examples include:
- Converting images into text descriptions;
- Classifying images based on their content;
- Building multimedia search algorithms.
The ability to convert images to text is particularly interesting because it allows you to standardize information sources. For example, a document containing multimedia files can be transformed into plain text, which can then be processed by other models or fed into more complex RAG architectures.
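As a minimal sketch of this kind of pre-processing, the shell loop below converts every image in a folder into a plain-text description by calling the /api/generate endpoint; the folder path and output file names are hypothetical, and the jq utility is assumed to be available for parsing the JSON response:

# Hypothetical batch conversion: one text description per image
for img in docs/images/*.jpg; do
  # Encode the current image as a single-line base64 string
  IMG_B64=$(base64 -w0 "$img")
  # Ask LLAVA for a description and save it next to the image as a .txt file
  curl -s http://localhost:11434/api/generate \
  --header 'Content-Type: application/json' \
  --data "{\"model\": \"llava\", \"prompt\": \"Describe this image in detail.\", \"images\": [\"$IMG_B64\"], \"stream\": false}" \
  | jq -r '.response' > "${img%.jpg}.txt"
done

The resulting text files can then be indexed or embedded like any other document in the pipeline.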
LLAVA offers a powerful combination of linguistic and visual capabilities that opens new frontiers in various sectors. From healthcare and education assistants to technical support and security solutions, the practical applications of this LLM improve the efficiency and effectiveness of daily operations.
As the technology continues to evolve, we can expect to see an increasing number of innovative uses that take full advantage of the potential of AI.