We’ve been busy experimenting in our Generative AI Playground again! After exploring use cases for multi-agent GPT systems, function calling, and report generation, we had another use case in mind we wanted to experiment with: AI-enhanced (visual) search. In this article, we share an exciting demo in which we combined Google’s newest multimodal AI model, Gemini, with an Elastic database (thanks to our partner, Elk Factory). Let’s dive into part 4 — visual search.
Platforms like Google Images have been around for years, allowing users to search for images based on keywords. Think of your favorite webshop, where your preferred item (hopefully) pops up when you search for a certain term in the search bar. Typically, setting up such search functionalities involves manually tagging each image or product with relevant keywords and metadata, a time-consuming and sometimes inaccurate process.
So, what if you could search for images without these predefined tags? What if you could simply describe what you’re looking for in natural language, and the system could understand and retrieve exactly that, without any human intervention?
Imagine: your company has released thousands of products in the past decade. The only problem: you don’t have a structured database of all product packaging and its specific information. And you need it. Traditional methods would falter here — it would simply take too much time and effort to create such a database — but our approach transforms this challenge into a seamless experience.
For demonstration purposes, we decided to use a database of 16,000 clothing product images (shirts, hoodies, blouses, …). For these 16,000 images, we set out to automate the tagging process with Gemini’s multimodal capabilities: for each image, Gemini generates a rich textual description along with structured attributes such as type, color, pattern, and any printed text.
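To make that tagging step concrete, here is a minimal sketch of how a single product image could be tagged, assuming the google-generativeai Python SDK and a gemini-pro-vision-style multimodal model; the prompt, model name, and field names are illustrative choices on our part, not the exact setup of the demo:

```python
# Minimal sketch: auto-tagging one product image with Gemini.
# Model name, prompt, and output fields are illustrative assumptions.
import json

import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # placeholder key
model = genai.GenerativeModel("gemini-pro-vision")

PROMPT = (
    "Describe this clothing product and answer with JSON only, using the keys: "
    "description, type, color, pattern, printed_text, has_logo."
)

def tag_image(path: str) -> dict:
    """Send one image to Gemini and parse the returned JSON tags."""
    image = Image.open(path)
    response = model.generate_content([PROMPT, image])
    # In practice the response text may need light cleanup before parsing.
    return json.loads(response.text)

tags = tag_image("products/shirt_00001.jpg")  # hypothetical file
print(tags["type"], tags["color"], tags["description"])
```

Running a loop like this over the full image set is what replaces the manual tagging work.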
The result? A robust visual database that resides in Elastic, ready for querying, with minimal human effort.
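As an illustration of what such a database could look like, here is a sketch of a hypothetical "products" index created with the elasticsearch Python client; the field names, and the sparse_vector field that holds the ELSER output, are assumptions on our part:

```python
# Sketch of a possible Elastic index for the Gemini-generated tags.
# Field names and types are our assumptions, not the demo's exact schema.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

es.indices.create(
    index="products",
    mappings={
        "properties": {
            "image_url":    {"type": "keyword"},
            "description":  {"type": "text"},     # free text from Gemini
            "type":         {"type": "keyword"},  # e.g. "t-shirt"
            "color":        {"type": "keyword"},
            "pattern":      {"type": "keyword"},
            "printed_text": {"type": "text"},
            # ELSER v2 token weights for the description, filled at ingest time
            "description_expanded": {"type": "sparse_vector"},
        }
    },
)
```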
Once the data is embedded in Elastic, the real magic begins. For this demo, we wanted to explore the capabilities of both Elastic and Gemini when it came to search. So, we tested out two approaches:
For our first approach, we use Elastic’s ELSER v2 model to create embeddings from the textual descriptions, allowing for nuanced search capabilities. The system embeds the user’s search query and finds the closest matches based on the image descriptions. For the second approach, we send the user query to Gemini, which transforms it into structured search fields that are then used to retrieve the matching images. Both approaches work remarkably well and return results extremely fast.
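The first approach could be sketched as follows, assuming ELSER v2 is deployed under its default model id and the descriptions were expanded into a "description_expanded" field at ingest time:

```python
# Approach 1, sketched: semantic search over the Gemini-generated descriptions
# with ELSER v2. Index and field names match the hypothetical mapping above.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

response = es.search(
    index="products",
    query={
        "text_expansion": {
            "description_expanded": {
                "model_id": ".elser_model_2",
                "model_text": "red striped t-shirt with large text on the front",
            }
        }
    },
    size=5,
)

for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["image_url"])
```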
The result is a system where a user can describe what they want in natural language with a high level of detail, specifying parameters such as type, color, text size, actual text, pattern, logo, and so on. The system, in turn, accurately returns only the relevant items. Think of queries like “I am looking for a red striped t-shirt with large text on the front.”
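The second approach could look something like the sketch below, with Gemini extracting those structured fields from the query before a regular bool query runs against the same hypothetical index; the prompt and field names are again our own:

```python
# Approach 2, sketched: Gemini turns the natural-language query into structured
# filters, which then drive a plain bool query. Prompt and fields are assumptions.
import json

import google.generativeai as genai
from elasticsearch import Elasticsearch

genai.configure(api_key="YOUR_API_KEY")  # placeholder key
llm = genai.GenerativeModel("gemini-pro")  # text-only model is enough here
es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

PROMPT = (
    "Extract search filters from this request and answer with JSON only, "
    "using the keys type, color, pattern, printed_text (null if absent): "
)

def structured_search(user_query: str):
    fields = json.loads(llm.generate_content(PROMPT + user_query).text)
    filters = [
        {"term": {name: value}}
        for name, value in fields.items()
        if value is not None and name in ("type", "color", "pattern")
    ]
    if fields.get("printed_text"):
        filters.append({"match": {"printed_text": fields["printed_text"]}})
    return es.search(index="products", query={"bool": {"filter": filters}})

results = structured_search(
    "I am looking for a red striped t-shirt with large text on the front"
)
for hit in results["hits"]["hits"]:
    print(hit["_source"]["image_url"])
```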
As a bonus, these search queries work seamlessly in a multilingual environment. For example, one could just as easily say, "Cerco una maglietta a righe rosse con un testo grande sul davanti" (the same red-striped-t-shirt query, in Italian) and get the same results!
This experiment is not just about showcasing AI’s capability to enhance image search. It’s about showing businesses they can now unlock historical data, provide richer customer experiences, and streamline their operations—all through the power of AI. To show the possibilities, we translated this experiment into some other use cases. Of course, this is a non-exhaustive list of what we could create:
Tackling visual search this way changes the way we set up search systems for e-commerce and other use cases. By using multimodal LLMs to our advantage, we can skip a lot of manual work and set up a flow that is even more powerful than it was before. As we continue exploring different technologies in the Generative AI Playground, we're excited about the endless possibilities these technologies bring to our lives. On to the next!
Written by
Daphné Vermeiren