The following situation often occurs when I finally find the time to relax and play couch potato: I’m in the mood to watch a movie but I don’t quite know what to watch. I open my favourite streaming app and instantly I’m overwhelmed by all the content I don’t want to watch. The browsing begins. In most streaming apps I either need to know exactly what I’m looking for or I need to browse through a wide range of predefined categories that are by no means helpful. Most of the time I browse for a while and simply give up. How could this experience be improved? Semantic search to the rescue!
Semantic Search vs Keyword Based Search
Semantic search leverages the power of a language model to interpret the semantics of words and phrases. Whereas traditional keyword based search methods rely on at least part of the query appearing verbatim in the data, semantic search aims to take contextual meaning and intent into account. In other words, it can find relevant results even if the search query does not contain any keywords that match the data. How does this work?
The data to be searched is transformed into so-called vector embeddings. In simple terms, these are numerical representations of data points that capture their inherent characteristics and relationships with other data points, allowing for operations such as similarity comparisons and classification. Within the context of semantic search the underlying data is by definition textual, but images, audio or video can be vectorized as well.
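To make the idea of a similarity comparison concrete, here is a minimal sketch using cosine similarity, one common way to compare embeddings. The three-dimensional vectors and their labels are made up for illustration; a real model such as all-MiniLM-L6-v2 produces 384-dimensional vectors.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy "embeddings" (assumption: not produced by any real model).
embeddings = {
    "war drama": [0.9, 0.1, 0.0],
    "battlefield epic": [0.8, 0.2, 0.1],
    "romantic comedy": [0.0, 0.2, 0.9],
}

# Rank the other entries by similarity to the query vector, most similar first.
query = embeddings["war drama"]
ranked = sorted(
    (name for name in embeddings if name != "war drama"),
    key=lambda name: cosine_similarity(query, embeddings[name]),
    reverse=True,
)
print(ranked)
```

Vectors that point in a similar direction score close to 1, unrelated ones close to 0, which is what lets a search rank by meaning instead of by matching characters.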
To demonstrate the power of semantic search, I created an example that does a side-by-side comparison of (admittedly very basic) implementations of keyword and semantic search on the same dataset. The dataset contains metadata for the IMDb top 1000, including a short description of each storyline. These descriptions are the basis for my search implementation.
To set up semantic search functionality, I use the ChromaDB Python library in order to:
- Transform each description into vector embeddings (using the all-MiniLM-L6-v2 model)
- Create a persistent collection to store those embeddings
- Query the collection with a word or phrase for results
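The three steps above could look roughly like the sketch below. It is not the original implementation: the sample rows, the storage path, the collection name and the helper names are all placeholders. It assumes a recent chromadb release that provides PersistentClient, and it relies on ChromaDB's default embedding function, which its documentation describes as all-MiniLM-L6-v2.

```python
# Placeholder rows (assumption); the real dataset holds 1000 entries.
SAMPLE_MOVIES = [
    {"id": "1", "title": "Placeholder Film A",
     "overview": "A soldier tries to survive the fighting on the Eastern Front."},
    {"id": "2", "title": "Placeholder Film B",
     "overview": "Two strangers fall in love during a summer in Paris."},
]


def build_collection(movies, path="./chroma_db", name="movies"):
    """Embed each overview and store it in a persistent collection."""
    import chromadb  # third-party dependency: pip install chromadb

    client = chromadb.PersistentClient(path=path)
    collection = client.get_or_create_collection(name=name)
    # upsert embeds the documents with the collection's embedding function
    # and can be re-run safely without creating duplicates.
    collection.upsert(
        ids=[m["id"] for m in movies],
        documents=[m["overview"] for m in movies],
        metadatas=[{"title": m["title"]} for m in movies],
    )
    return collection


def semantic_search(collection, query, limit=10):
    """Return (title, distance) pairs for the closest descriptions."""
    # Never ask for more results than the collection holds.
    n_results = min(limit, collection.count())
    result = collection.query(query_texts=[query], n_results=n_results)
    # query() returns one inner list per query text; we sent exactly one.
    titles = [meta["title"] for meta in result["metadatas"][0]]
    return list(zip(titles, result["distances"][0]))


# Usage (the first run downloads the embedding model):
# collection = build_collection(SAMPLE_MOVIES)
# for title, distance in semantic_search(collection, "world war 2"):
#     print(f"{distance:.3f}  {title}")
```

Storing the title as metadata keeps the query result readable without a second lookup in the original dataset.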
I limited both search methods to a maximum of ten results per query. When I use the phrase ‘world war 2’ the following results pop up:
The keyword search does not find any results, even though I know there are movies in the top 1000 that are set in the Second World War. This simply means that this exact combination of words does not appear in any of the descriptions. To be fair, the keyword search implementation is rather basic: it only looks for matches that exactly correspond with the query. In a real-world application, this could be improved by introducing other capabilities, like matching individual words in a query or compensating for typos.
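As an illustration of the difference, the sketch below contrasts an exact phrase match with a slightly more forgiving variant that matches individual words. Both functions and the sample descriptions are hypothetical; the exact-match version is only an assumption about how such a basic implementation might behave (case-insensitive substring matching).

```python
import re


def exact_phrase_search(descriptions, query, limit=10):
    """Keep descriptions that contain the whole query verbatim."""
    needle = query.lower()
    return [d for d in descriptions if needle in d.lower()][:limit]


def any_word_search(descriptions, query, limit=10):
    """Keep descriptions sharing at least one word with the query,
    ordered by how many distinct query words they contain."""
    words = set(query.lower().split())
    scored = []
    for description in descriptions:
        tokens = set(re.findall(r"\w+", description.lower()))
        hits = len(words & tokens)
        if hits:
            scored.append((hits, description))
    # sort() is stable, so ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [description for _, description in scored][:limit]


# Made-up descriptions for illustration only.
descriptions = [
    "During the Second World War, a pilot is shot down over France.",
    "A war photographer returns home.",
    "Two strangers fall in love in Paris.",
]

print(exact_phrase_search(descriptions, "world war 2"))
print(any_word_search(descriptions, "world war 2"))
```

The word-level variant recovers the first two descriptions, but it still depends on literal overlap: a description that mentions neither "world" nor "war" would remain invisible to it.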
The semantic search, on the other hand, nicely handed me a list of ten movies. In this case, all the results are relevant. But much like ChatGPT will always give you an answer, semantic search will always return the specified number of results if no other boundaries are set. This can include false positives that make the list only because no other data was deemed more relevant. Although ChromaDB doesn’t support this out-of-the-box, you could limit false positives by applying a distance threshold to the results.
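One way to apply such a threshold is to filter the query result after the fact. The sketch below assumes a result dictionary shaped like the one ChromaDB's query returns for a single query text (parallel "ids" and "distances" lists, lower distance meaning closer). The helper name, the sample values and the cut-off of 1.0 are illustrative assumptions; a sensible cut-off depends on the distance metric and the data, and has to be found by experiment.

```python
def filter_by_distance(result, max_distance):
    """Keep only (id, distance) pairs at or below max_distance.

    `result` is expected to hold one inner list per query text,
    and only the first query is inspected.
    """
    kept = []
    for doc_id, distance in zip(result["ids"][0], result["distances"][0]):
        if distance <= max_distance:
            kept.append((doc_id, distance))
    return kept


# Hand-written stand-in for a query result (not real output).
sample_result = {
    "ids": [["12", "87", "430"]],
    "distances": [[0.62, 0.95, 1.48]],
}

print(filter_by_distance(sample_result, max_distance=1.0))
```

With this filter in place a query can return fewer than ten results, or none at all, instead of padding the list with weak matches.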
As the example above shows, semantic search can be very useful to generate recommendations for articles, images and videos that are similar.
Likewise, online stores could benefit from semantic search through easier navigation and better suggestions for potential buyers, especially if the product catalogue contains products from multiple sources where data is potentially unstructured instead of neatly normalised and categorised.
Another area that could benefit from semantic search is document retrieval within organisations. Often, documents are scattered across various systems, making traditional searches challenging. Semantic search offers a more efficient and intuitive solution, significantly enhancing the accessibility of relevant information.
This could even serve as a stepping stone towards training an AI model using an organisation’s knowledge base as training data.
Semantic search offers substantial advantages by taking into account the context and intent of user queries, which can deliver more accurate and relevant results and improve information retrieval across various domains. However, its reliance on language models brings complexity and computational demands that pose challenges. On the flip side, keyword based searches are simple to implement and may suffice in many situations. Both methods have merits, and leveraging their strengths can optimise the user experience without unnecessary complexity.