Semantic Search Feature


Problem

Hi there, 2coffee.dev readers. Have you noticed that the autumn atmosphere has become clearer in Hanoi recently? Mornings are cool, and evenings bring strong winds. Behind all that, though, was a busy week for me: during the day I was racing a deadline on my company's project, and in the evenings I tried to finish the search function for my blog. This deadline was different from usual because it was the product's main feature for the year. As for the blog, the search function had to be completed sooner or later, and this was the perfect time to do it.

Before switching to Fresh, my blog already had a search function, built on Postgres's full-text search. For those who don't know, before Postgres I also used RediSearch. Generally speaking, Postgres gave better results, while RediSearch was more complex to work with. In reality, my blog's data wasn't that massive, so RediSearch never had a chance to shine.

When I switched to Fresh, AI was booming. Many people were talking about AI and what it could do. After completing the basic features and getting ready to work on the search function, I thought, "Why not try using AI?" So, I decided to "release" the new blog version without the search function.

To create a search function with AI, I had to spend a lot of time researching and experimenting. I learned how to implement it: how to use LLMs and embedding models, vector data types, how to convert data into vectors, and how to query them...

To put it simply, a vector is a finite, ordered list of numbers, like in mathematics. The number of elements determines the vector's size (dimension): the larger the size, the more faithfully the vector can represent the data behind it. There are many ways to convert regular data (text, speech, images, etc.) into vectors, but thanks to the popularity of LLMs today, you can simply feed the data into an embedding model and it will return vector data.

Semantic search is different from traditional full-text keyword search. Full-text search matches the characters of the query against the text and returns the closest textual matches. Semantic search, on the other hand, is based on the content's meaning. Suppose your article explains how node.js works. When searching for the phrase "node.js hoạt động như thế nào?" ("how does node.js work?"), semantic search can still find the article. Full-text search, on the other hand, will try to find articles containing the words "node.js", "hoạt động", "như", ...
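To make the "closeness of meaning" idea concrete, here is a minimal sketch of cosine distance between two embedding vectors. This is my own illustration, not code from the blog: a distance of 0 means the vectors point the same way (very similar meaning), 1 means unrelated, 2 means opposite.

```typescript
// Cosine distance between two embedding vectors.
// 0 = same direction, 1 = orthogonal, 2 = opposite direction.
function cosineDistance(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```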

To query vector data, you need at least two steps. First, convert the query into a vector, then use the query functions. For example, with pg-vector - a Postgres extension that supports vectors - there are query functions like:

[Image: pg-vector query functions - L2 distance, Cosine distance, L1 distance, ...]

You can see L2 distance, Cosine distance, L1 distance... as vector comparison methods. Depending on the use case, you choose the corresponding query type. For the search problem, I chose the Cosine distance method - that is, ranking by how closely the two vectors point in the same direction.
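As a hedged sketch, a cosine-distance query with pg-vector uses the `<=>` operator (`<->` is L2 distance, `<#>` negative inner product, `<+>` L1 distance). The table and column names here (`articles`, `embedding`) are hypothetical, not the blog's actual schema:

```typescript
// Build a parameterized pg-vector cosine-distance query.
// $1 would be bound to the query vector, e.g. '[0.1, 0.2, ...]'.
function buildCosineSearchQuery(limit: number): string {
  return `
    SELECT id, title, embedding <=> $1 AS distance
    FROM articles
    ORDER BY embedding <=> $1
    LIMIT ${limit}
  `.trim();
}
```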

How to do it

Flow

First, choose a suitable database. I'm using Turso as my main database. However, Turso is based on SQLite, which isn't optimized for vector data. Although they introduced an extension to support vectors, it's a bit complicated.

pg-vector is the opposite. It's widely used and is a Postgres extension. When it comes to Postgres, I think of Supabase, which offers free usage. Supabase has pg-vector integrated, and activation is just a click away, making it a great choice.

Next is choosing the models. To save costs, I looked for free models from the start. The obvious mention is groq with its Completions API. However, groq doesn't offer embedding models, so I had to find another provider.

nomic-embed-text is an embeddings model I found in Ollama's library. It can vectorize text. Additionally, Nomic provides a free embeddings API with limitations. However, I should remind you that Nomic isn't a multilingual model. It supports Vietnamese to a limited extent, so the generated vector might not be optimal for Vietnamese semantics.

After preparing everything, it's time to write code to add vector data and search logic.

First, convert the article content into a vector and store it in Supabase. Instead of converting the entire article content, I summarize the main content of the article before feeding it into nomic-embed-text. This helps remove unnecessary information and reduce the input token count for the model to process.
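The summarize-then-embed pipeline above can be sketched as follows. This is my own illustration: the two model calls are injected as functions, where in practice `summarize` would call an LLM and `embed` would call nomic-embed-text; both names are placeholders, not real API bindings.

```typescript
type Summarize = (text: string) => Promise<string>;
type Embed = (text: string) => Promise<number[]>;

// Preprocess an article before storing its vector in the database.
async function articleToVector(
  content: string,
  summarize: Summarize,
  embed: Embed,
): Promise<number[]> {
  // 1. Condense the article so the vector isn't "diluted" by side content
  //    and the embedding model processes fewer input tokens.
  const summary = await summarize(content);
  // 2. Turn the summary into an embedding vector for storage.
  return embed(summary);
}
```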

Another note: although these models have free APIs, those APIs always come with rate limits. Processing the data for the first time is the expensive part, as I have over 400 articles in both Vietnamese and English. A better approach is to run the Llama 3.2 3B and nomic-embed-text models locally. I use LM Studio for this.

The search logic is simple: take the user's query -> pass it through nomic-embed-text to convert it into a vector -> run a cosine-distance query against the stored article vectors and sort by the smallest distance.
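The flow can be sketched end to end like this (my own illustration: `embed` stands in for the nomic-embed-text call, and the stored vectors would really come from the database rather than memory):

```typescript
interface StoredArticle { id: number; vector: number[] }

// Cosine distance: 0 = same direction, 2 = opposite.
function cosineDistance(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Embed the query, then rank articles by cosine distance (closest first).
async function semanticSearch(
  query: string,
  articles: StoredArticle[],
  embed: (text: string) => Promise<number[]>,
): Promise<StoredArticle[]> {
  const q = await embed(query);
  return [...articles].sort(
    (a, b) => cosineDistance(q, a.vector) - cosineDistance(q, b.vector),
  );
}
```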

However, if the user searches for short keywords like node.js, javascript, etc., semantic search will likely return nothing: the input is too short, the generated vector doesn't carry enough meaning, and the cosine distance ends up too large. To handle this case, I need to keep a full-text search mechanism as well. Fortunately, Supabase supports this type of search.
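The fallback decision can be sketched as below. This is an illustration rather than the blog's actual code, and the 0.5 cutoff is an arbitrary placeholder threshold:

```typescript
interface Hit { id: number; distance: number }

// If the best semantic matches are still too far away (short keyword
// queries produce weak vectors), fall back to full-text results.
function chooseResults(
  semanticHits: Hit[],
  fulltextHits: Hit[],
  maxDistance = 0.5,
): Hit[] {
  const good = semanticHits.filter((h) => h.distance <= maxDistance);
  return good.length > 0 ? good : fulltextHits;
}
```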

Challenges

Looking back, it seems simple, but the most challenging part for me was the data preprocessing steps.

An article usually conveys multiple ideas, both main and secondary. Typically, searchers care only about the main content and search for things related to it. If I convert the entire article into a vector, the result gets "diluted" or "noisy" because the vector's size is limited. If I can strip out the secondary information and emphasize the main idea, the search should be more accurate. Imagine an article of 1,500 words converted into a 1024-dimensional vector, versus the same article's 500 words of main content converted into a vector of the same size. Which one represents the data more "clearly"?

Users' search patterns are also hard to predict because everyone searches differently. Some people like to keep it short, while others like to write longer or provide context for their questions... Therefore, processing user input data is also a challenge. How can I convert it into a concise and relevant query that matches the search content on the blog?

The quality of the AI model used is also an issue. Generally, the more a model is trained, the better it is, and commercial models come with quality assurance. However, to minimize costs, I'm currently using free LLM models with limitations. Hopefully, one day I'll be able to integrate more powerful models to improve the search quality for my blog.
