Semantic Search Feature


Problem

Hi there, 2coffee.dev readers. Have you noticed the autumn atmosphere becoming clearer in Hanoi lately? Mornings are cool, and evenings come with strong winds. Behind that, though, was a busy week for me. By day I was racing to meet the deadline on my company's project, and in the evenings I tried to finish the search feature for my blog. This deadline was different from usual because it concerned the product's main feature of the year. As for the blog, the search feature had to be completed sooner or later, and this was the perfect time to do it.

Before switching to Fresh, my blog already had a search feature, built on Postgres's full-text search. For those who don't know, before Postgres I also used RediSearch. Generally speaking, Postgres gave better results, while RediSearch was more complex. In reality, though, my blog's data wasn't that massive, so RediSearch never had a chance to shine.

When I switched to Fresh, AI was booming. Many people were talking about AI and what it could do. After completing the basic features and getting ready to work on the search function, I thought, "Why not try using AI?" So, I decided to "release" the new blog version without the search function.

To build a search feature with AI, I had to spend a lot of time researching and experimenting. I learned how to implement it: how to use LLMs and embedding models, vector data types, how to convert data into vectors, and how to query them...

To put it simply, a vector is a fixed-length list of numbers, just like in mathematics. The number of elements determines the vector's size (dimension). The larger the size, the better the vector can capture the data it represents. There are many ways to convert ordinary data (text, speech, images, etc.) into vectors, but thanks to the popularity of LLMs today, you can simply feed the data into an embedding model and it will return vector data.
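
For illustration, here is a minimal sketch of that step in TypeScript, assuming a local Ollama server with nomic-embed-text pulled (Ollama's embeddings endpoint returns the vector directly):

```ts
// Minimal sketch: turn text into a vector with an embedding model.
// Assumes a local Ollama server (default port 11434) with `nomic-embed-text` pulled.
async function embed(text: string): Promise<number[]> {
  const res = await fetch("http://localhost:11434/api/embeddings", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "nomic-embed-text", prompt: text }),
  });
  if (!res.ok) throw new Error(`embedding request failed: ${res.status}`);
  const { embedding } = await res.json();
  return embedding; // nomic-embed-text produces a 768-dimensional vector
}

console.log((await embed("How does node.js work?")).length); // 768
```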

Semantic search differs from traditional full-text keyword search. Full-text search matches the entered characters against stored text and returns the closest matches. Semantic search, meanwhile, works on meaning. Suppose an article explains how Node.js works. When searching for the phrase "node.js hoạt động như thế nào?" ("how does node.js work?"), semantic search can still find the article. Full-text search, on the other hand, would try to find articles containing the words "node.js", "hoạt động", "như", ...

To query vector data, you need at least two steps. First, convert the query into a vector, then use the query functions. For example, with pg-vector - a Postgres extension that supports vectors - there are query functions like:

[Image: pg-vector query functions - L2 distance, cosine distance, L1 distance, ...]

You can see L2 distance, cosine distance, L1 distance... as vector comparison methods. Depending on the use case, you choose the query type accordingly. For the search problem, I chose cosine distance - roughly, how similar the directions of the two vectors are.
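
As a sketch of the second step, here is what a cosine-distance query could look like with the postgres client, assuming a hypothetical posts table with an embedding vector column (the <=> operator returns cosine distance; <-> is L2 and <+> is L1):

```ts
// Sketch: nearest-neighbour query by cosine distance with pg-vector.
// The `posts` table and `embedding` column are illustrative assumptions.
import postgres from "npm:postgres";

const sql = postgres(Deno.env.get("DATABASE_URL")!);

async function nearestPosts(queryVector: number[], limit = 5) {
  // pg-vector parses the '[1,2,3]' string form when cast to ::vector
  return await sql`
    SELECT id, title, embedding <=> ${JSON.stringify(queryVector)}::vector AS distance
    FROM posts
    ORDER BY distance
    LIMIT ${limit}
  `;
}
```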

How to do it

[Image: search feature flow]

First, choose a suitable database. I'm using Turso as my main database. However, Turso is based on SQLite, which isn't optimized for vector data. Although they introduced an extension to support vectors, it's a bit complicated.

pg-vector is the opposite: it's a widely used Postgres extension. When it comes to Postgres, I think of Supabase, which has a free tier. Supabase ships with pg-vector integrated, and enabling it is just a click away, making it a great choice.
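
For reference, that one-click activation corresponds to roughly the SQL below, which you could also run in Supabase's SQL editor. The posts table and its 768-dimension column (matching nomic-embed-text's output) are my illustrative assumptions:

```ts
// One-time setup sketch: enable pg-vector and create a table for article vectors.
import postgres from "npm:postgres";

const sql = postgres(Deno.env.get("DATABASE_URL")!);

await sql`CREATE EXTENSION IF NOT EXISTS vector`;
await sql`
  CREATE TABLE IF NOT EXISTS posts (
    id bigint PRIMARY KEY,
    title text NOT NULL,
    summary text,
    embedding vector(768) -- nomic-embed-text outputs 768 dimensions
  )
`;
```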

Next comes choosing the models. To save costs, I looked for free models from the start. Groq, with its Completions API, is the obvious mention here. However, Groq doesn't offer embedding models, so I had to find another one.

nomic-embed-text is an embedding model I found in Ollama's library. It can vectorize text. Additionally, Nomic provides a free embeddings API, with limitations. However, I should point out that nomic-embed-text isn't a multilingual model. It supports Vietnamese only to a limited extent, so the generated vectors may not capture Vietnamese semantics well.

After preparing everything, it's time to write code to add vector data and search logic.

First, convert the article content into a vector and store it in Supabase. Instead of converting the entire article content, I summarize the main content of the article before feeding it into nomic-embed-text. This helps remove unnecessary information and reduce the input token count for the model to process.
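
Put together, the indexing pipeline could look like the sketch below. The Groq model name, the prompt, and the table layout are assumptions on my part; Groq's Chat Completions endpoint is OpenAI-compatible:

```ts
// Sketch of the indexing pipeline: summarize -> embed -> store in Supabase.
import { createClient } from "npm:@supabase/supabase-js";

const supabase = createClient(
  Deno.env.get("SUPABASE_URL")!,
  Deno.env.get("SUPABASE_SERVICE_KEY")!,
);

async function summarize(content: string): Promise<string> {
  const res = await fetch("https://api.groq.com/openai/v1/chat/completions", {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${Deno.env.get("GROQ_API_KEY")}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "llama-3.1-8b-instant", // illustrative; use any model Groq offers
      messages: [
        { role: "system", content: "Summarize the article, keeping only its main ideas." },
        { role: "user", content },
      ],
    }),
  });
  const data = await res.json();
  return data.choices[0].message.content;
}

async function indexArticle(id: number, title: string, content: string) {
  const summary = await summarize(content);
  const embedding = await embed(summary); // embed() from the earlier sketch
  await supabase.from("posts").upsert({ id, title, summary, embedding });
}
```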

Another note is that although these models have free APIs, they always come with limitations. Processing data for the first time is very expensive, as I have over 400 articles in both Vietnamese and English. A better approach is to run the Llama 3.2 3B and nomic-embed-text models locally. I use LM Studio for this.
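
LM Studio exposes an OpenAI-compatible server on localhost (port 1234 by default), so the batch run only needs the base URL swapped. A sketch; the model identifier is whatever name LM Studio shows for the loaded model, an assumption here:

```ts
// Sketch: the same embedding step against LM Studio's local OpenAI-compatible
// server, so the one-off batch over 400+ articles costs nothing.
async function embedLocal(text: string): Promise<number[]> {
  const res = await fetch("http://localhost:1234/v1/embeddings", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "text-embedding-nomic-embed-text-v1.5", // identifier as shown in LM Studio
      input: text,
    }),
  });
  const data = await res.json();
  return data.data[0].embedding; // OpenAI-style response shape
}
```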

The search logic is simple: take the user's query -> pass it through nomic-embed-text to convert it into a vector -> run a cosine-distance query against the article vectors and sort by the closest distance.
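
In code, a common Supabase pattern is to wrap the distance query in a Postgres function and call it over RPC. A sketch, where match_posts is a hypothetical function created alongside the table:

```ts
// Sketch of the search logic. `match_posts` is a hypothetical Postgres function
// that runs roughly: SELECT ..., embedding <=> query_embedding AS distance
//                    FROM posts ORDER BY distance LIMIT match_count;
import { createClient } from "npm:@supabase/supabase-js";

const supabase = createClient(
  Deno.env.get("SUPABASE_URL")!,
  Deno.env.get("SUPABASE_ANON_KEY")!,
);

async function semanticSearch(query: string) {
  const queryEmbedding = await embed(query); // embed() from the earlier sketch
  const { data, error } = await supabase.rpc("match_posts", {
    query_embedding: queryEmbedding,
    match_count: 5,
  });
  if (error) throw error;
  return data; // articles sorted by ascending cosine distance
}
```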

However, if the user searches for short keywords like node.js, javascript, etc., semantic search will likely return nothing: the query is too short, the generated vector doesn't carry enough meaning, and the cosine distance ends up too large. To handle this case, I need to keep a full-text search mechanism as a fallback. Fortunately, Supabase supports this type of search as well.
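
So the final flow falls back to full-text search when the semantic pass comes up empty. A sketch using supabase-js's textSearch, assuming a hypothetical fts tsvector column on the same table:

```ts
// Sketch of the hybrid fallback: semantic first, full-text second.
// Reuses `supabase` and `semanticSearch()` from the previous sketch.
async function search(query: string) {
  const semantic = await semanticSearch(query);
  if (semantic && semantic.length > 0) return semantic;
  const { data, error } = await supabase
    .from("posts")
    .select("id, title, summary")
    .textSearch("fts", query, { type: "websearch" });
  if (error) throw error;
  return data ?? [];
}
```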

Challenges

Looking back, it all seems simple, but the most challenging part for me was data preprocessing.

An article usually conveys multiple ideas, including main and secondary content. Typically, searchers only care about the main content and tend to search for things related to it. If I convert the entire article into a vector, the result gets "diluted" or "noisy" because the vector's size is limited. If I can strip out the secondary information and emphasize the main idea, the search should be more accurate. Imagine a 1,500-word article squeezed into a 1024-dimensional vector, versus a 500-word summary of just the main content in a vector of the same size. Which one represents the data more "clearly"?

Users' search patterns are also hard to predict because everyone searches differently. Some people like to keep it short, while others like to write longer or provide context for their questions... Therefore, processing user input data is also a challenge. How can I convert it into a concise and relevant query that matches the search content on the blog?

The quality of the AI model matters too. Generally, the more thoroughly a model is trained, the better it performs, and commercial models come with quality guarantees. To minimize costs, though, I'm currently using free models with their limitations. Hopefully, one day I'll be able to integrate more powerful models to improve the blog's search quality.
