Semantic Search Feature

Daily short news for you
  • Privacy Guides is a non-profit project that gives users insight into their privacy rights and recommends best practices and tools for reclaiming privacy in the online world.

    There are many great articles here; as an example, take three concepts that are often confused or misrepresented: privacy, security, and anonymity. Many people who oppose privacy argue that a person doesn't need privacy if they have 'nothing to hide.' 'This is a dangerous misconception, as it creates the impression that those who demand privacy must be deviant, criminal, or wrongdoers.' - Why Privacy Matters.

    » Read more
  • If you're looking for a wonderful place to learn, or you're stuck thinking there's nothing left to learn, the comments over at Hacker News are just for you.

    Y Combinator, the company behind Hacker News, focuses on venture capital investment in Silicon Valley startups, so it's no surprise that many brilliant minds comment there. Their casual discussions are full of keywords that can open up many new insights.

    Don't believe it? Just scroll a bit, click on a post that matches your interests, check out the comments, and don’t forget to grab a cup of coffee next to you ☕️

    » Read more
  • Just got played by my buddy Turso. The server suddenly crashed, and checking the logs revealed a lot of errors:

    Operation was blocked LibsqlError: PROXY_ERROR: error executing a request on the primary

    Suspicious, I went to the Turso admin panel and saw the statistics showing that I had executed over 500 million write commands!? At that moment, I was like, "What the heck? Am I being DDoSed? But there's no way I could have written 500 million."

    Turso gives users free monthly limits of 1 billion read requests and 25 million write requests, yet I had somehow written over 500 million. Doesn't that seem absurd? 😆 But the server was down, and should I really spend money to bring it back online? A rough calculation put those 500M writes at about $500.

    After that, I went to their Discord channel seeking help. Someone came to assist me very quickly, and just a few minutes later they confirmed the error was on their side and restored my service. Truly a blessing in disguise; what I love most about this service is quick support like that 🙏

    » Read more

Problem

Hi there, 2coffee.dev readers. Have you noticed the autumn atmosphere settling over Hanoi recently? Mornings are cool, and evenings come with strong winds. Behind all that, though, was a busy week for me. I was focused on chasing the deadline for my company's project, and in the evenings I tried to finish the search feature for my blog. This deadline was different from usual because it was the product's main feature for the year. As for the blog, the search feature had to be completed sooner or later, and this was the perfect time to do it.

Before switching to Fresh, my blog already had a search feature. Back then I used Postgres's full-text search. For those who don't know, before Postgres I also used RediSearch. Generally speaking, Postgres gave better results, while RediSearch was more complex. In reality, though, my blog's data wasn't that massive, so RediSearch never had a chance to shine.

When I switched to Fresh, AI was booming; everyone was talking about AI and what it could do. After finishing the basic features and getting ready to work on search, I thought, "Why not try using AI?" So I decided to "release" the new version of the blog without a search function for the time being.

To build a search feature with AI, I had to spend a lot of time researching and experimenting: how to implement it, how to use LLMs and embedding models, vector data types, how to convert data into vectors, how to query them...

To put it simply, a vector is a finite, ordered list of numbers, just like in mathematics. The number of elements determines the vector's size (dimension). The larger the size, the more information the vector can capture about the data it represents. There are many ways to convert ordinary data (text, speech, images, etc.) into vectors, but thanks to the popularity of LLMs today, you can simply feed the data into an embedding model and it will give you back vector data.
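
To make that concrete, here's a toy sketch in TypeScript. Real embedding vectors have hundreds of dimensions, but the comparison idea is the same; the cosine similarity defined here is the same measure the search relies on later:

```typescript
// Toy 4-dimensional vectors; nomic-embed-text, used later, outputs 768 numbers.
const a = [0.12, -0.04, 0.33, 0.91];
const b = [0.10, -0.02, 0.35, 0.88];

// Cosine similarity: 1 means same direction, 0 unrelated, -1 opposite.
function cosineSimilarity(x: number[], y: number[]): number {
  let dot = 0, nx = 0, ny = 0;
  for (let i = 0; i < x.length; i++) {
    dot += x[i] * y[i];
    nx += x[i] * x[i];
    ny += y[i] * y[i];
  }
  return dot / (Math.sqrt(nx) * Math.sqrt(ny));
}

console.log(cosineSimilarity(a, b).toFixed(3)); // ~0.999: very similar vectors
```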

Semantic search differs from traditional full-text keyword search. Full-text search matches the characters of the entered text against stored text to return the most relevant results, while semantic search works on meaning. Suppose your article explains how node.js works. When someone searches for "node.js hoạt động như thế nào?" (how does node.js work?), semantic search can still find the article. Full-text search, on the other hand, will look for articles containing the words "node.js", "hoạt động", "như", ...

To query vector data, you need at least two steps: first convert the query into a vector, then use the query functions. For example, pg-vector - a Postgres extension that supports vectors - provides query functions like these:

(Image: pg-vector's vector query functions)

You can see L2 distance, cosine distance, L1 distance... as vector comparison methods. Depending on the use case, you pick the appropriate query type. For the search problem, I chose cosine distance - that is, it checks whether the two vectors point in a similar direction.
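
As a sketch of what such a query looks like in practice (the `posts` table and `embedding` column are illustrative names; the operators follow pg-vector's documentation):

```typescript
import { Pool } from "npm:pg"; // npm: specifier under Deno; plain "pg" under Node

// pg-vector's distance operators, usable directly in SQL:
//   <->  L2 (Euclidean) distance     <=>  cosine distance
//   <#>  negative inner product      <+>  L1 distance (pg-vector >= 0.7.0)
const pool = new Pool({ connectionString: Deno.env.get("DATABASE_URL") });

// Sorting ascending by cosine distance returns the most similar rows first.
export async function nearestPosts(queryVector: number[], limit = 5) {
  const { rows } = await pool.query(
    `select slug, title, embedding <=> $1::vector as distance
       from posts
      order by embedding <=> $1::vector
      limit $2`,
    [JSON.stringify(queryVector), limit], // '[0.1, 0.2, ...]' casts to vector
  );
  return rows;
}
```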

How to do it

(Image: flow)

First, choose a suitable database. I'm using Turso as my main database, but Turso is based on SQLite, which isn't optimized for vector data. They have introduced an extension to support vectors, but it's a bit complicated.

pg-vector is the opposite: it's widely used, and it's a Postgres extension. When it comes to Postgres, I think of Supabase, which offers a free tier. Supabase has pg-vector integrated, and activating it is just a click away, making it a great choice.
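
As a sketch, the one-time setup on the Supabase side could look like this (kept as a string to paste into the SQL editor; the table and column names are my illustrative choices, and 768 matches nomic-embed-text's output dimension):

```typescript
// One-time setup for Supabase's SQL editor (pg-vector can also be enabled
// with a toggle in the dashboard).
const setupSql = `
  create extension if not exists vector;

  create table if not exists posts (
    id        bigint generated always as identity primary key,
    slug      text unique not null,
    title     text not null,
    summary   text,
    embedding vector(768)
  );
`;
```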

Next is choosing the models. To save costs, I looked for free models from the start. I can't help but mention groq and its Completions API. However, groq doesn't offer embedding models, so I had to find another option.

nomic-embed-text is an embedding model I found in Ollama's library; it can vectorize text. In addition, Nomic provides a free embeddings API with some limitations. I should note, though, that nomic-embed-text isn't a multilingual model: its Vietnamese support is limited, so the generated vectors may not capture Vietnamese semantics well.
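
For example, a minimal helper that turns text into a vector through a local Ollama instance (this follows Ollama's embeddings endpoint; run `ollama pull nomic-embed-text` first):

```typescript
// Convert a piece of text into a 768-dimensional vector with nomic-embed-text.
export async function embed(text: string): Promise<number[]> {
  const res = await fetch("http://localhost:11434/api/embeddings", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "nomic-embed-text", prompt: text }),
  });
  const { embedding } = await res.json();
  return embedding;
}
```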

After preparing everything, it's time to write code to add vector data and search logic.

First, convert each article's content into a vector and store it in Supabase. Instead of converting the entire article, I summarize its main content before feeding it into nomic-embed-text. This removes unnecessary information and reduces the number of input tokens the model has to process.

Another note: although these models have free APIs, they always come with limits. Processing the data for the first time is the most demanding step, since I have over 400 articles in both Vietnamese and English. A better approach is to run the Llama 3.2 3B and nomic-embed-text models locally; I use LM Studio for this.
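
Put together, the indexing pipeline looks roughly like this sketch (`embed` is the Ollama helper above; the chat endpoint is LM Studio's OpenAI-compatible local server, and the table shape matches the earlier setup sketch):

```typescript
import { createClient } from "npm:@supabase/supabase-js";

const supabase = createClient(
  Deno.env.get("SUPABASE_URL")!,
  Deno.env.get("SUPABASE_SERVICE_ROLE_KEY")!,
);

// Summarize an article with a small local model through LM Studio's
// OpenAI-compatible server (it listens on port 1234 by default).
async function summarize(content: string): Promise<string> {
  const res = await fetch("http://localhost:1234/v1/chat/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "llama-3.2-3b-instruct", // whichever model is loaded in LM Studio
      messages: [
        { role: "system", content: "Summarize the article's main ideas in one short paragraph." },
        { role: "user", content },
      ],
    }),
  });
  const data = await res.json();
  return data.choices[0].message.content;
}

// Index one article: summarize, embed the summary, store both in Supabase.
export async function indexArticle(slug: string, title: string, content: string) {
  const summary = await summarize(content);
  const embedding = await embed(summary); // helper from the Ollama sketch above
  await supabase.from("posts").upsert({ slug, title, summary, embedding });
}
```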

The search logic is simple: take the user's query -> pass it through nomic-embed-text to get a vector -> run a cosine-distance query against the article vectors and sort by the closest distance.
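
As a sketch, that flow can be wired up with a Postgres function defined once in Supabase and called over RPC (the `match_posts` name and shape are illustrative; this reuses the `supabase` client and `embed` helper from the sketches above):

```typescript
// SQL defined once in Supabase's SQL editor: sort ascending by cosine distance.
//   create or replace function match_posts(query_embedding vector(768), match_count int)
//   returns table (slug text, title text, distance float)
//   language sql stable as $$
//     select slug, title, embedding <=> query_embedding as distance
//     from posts
//     order by embedding <=> query_embedding
//     limit match_count;
//   $$;

export async function semanticSearch(query: string) {
  const queryEmbedding = await embed(query); // nomic-embed-text via Ollama
  const { data, error } = await supabase.rpc("match_posts", {
    query_embedding: queryEmbedding,
    match_count: 5,
  });
  if (error) throw error;
  return data; // articles sorted from closest to farthest
}
```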

However, if the user searches for short keywords like node.js or javascript, semantic search will likely return nothing: the input is too short, the generated vector doesn't carry enough meaning, and the cosine distance ends up too large. To handle this case, I keep a full-text search mechanism as well. Fortunately, Supabase supports this kind of search too, as in the sketch below.
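
A sketch of that fallback, reusing `semanticSearch` from above (supabase-js's `textSearch` wraps Postgres full-text search; the 0.5 distance cutoff is an arbitrary illustrative threshold):

```typescript
// Try semantic search first; fall back to full-text for short keyword queries.
export async function search(query: string) {
  const semantic = await semanticSearch(query);
  const close = semantic.filter((r: { distance: number }) => r.distance < 0.5);
  if (close.length > 0) return close;

  // Full-text fallback, e.g. for queries like "node.js" or "javascript".
  const { data, error } = await supabase
    .from("posts")
    .select("slug, title")
    .textSearch("title", query, { type: "websearch" })
    .limit(5);
  if (error) throw error;
  return data;
}
```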

Challenges

Looking back, it all seems simple, but the most challenging part for me was data preprocessing.

An article usually conveys multiple ideas, both main and secondary. Searchers are typically interested only in an article's main content, and they tend to search for things related to it. If I convert the entire article into a vector, the result gets "diluted" or "noisy" because the vector's size is limited. If I can strip out the secondary information and emphasize the main idea, the search should be more accurate. Imagine a 1,500-word article squeezed into a 1024-dimensional vector, versus the same vector holding only the 500 words of main content. Which one represents the data more "clearly"?

Users' search patterns are also hard to predict, because everyone searches differently. Some keep it short, while others write longer queries or add context to their questions... So processing user input is a challenge too: how do I turn it into a concise, relevant query that matches the content on the blog?

The quality of the AI models is also an issue. Generally, the more training a model has received, the better it performs, and commercial models come with quality guarantees. To minimize costs, though, I'm currently using free models with their limitations. Hopefully, one day I'll be able to plug in more powerful models to improve the blog's search quality.
