Full-text search in Redisearch

Problem

Searching data quickly and accurately is a perennial problem for developers. Depending on the purpose, the data, and the available resources, we can choose different tools and methods.

For example, with a small dataset you can use the LIKE operator in SQL. As the data grows, LIKE stops being a good option: a pattern with a leading wildcard such as LIKE '%term%' generally cannot use an ordinary index and has to scan every row. At that point, you can switch to the fulltext search module built into your database system. However, these modules are often only a stopgap, since they may lack features offered by dedicated fulltext search engines.

Elasticsearch and Apache Solr are two very powerful search engines that are widely used by the community. However, their hardware requirements are far from modest, which puts them out of reach for projects with limited budgets.

Redisearch, a module built on Redis, provides a powerful search engine with low resource consumption that you can easily integrate into your projects. For those who don't know, Redis is a key-value database that keeps data in random-access memory (RAM) for fast access and is often used as a cache. Although it is small, Redisearch is a capable alternative to those larger engines. In this article, let's see what Redisearch can do.

Creating an index

The first step is to create an index. The index tells the search engine which fields to process and how, so that queries can run efficiently.

Creating an index in Redisearch is very simple. For example, here I create an index for searching articles with three fields: title, content, and created_at, holding the title, content, and creation date of each article.

FT.CREATE article ON HASH PREFIX 1 article: SCHEMA title TEXT WEIGHT 5.0 content TEXT created_at NUMERIC SORTABLE

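My index is named "article", and it covers every hash whose key starts with the prefix "article:". On the "title" field, I set WEIGHT to 5.0 so that matches in "title" score higher than matches in "content" (the default weight is 1.0). "created_at" is declared as SORTABLE so that results can be sorted by it efficiently. Depending on the Redisearch version, sorting by a field that is not SORTABLE is either rejected or noticeably slower, so declare SORTABLE on any field you plan to sort by.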
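If you create indexes from application code, the same command can be assembled as a list of arguments. The following is a minimal sketch: build_create_index is a hypothetical helper written for this article, not part of any Redis client library, and it only handles TEXT fields and sortable NUMERIC fields on a HASH index with a single prefix.

```python
def build_create_index(name, prefix, text_fields, sortable_numeric_fields):
    """Build the argument list of an FT.CREATE command for a HASH index.

    text_fields maps a field name to its weight (1.0 is the default weight).
    This helper is hypothetical and covers only the options used in this article.
    """
    args = ["FT.CREATE", name, "ON", "HASH", "PREFIX", "1", prefix, "SCHEMA"]
    for field_name, weight in text_fields.items():
        args += [field_name, "TEXT"]
        if weight != 1.0:
            # Only emit WEIGHT when it differs from the default.
            args += ["WEIGHT", str(weight)]
    for field_name in sortable_numeric_fields:
        args += [field_name, "NUMERIC", "SORTABLE"]
    return args


command = build_create_index(
    "article", "article:", {"title": 5.0, "content": 1.0}, ["created_at"]
)
print(" ".join(command))
```

The printed line is the same FT.CREATE command as above. With a client such as redis-py, a list like this can typically be sent with execute_command(*command).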

Alright, after creating the index, let's learn how to search the data.

Search principles

Before getting into the search syntax, you need to know some search principles in Redisearch:

  • When searching for several words, for example hello world, you are looking for documents that contain both "hello" AND "world", in any order or position.
  • To search for the exact phrase, put it in double quotes: "hello world".
  • To search for documents that contain either "hello" OR "world", separate the words with the | character, for example hello|world.
  • To express NOT, use the - character. For example, hello -world finds items that contain "hello" but not "world". You can also negate a group: -@title:(hello|world) finds items whose title contains neither "hello" nor "world".
  • By default, if no field is specified, Redisearch searches all TEXT fields of the index. To restrict the search to one field, use the syntax @field:query. The field modifier applies only to the term or group that immediately follows it, so wrap multi-word queries in parentheses, for example @title:(hello world).
  • Searching on a NUMERIC field must use the [min max] syntax, for example @created_at:[1630245601 1630245603].
  • Searching on a TAG field must use the {tag1 | tag2 | ...} syntax.
  • Fuzzy matching finds terms that are close to the search term, measured by Levenshtein distance, which makes the search tolerant of typos. The syntax wraps the term in % on both sides: %hello% matches terms within a distance of 1, such as "hallo".
  • ...

There are a few more principles that you can refer to at Search Query Syntax.
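Fuzzy matching in Redisearch is based on Levenshtein distance: a term wrapped as %hello% matches indexed terms at most one edit away, and %%hello%% allows two edits. To make "one edit away" concrete, here is a small sketch of the distance itself. It illustrates the metric only and is not how Redisearch computes matches internally.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, or substitutions needed to turn a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


print(levenshtein("hello", "hallo"))  # 1, so %hello% would match "hallo"
print(levenshtein("hello", "help"))   # 2, so it needs %%hello%%
```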

And finally, a handy cheatsheet comparing some search commands in SQL with their Redisearch equivalents:

[Image: cheatsheet comparing SQL queries with Redisearch queries]

Search syntax

First, let's add some data to Redis that matches the "article" index created above. To keep the results easy to follow, I will add just three small records.

HSET article:1 url "url-1" title "article number one" content "content of article number one" created_at 1630245601
HSET article:2 url "url-2" title "article number two" content "content of article number two" created_at 1630245602
HSET article:3 url "url-3" title "article number three" content "content of article number three" created_at 1630245603

Search for all records containing the term "article":

FT.SEARCH article "article"

Search for all records where the content contains the term "content":

FT.SEARCH article "@content:content"

Search for all records where the title contains the term "article" and the content does not contain the phrase "number one". A field modifier applies only to the term or group right after it, so the phrase is quoted (single quotes around the whole query let us use double quotes inside it in redis-cli):

FT.SEARCH article '@title:article -@content:"number one"'

Search for all records where the title contains the phrase "number one" or "number two" and the content does not contain the phrase "number three", sorted in descending order of "created_at":

FT.SEARCH article '@title:("number one"|"number two") -@content:"number three"' SORTBY created_at DESC
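When queries are built from user input, it can help to compose them from small pieces instead of concatenating strings by hand. The sketch below is one possible approach; every function name in it (field, phrase, any_of, negate, all_of, numeric_range) is made up for this illustration. It builds a query with the same intent as the last example: title contains either phrase, content does not contain "number three".

```python
def phrase(text: str) -> str:
    """Exact phrase: wrap the text in double quotes."""
    return f'"{text}"'


def any_of(*parts: str) -> str:
    """Union (OR): join the parts with |."""
    return "|".join(parts)


def all_of(*parts: str) -> str:
    """Intersection (AND): join the parts with spaces."""
    return " ".join(parts)


def field(name: str, query: str) -> str:
    """Restrict a sub-query to one field. The parentheses make the
    field modifier apply to the whole sub-query, not just its first term."""
    return f"@{name}:({query})"


def negate(query: str) -> str:
    """NOT: prefix the sub-query with -."""
    return f"-{query}"


def numeric_range(name: str, low, high) -> str:
    """Inclusive range on a NUMERIC field."""
    return f"@{name}:[{low} {high}]"


query = all_of(
    field("title", any_of(phrase("number one"), phrase("number two"))),
    negate(field("content", phrase("number three"))),
)
print(query)
# @title:("number one"|"number two") -@content:("number three")

print(numeric_range("created_at", 1630245601, 1630245602))
# @created_at:[1630245601 1630245602]
```

These helpers do not escape special characters in user input, so real code would need to escape the terms before passing them in.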

Stop Words

Stop words are terms that Redisearch ignores during indexing and search because they are too common to help distinguish results, for example a, is, the... Indexing them would take up a lot of storage space and consume CPU resources during search.

Because Redisearch is designed for a general audience, it ships with only a default list of English stop words. However, you can define your own list, for example Vietnamese equivalents of those words, or any other words that you don't want to be searchable.

Stop words are declared when creating the index. In the example below, I set the stop words of the "article" index to the 2 words "thì" and "là". Note that a custom STOPWORDS list replaces the default English list rather than extending it, so include any default words you still want ignored:

FT.CREATE article ON HASH PREFIX 1 article: STOPWORDS 2 thì là SCHEMA title TEXT WEIGHT 5.0 content TEXT created_at NUMERIC SORTABLE

Note: Since stop words can only be set when creating the index, if you already have an index you must delete it and create it again. Use the FT.DROPINDEX command to delete the index. By default, deleting an index does not delete the underlying hashes (that only happens with the DD option). Then re-create the index as usual.

If you don't want any stop words at all, set STOPWORDS 0 in the index creation command.
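To see what ignoring stop words means in practice, here is a rough sketch of the filtering step. The English set below is only a handful of entries from the default list, not the full list, and the sample sentences are made up for the illustration; Redisearch does this internally, so the code is just a model of the effect.

```python
# A few entries from the default English stop word list (not the full list).
ENGLISH_SAMPLE = {"a", "an", "and", "is", "of", "the"}

# The custom list from the FT.CREATE example with STOPWORDS 2.
CUSTOM = {"thì", "là"}


def remove_stopwords(tokens, stopwords):
    """Drop every token that appears in the stop word set (case-insensitive)."""
    return [token for token in tokens if token.lower() not in stopwords]


print(remove_stopwords("content of the article".split(), ENGLISH_SAMPLE))
# ['content', 'article']

print(remove_stopwords("Redis thì nhanh".split(), CUSTOM))
# ['Redis', 'nhanh']

# With an empty set (STOPWORDS 0) nothing is removed.
print(remove_stopwords("content of the article".split(), set()))
# ['content', 'of', 'the', 'article']
```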

Tokenization and Escaping

Tokenization is the process of splitting text into individual terms (tokens), and escaping is how you tell Redisearch to treat a special character literally. Both the indexed data and the query go through this processing, which removes whitespace, special characters... Here are some tokenization rules in Redisearch:

  • The characters ,.<>{}[]"':;!@#$%^&*()-+=~ and whitespace are separators that split the text into tokens for indexing. For example, the text "hello-world...1" is tokenized as [hello world 1].
  • If you want to bypass the above rule, i.e., you want Redisearch to index a special character as part of a token, you need to add a backslash (\) before each special character. For example, to index "hello-world" as a single token, store the text as hello\-world, and use hello\-world when searching as well.
  • The underscore (_) character is not a separator, so hello_world stays a single token.
  • Repeated whitespace and repeated separator characters from the first rule are stripped. If you want to keep them, you must add a backslash before each one.
  • Latin characters (A-Z) are converted to lowercase.
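The rules above can be modeled in a few lines of code. This is a simplified sketch written for this article, not Redisearch's actual tokenizer: it ignores stop words, stemming, and language-specific handling, and the names tokenize and escape are hypothetical.

```python
PUNCTUATION = set(",.<>{}[]\"':;!@#$%^&*()-+=~")
WHITESPACE = set(" \t\n")
SEPARATORS = PUNCTUATION | WHITESPACE


def tokenize(text):
    """Split text on separators, honor backslash escapes, lowercase the rest."""
    tokens, current = [], []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            # An escaped character is kept as part of the current token.
            current.append(text[i + 1].lower())
            i += 2
            continue
        if ch in SEPARATORS:
            # Any run of separators ends the current token exactly once.
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch.lower())
        i += 1
    if current:
        tokens.append("".join(current))
    return tokens


def escape(text):
    """Put a backslash before every punctuation separator (not whitespace)."""
    return "".join("\\" + ch if ch in PUNCTUATION else ch for ch in text)


print(tokenize("hello-world...1"))     # ['hello', 'world', '1']
print(tokenize("Hello_World"))         # ['hello_world']
print(escape("hello-world"))           # hello\-world
print(tokenize(escape("hello-world"))) # ['hello-world']
```

An escape helper like this is useful on both sides: apply it to values before storing them and to user input before building a query, so that both produce the same tokens.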

Those are some rules for TEXT fields. TAG fields behave somewhat differently, which I will discuss in a future article.

Highlighting Result

The highlighting API lets us mark up the parts of a document that matched the query, for example by inserting extra characters around them to make the results stand out.

To wrap each matched term in an HTML tag, for example an opening <b> and a closing </b>, we use the HIGHLIGHT option:

FT.SEARCH article "article" HIGHLIGHT TAGS <b> </b>

Matched terms in every returned field are wrapped in the <b> and </b> tags. To highlight only specific fields, add the FIELDS option:

FT.SEARCH article "article" HIGHLIGHT FIELDS 1 title TAGS <b> </b>

In addition, Redisearch can return only the context around the matched terms instead of the whole field. For example, if a long text contains the sentence "estacks is a blog about programming", searching for the word "blog" displays a fragment such as "...estacks is a blog about programming...". This is done with the SUMMARIZE option:

FT.SEARCH article "article" SUMMARIZE FIELDS 1 content

You can also combine both HIGHLIGHT and SUMMARIZE in one query.
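To picture what the two options return, here is a rough client-side imitation. It is only an illustration of the effect under simple assumptions (whole-word, case-insensitive matching of a single term, and a fixed number of context words); the highlight and summarize functions are hypothetical, and Redisearch's own fragment selection is more sophisticated.

```python
import re


def highlight(text, term, open_tag="<b>", close_tag="</b>"):
    """Wrap every whole-word, case-insensitive occurrence of term in tags."""
    pattern = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
    return pattern.sub(lambda m: f"{open_tag}{m.group(0)}{close_tag}", text)


def summarize(text, term, radius=2, separator="..."):
    """Return the first occurrence of term with `radius` words of context
    on each side, or an empty string if the term does not occur."""
    words = text.split()
    for i, word in enumerate(words):
        if word.lower().strip(".,") == term.lower():
            start = max(0, i - radius)
            end = min(len(words), i + radius + 1)
            return separator + " ".join(words[start:end]) + separator
    return ""


print(highlight("content of article number one", "article"))
# content of <b>article</b> number one

print(summarize("estacks is a blog about programming and technology", "blog"))
# ...is a blog about programming...
```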

Conclusion

Through this article, I hope you understand what Redisearch is used for, whether it is suitable or necessary for your upcoming projects, and the basic commands to get started. Keep learning new tools to have more ways to solve problems!

Author

Hello, my name is Hoai - a developer who tells stories through writing ✍️ and creating products 🚀. With many years of programming experience, I have contributed to various products that bring value to users at my workplace as well as to myself. My hobbies include reading, writing, and researching... I created this blog with the mission of delivering quality articles to the readers of 2coffee.dev. Follow me through these channels: LinkedIn, Facebook, Instagram, Telegram.
