Searching fulltext in Redisearch

Problem

Searching data quickly and accurately is a perennial problem for developers. Depending on the purpose, the problem, and the available resources, we can choose different tools and methods.

For example, with a small dataset you can use the LIKE operator in SQL. As the data grows, LIKE stops being a good option, because a pattern such as LIKE '%word%' generally cannot use an ordinary index and has to scan every row. At that point, you can switch to the fulltext search module built into your database system. However, these modules are often only a stopgap, as they may lack features that dedicated fulltext search tools provide.

Elasticsearch and Apache Solr are two very powerful search engines that are widely used by the community. However, their hardware requirements are far from modest, which puts them out of reach for projects with limited budgets.

Redisearch, a module built on Redis, provides a powerful search engine with low resource consumption that you can easily integrate into your projects. For those who don't know, Redis is a key-value database that stores data in random-access memory (RAM) for fast access and is often used as a cache. Although it is small, Redisearch holds its own against its predecessors. In this article, let's see what Redisearch can do.

Creating an index

The first thing to do is to create an index for searching. An index tells the search engine which fields your data has and how each of them should be processed for searching.

Creating an index is very simple in Redisearch. For example, here is an index for searching articles with three fields: title, content, and created_at, corresponding to the title, content, and creation date of the article.

FT.CREATE article ON HASH PREFIX 1 article: SCHEMA title TEXT WEIGHT 5.0 content TEXT created_at NUMERIC SORTABLE

My index is named "article". ON HASH PREFIX 1 article: tells Redisearch to index every hash whose key starts with "article:". On the "title" field, I set WEIGHT 5.0 so that matches in "title" rank higher than matches in "content". "created_at" is declared as SORTABLE so that search results can be sorted by it; any field you want to sort by should be declared SORTABLE.
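If the index is created from application code rather than typed into the CLI, the same command can be assembled as an argument list. The sketch below is only an illustration: `build_ft_create` is a hypothetical helper, it supports a single key prefix, and it only builds the arguments without contacting Redis.

```python
def build_ft_create(index, prefix, fields):
    """Return the FT.CREATE arguments as a list of strings.

    `fields` is a list of (name, type, options) tuples. This helper is
    hypothetical: it only assembles the command and never talks to Redis.
    """
    args = ["FT.CREATE", index, "ON", "HASH", "PREFIX", "1", prefix, "SCHEMA"]
    for name, field_type, options in fields:
        args += [name, field_type, *options]
    return args


args = build_ft_create(
    "article",
    "article:",
    [
        ("title", "TEXT", ["WEIGHT", "5.0"]),   # title matches rank higher
        ("content", "TEXT", []),
        ("created_at", "NUMERIC", ["SORTABLE"]),  # needed for SORTBY
    ],
)
print(" ".join(args))
```

The printed line is the same FT.CREATE command shown above. With a client library such as redis-py, a list like this could be sent with `execute_command(*args)`.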

Alright, after creating the index, let's learn how to search the data.

Search principles

Before getting into the search syntax, you need to know some search principles in Redisearch:

  • A query with several words, for example hello world, matches items that contain both "hello" AND "world", in any position.
  • To search for the exact phrase "hello world", put it in double quotes: "hello world".
  • To match items that contain either "hello" OR "world", separate the terms with the | character, for example hello|world.
  • To express NOT, prefix the term with the - character. For example, hello -world matches items that contain "hello" but not "world". You can negate several words at once by combining NOT with OR, for example -@title:(hello|world) matches items whose title contains neither "hello" nor "world".
  • By default, if no field is specified, Redisearch searches all TEXT fields of the index. To restrict the search to one field, use the syntax @field:query, for example @title:hello. For a multi-word query, wrap it in parentheses, as in @title:(hello world), so that the field clearly applies to every word.
  • Searching on a NUMERIC field uses the range syntax @field:[min max], for example @created_at:[1630245601 1630245603].
  • Searching on a TAG field uses the syntax @field:{tag1 | tag2 | ...}.
  • Fuzzy matching finds terms that are close to the one you typed, measured by Levenshtein (edit) distance, so small typos still match. The syntax is %text%.
  • ...

There are a few more principles that you can refer to at Search Query Syntax.
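The principles above can be turned into a few small string helpers. This is a minimal sketch, not part of any Redisearch client: every function name here is made up for illustration, the `category` TAG field is hypothetical (the "article" index in this post has no TAG field), and the helpers do no escaping of special characters.

```python
def phrase(text):
    """Exact phrase: wrap the words in double quotes."""
    return '"' + text + '"'


def any_of(*terms):
    """OR: join the alternatives with | inside parentheses."""
    return "(" + "|".join(terms) + ")"


def negate(expr):
    """NOT: prefix the expression with -."""
    return "-" + expr


def in_field(field, expr):
    """Restrict an expression to one field; parentheses keep the scope explicit."""
    return "@" + field + ":(" + expr + ")"


def numeric_range(field, low, high):
    """NUMERIC fields use the [min max] range syntax."""
    return "@" + field + ":[" + str(low) + " " + str(high) + "]"


def tags(field, *values):
    """TAG fields use the {tag1 | tag2} syntax."""
    return "@" + field + ":{" + " | ".join(values) + "}"


def fuzzy(term):
    """Fuzzy match: wrap the term in % signs."""
    return "%" + term + "%"


# Terms separated by spaces are combined with AND.
query = " ".join([
    in_field("title", "hello world"),
    negate(in_field("content", "spam")),
    numeric_range("created_at", 1630245601, 1630245603),
])
print(query)
```

The resulting string is what you would pass as the query argument of FT.SEARCH.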

And finally, a holy cheatsheet to compare some data search commands between SQL and Redisearch:

Search syntax

First, let's add some data to Redis under the "article:" prefix so that the "article" index we created above picks it up. To keep the results easy to read, I will add just three small records.

HSET article:1 url "url-1" title "article number one" content "content of article number one" created_at 1630245601
HSET article:2 url "url-2" title "article number two" content "content of article number two" created_at 1630245602
HSET article:3 url "url-3" title "article number three" content "content of article number three" created_at 1630245603

Search for all records containing the term "article":

FT.SEARCH article "article"

Search for all records where the content contains the term "content":

FT.SEARCH article "@content:content"

Search for all records where the title contains the term "article" and the content does not contain both "number" and "one" (the parentheses keep both words inside the negated content clause):

FT.SEARCH article "@title:article -@content:(number one)"

Search for all records where the title contains "number one" or "number two" and the content does not contain "number three", sorted in descending order of "created_at". Parentheses make the grouping of AND, OR, and NOT explicit:

FT.SEARCH article "@title:((number one) | (number two)) -@content:(number three)" SORTBY created_at DESC
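To sanity-check what the last two queries are meant to return, the intended boolean logic can be replayed over the three sample records with plain Python. This is only a model of the AND/OR/NOT logic described above, under the assumption that a multi-word clause means "all of these words"; it does not call Redisearch, and `has_all` is a helper invented for this sketch.

```python
docs = {
    "article:1": {"title": "article number one",
                  "content": "content of article number one",
                  "created_at": 1630245601},
    "article:2": {"title": "article number two",
                  "content": "content of article number two",
                  "created_at": 1630245602},
    "article:3": {"title": "article number three",
                  "content": "content of article number three",
                  "created_at": 1630245603},
}


def has_all(text, *terms):
    """True when every term appears as a whole word in the text."""
    words = set(text.lower().split())
    return all(term in words for term in terms)


# Title contains "article" AND content does NOT contain "number" and "one".
q1 = [key for key, doc in docs.items()
      if has_all(doc["title"], "article")
      and not has_all(doc["content"], "number", "one")]

# Title contains ("number one" OR "number two") AND content does NOT
# contain "number three", newest first.
matches = [key for key, doc in docs.items()
           if (has_all(doc["title"], "number", "one")
               or has_all(doc["title"], "number", "two"))
           and not has_all(doc["content"], "number", "three")]
q2 = sorted(matches, key=lambda key: docs[key]["created_at"], reverse=True)

print(q1)  # ['article:2', 'article:3']
print(q2)  # ['article:2', 'article:1']
```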

Stop Words

Stop words are terms that Redisearch ignores, both when indexing and when searching, because they are too common to add any value to the search. Examples are "a", "is", and "the". If these words were indexed, they would take up a lot of storage space and consume CPU resources during search.

Because Redisearch is designed for a general audience, it only ships with a default set of English stop words. However, you can define your own list, for example Vietnamese stop words, or any other words you don't want to be searchable.

Stop words are declared when creating the index, and a custom list replaces the default one. In the example below, I set the two words "thì" and "là" as the stop words of the "article" index:

FT.CREATE article ON HASH PREFIX 1 article: STOPWORDS 2 thì là SCHEMA title TEXT WEIGHT 5.0 content TEXT created_at NUMERIC SORTABLE

Note: Since stop words must be declared when creating the index, changing them on an existing index means dropping the index and creating it again. Use the FT.DROPINDEX command to delete the index. By default, dropping an index does not delete the indexed data, so the hashes stay in Redis. Then we re-create the index as usual with the new stop words.

If you don't want any stop words at all, set STOPWORDS 0 in the index creation command.
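The effect of a stop word list can be pictured with a few lines of Python. This is a simplified sketch of the idea, not Redisearch's implementation: `index_terms` is a made-up helper that splits on whitespace only, and `DEFAULT_STOPWORDS` is a tiny illustrative subset rather than the real default list.

```python
# Tiny illustrative subset; the real default English list is longer.
DEFAULT_STOPWORDS = frozenset({"a", "is", "the"})


def index_terms(text, stopwords=DEFAULT_STOPWORDS):
    """Lowercase the text, split on whitespace, and drop the stop words."""
    return [word for word in text.lower().split() if word not in stopwords]


# Default list: "is" and "a" are dropped.
print(index_terms("Redis is a database"))

# Custom list, like STOPWORDS 2 thì là: it replaces the default list.
print(index_terms("Redis thì nhanh", stopwords={"thì", "là"}))

# Empty list, like STOPWORDS 0: every word is kept.
print(index_terms("Redis is a database", stopwords=set()))
```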

Tokenization and Escaping

Tokenization is the step that splits text into the terms (tokens) that get indexed, and escaping is how you keep a character that would otherwise be treated as a separator. Both the data passed to Redisearch and the query text go through this processing. Here are some tokenization rules in Redisearch:

  • The characters ,.<>{}[]"':;!@#$%^&*()-+=~ and whitespace break the text into tokens for indexing. For example, the text "hello-world...1" is tokenized as [hello world 1].
  • If you want to bypass the rule above, i.e., you want Redisearch to index special characters and whitespace, you need to add a backslash (\) before each special character. For example, to store the phrase "hello-world" as a single token, change the text to hello\-world, and use hello\-world when searching as well.
  • The underscore (_) character is not a separator, so a term like hello_world stays in one piece.
  • Repeated whitespace and repeated characters from the first rule are stripped. If you want to keep them, you must add a backslash before them.
  • Latin characters (A-Z a-z) are converted to lowercase.

Those are some principles of the TEXT field type. TAG fields follow somewhat different rules, which I will discuss in a future article.
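The TEXT rules above can be approximated in a short Python sketch. It is a rough model for building intuition, not Redisearch's actual tokenizer: `tokenize` and `escape` are hypothetical helpers, and `lower()` lowercases more than just Latin letters.

```python
PUNCTUATION = set(",.<>{}[]\"':;!@#$%^&*()-+=~")


def tokenize(text):
    """Split on separators, keep backslash-escaped characters, lowercase."""
    tokens, current = [], []
    chars = iter(text)
    for ch in chars:
        if ch == "\\":
            # The character after a backslash is kept inside the token.
            current.append(next(chars, ""))
        elif ch in PUNCTUATION or ch.isspace():
            # A separator ends the current token; repeats are skipped.
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return [token.lower() for token in tokens]


def escape(text):
    """Put a backslash before every separator so the text stays one token."""
    return "".join(
        "\\" + ch if ch in PUNCTUATION or ch == " " else ch for ch in text
    )


print(tokenize("hello-world...1"))   # ['hello', 'world', '1']
print(tokenize("Hello_World"))       # ['hello_world']
print(escape("hello-world"))         # hello\-world
print(tokenize(escape("hello-world")))  # ['hello-world']
```

A helper like `escape` would need to be applied both to the text before it is stored and to the query, since both sides go through the same tokenization.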

Highlighting Result

The Highlighting API lets us mark up the parts of the data that matched the query, for example by inserting extra characters around them so they stand out in the results.

To wrap each matched term in an HTML tag, for example <b> and </b>, we use the HIGHLIGHT option with TAGS followed by the opening and closing tags:

FT.SEARCH article "article" HIGHLIGHT TAGS <b> </b>

Matched terms in every field are wrapped in <b> and </b>. If you want to highlight only specific fields, add the FIELDS option:

FT.SEARCH article "article" HIGHLIGHT FIELDS 1 title TAGS <b> </b>

In addition, Redisearch can return only the fragments of a field that surround the matched terms instead of the whole field, using the SUMMARIZE option. For example, for a long text containing the sentence "estacks is a programming blog", searching for the word "blog" would return a fragment such as "...estacks is a programming blog...".

FT.SEARCH article "article" SUMMARIZE FIELDS 1 content

You can also combine both HIGHLIGHT and SUMMARIZE in one query.
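To make the two options concrete, here is a small Python sketch of what highlighting and summarizing do to a field value. It only imitates the visible effect on a single term; `highlight` and `summarize` are invented for this example, and the way Redisearch actually picks and sizes fragments is not modeled here.

```python
import re


def highlight(text, term, open_tag="<b>", close_tag="</b>"):
    """Wrap every whole-word occurrence of term in the given tags."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return pattern.sub(lambda m: open_tag + m.group(0) + close_tag, text)


def summarize(text, term, context=3):
    """Return the first match with `context` words on each side."""
    words = text.split()
    for i, word in enumerate(words):
        if word.lower() == term.lower():
            start = max(0, i - context)
            end = min(len(words), i + context + 1)
            return "..." + " ".join(words[start:end]) + "..."
    return ""  # no match, nothing to summarize


content = "content of article number one"
print(highlight(content, "article"))             # content of <b>article</b> number one
print(summarize(content, "article", context=1))  # ...of article number...
```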

Conclusion

Through this article, I hope you understand what Redisearch is used for, whether it is suitable or necessary for your upcoming projects, and the basic commands to get started. Keep learning new tools to have more ways to solve problems!
