Vietnamese Spell Checking Problem - Part 1

Daily short news for you
  • Well, I have officially launched the Store page on the 2coffee.dev blog, everyone 🥳

    It is simply a collection of products I have bought, found to be good value for their price, and that suit my needs, and I want to share them with you readers. Feel free to browse it for fun. For now I don't have much time to polish the content, so I will update it gradually. Thank you, everyone.

    » Read more
  • Over the weekend I sat down to set up the store for everyone to browse. I did this once before, though only half-heartedly, and I actually sold a book 😆

    Now I'm doing it again, and there will be more diverse products. I plan to post some products I've bought and used along with a few lines of reviews for everyone to refer to 🤓

    » Read more
  • A short article on how to write blog posts that developers read.

    How to Write Blog Posts that Developers Read

    In summary, get straight to the point and envision the audience you are targeting. Another thing is that the author has over 9 years of writing experience; in the beginning, no one read his work, but persistence helped him reach 300K - 500K readers each year. That's quite an impressive number, isn't it? 🔥

    » Read more

The Problem

As a writer, the hardest thing for me is finding the words to express the ideas I am about to write. Between forming a thought and putting my hands on the keyboard, I sometimes don't know what to type, or how to phrase it to convey what I mean. Even when I know what the idea is, writing sentences fluent enough that anyone can understand them is genuinely challenging.

Typically, I start by outlining what I want to say, then expand those ideas into paragraphs. It doesn't have to be good from the start; I just need to get my thoughts down. Whatever sentence or word comes to mind gets recorded, and only after everything is written do I go back and refine it. It can take several, even dozens of, rewrites before I am satisfied. Funnily enough, when I reread the same article a few days later, I still find it lacking and want to revise it further. But not every article gets that time, because if I kept repeating this process I would never publish anything new; being buried in my own drafts would become an endless loop.

Another problem that goes hand in hand with writing is spell checking. Spell checking is easy, isn't it? Just read the article carefully to catch the mistakes, right? If you think so, congratulations: you have very good command of the language. Unfortunately, I am not that lucky. No matter how many times I reread a piece, there is still a chance an error slipped through somewhere, and if no one points it out, I will never know. Occasionally a few spelling mistakes do surface, thanks to the random rereads I give my articles.

I have often sought help online, looking for ways to spell-check newly written Vietnamese articles. At first I found websites that accept text input for checking, but the results were disappointing, so I stopped using them. Many people suggest the default spell checker of the browser or operating system, but those proved ineffective and only added complexity, so I gave up on them too. I even sought help from large language models (LLMs), but they were overwhelmed by the amount of input text. Day by day I continued to write articles, but the mistakes remained, which left me distressed.

One day, while randomly browsing the internet, I found the project underthesea introduced as a natural language processing toolkit for Vietnamese. This is an open-source project, seemingly developed by Vietnamese programmers. It looked credible, so I took a look to see what it could do!

After a few hours of research, I learned that this library provides functions for processing Vietnamese text. For example, it can split text into separate sentences or individual words, or analyze grammatical structures... At this point, I tried to find out whether the library had any function for spell checking, but unfortunately, it does not. It seems spell checking remains a challenging problem. Just when I thought to stop there, a bold idea suddenly came to mind: if the library can segment text into individual words, and those words are then compared against a Vietnamese dictionary, what would happen? That's right! If a word is not in the Vietnamese dictionary, the likelihood that it is a spelling mistake increases.

For example, the sentence "The 9X boy from Quang Tri started a business with mushrooms," after passing through the word_tokenize function, would be segmented into ["The 9X boy", "from", "Quang Tri", "started", "a", "business", "with", "mushrooms"]. At that point, I just need to compare those words against the dictionary, and that's it.
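
The lookup idea can be sketched in a few lines of Python. The token list below stands in for the output of underthesea's word_tokenize, and the tiny dictionary is a placeholder for a real Vietnamese word list; both are illustrative only.

```python
def find_suspects(tokens, dictionary):
    """Return tokens absent from the dictionary, ignoring case and non-words."""
    suspects = []
    for token in tokens:
        if not any(ch.isalpha() for ch in token):
            continue  # skip pure numbers and punctuation
        if token.lower() not in dictionary:
            suspects.append(token)
    return suspects

# Tiny sample dictionary; the real one would hold tens of thousands of entries
dictionary = {"chàng trai", "khởi nghiệp", "với", "nấm", "quảng trị", "9x"}

# Stand-in for word_tokenize output; "chuwa" is a deliberate typo
tokens = ["Chàng trai", "9X", "Quảng Trị", "khởi nghiệp", "với", "nấm", "chuwa"]
print(find_suspects(tokens, dictionary))  # → ['chuwa']
```

Every token found in the dictionary passes silently; only the typo is reported as a suspect.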

Thinking leads to action; I immediately started experimenting with this grand idea. At the time of writing this article, I have basically done it and proven that it is effective. However, this process was not smooth. I take the opportunity to write this article for archival purposes, both for reference and to introduce it to readers because who knows, someone might come up with a better solution!

Problem Analysis

First, let's take some time to analyze the problem to be solved. The ultimate goal is to identify misspelled words in an article. So the input is the article content, and the output is a list of potentially misspelled words, which I can then verify by hand. But what counts as a spelling mistake?

There are many reasons leading to spelling errors, such as in the following cases.

  • Syllable errors. For example, you want to write the word "chưa" but end up typing "chuwa," or "không" becomes "khôong"... These mistakes can be easily recognized if you read carefully.

  • Consonant errors. For example, "đến trường đi học" might be written as "đến chường đi học," "xuất sắc" becomes "suất xắc"... These errors are somewhat more difficult to detect through regular means.

  • Tone confusion (the hỏi/ngã marks, ?/~). This is probably the most common type of mistake. For example, "chẳng lẽ" might be written as "chẵng lẽ"...

  • Dialectal errors, caused by regional speech characteristics. For example, "biết" might be written as "biếc," "sân" as "sâng"...

  • Incorrect word usage. For example, "chín muồi" might be incorrectly written as "chín mùi," "đi tham quan" becomes "đi thăm quan"... Generally, this is the hardest error to detect because it relates closely to the writer's vocabulary.

  • Additionally, there are formatting errors, for example a space before the comma in "trời đổ những hạt mưa li ti xuống mặt đất ,bỗng...", or failing to capitalize proper nouns...

Analyzing this far is enough to see that the spell-checking problem is not simple at all; many kinds of errors can occur during writing, and identifying them ranges from easy to quite difficult. In my own case, the errors involving incorrect word usage and tone confusion occur most often, and those are all hard to detect. To avoid them, the best approach is still to consult the dictionary for each word written.
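
At least the mechanical cases, such as a space placed before punctuation, can be caught with a simple regular expression. A minimal sketch covering only that one case:

```python
import re

# Flag whitespace sitting directly before a comma or period, e.g. "mặt đất ,bỗng"
BAD_SPACING = re.compile(r"\s+([,.])")

def find_spacing_errors(text):
    """Return (position, fragment) pairs where whitespace precedes punctuation."""
    return [(m.start(), m.group(0)) for m in BAD_SPACING.finditer(text)]

text = "trời đổ những hạt mưa li ti xuống mặt đất ,bỗng..."
print(find_spacing_errors(text))  # one match: the stray " ," fragment
```

Rules like this are cheap to add, but they only cover formatting mistakes; the dictionary lookup is still needed for the harder error types above.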

Therefore, after finding the project underthesea, I had the idea to create a spell-checking tool based on the dictionary without hesitation.

Thoughts on Implementation

Speaking about the implementation, I have envisioned several directions for developing this spell-checking tool from the beginning.

First, underthesea is written in Python, so I need to know Python to use it. Although Python is known as one of the easiest programming languages to learn, I haven't had much exposure to it, so I might need some time to pick up the basics. The undeniable truth is that technology today has advanced far beyond where it was. With the help of generative AI tools, language barriers have been considerably reduced. What programmers should focus on now is thinking rather than language syntax; AI is a powerful right hand that helps with the rest.

Regarding the architecture, the idea is to build a "core" in Python using underthesea, exposing the necessary APIs. The surrounding processing, such as automatic error correction or interaction with data, will be written in a more familiar language like JavaScript. This architecture makes the application faster and easier to deploy. Imagine the "core" as a server, while JavaScript is the client communicating with that Python server.

The Vietnamese dictionary is vast and rich; it is essential to find a dictionary that is as complete and detailed as possible. After browsing GitHub, I found a few repositories with these dictionaries, but upon testing, I realized many words were still missing. While struggling, I discovered that the underthesea library contains a dictionary file with over 74k words. Wow, that’s wonderful. What are we waiting for?

One more limitation: the word_tokenize function, which segments sentences into meaningful words, does not always segment as expected. It can produce fragments that make no sense on their own but would be meaningful if joined with an adjacent word. Fortunately, word_tokenize supports a fixed_words parameter to remedy this: list the phrases that should not be split apart in the fixed_words array, and the tool will segment according to those defined phrases.
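
How fixed_words influences segmentation is internal to underthesea, but the effect can be illustrated with a pure-Python sketch that greedily merges consecutive tokens matching a listed phrase; the phrase list here is a made-up example.

```python
def merge_fixed_words(tokens, fixed_words):
    """Greedily merge consecutive tokens that form a phrase listed in fixed_words."""
    phrases = {tuple(p.split()) for p in fixed_words}
    longest = max((len(p) for p in phrases), default=1)
    merged, i = [], 0
    while i < len(tokens):
        # Try the longest possible phrase first, down to two tokens
        for size in range(min(longest, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + size])
            if candidate in phrases:
                merged.append(" ".join(candidate))
                i += size
                break
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = ["khởi", "nghiệp", "với", "nấm"]
print(merge_fixed_words(tokens, ["khởi nghiệp"]))  # → ['khởi nghiệp', 'với', 'nấm']
```

Without the fixed phrase, "khởi" and "nghiệp" would be looked up separately and both flagged; merged, they match a single dictionary entry.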

Ultimately, the spell-checking "core" is a Python file that receives data through a pipe and outputs a list of potentially misspelled words. For example:

echo "Con ngựa đá con ngựa đá" | python3 index.py

Then it can be combined with the cat command to take the article content as input.

cat articles/bai-viet.md | python3 index.py  
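
A minimal stand-in for such an index.py might look like the following sketch. The real core would call underthesea's word_tokenize; here a naive whitespace split keeps the example self-contained, and the tiny dictionary set is a placeholder for the real word list.

```python
import sys

def check_text(text, dictionary):
    """Return words from text that do not appear in the dictionary."""
    words = text.split()  # placeholder for underthesea's word_tokenize
    return [w for w in words
            if any(c.isalpha() for c in w)            # skip numbers/punctuation
            and w.lower().strip(",.!?") not in dictionary]

if __name__ == "__main__":
    # Placeholder dictionary; the real one would be loaded from a data file
    dictionary = {"con", "ngựa", "đá"}
    for word in check_text(sys.stdin.read(), dictionary):
        print(word)
```

Piped the sentence "Con ngựa đá con ngựa đá", this version prints nothing, since every word is in the (tiny) dictionary; a typo such as "chuwa" would be printed as a suspect.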

This "core" needs to learn new words and meaningful phrases over time, so the solution is to create two files: one containing new words and one containing meaningful phrases. On startup, the "core" loads both files and uses their contents. New entries can be added to these files either manually or through some automatic tool integrated into the CLI article management application of 2coffee.dev.
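
Loading those two files could be as simple as the following sketch; the file names new_words.txt and fixed_phrases.txt are hypothetical, with one entry per line.

```python
from pathlib import Path

def load_entries(path):
    """Read one entry per line into a set, skipping blank lines."""
    p = Path(path)
    if not p.exists():
        return set()
    return {line.strip().lower()
            for line in p.read_text(encoding="utf-8").splitlines()
            if line.strip()}

# Hypothetical data files maintained alongside the core
extra_words = load_entries("new_words.txt")        # merged into the dictionary
fixed_phrases = load_entries("fixed_phrases.txt")  # passed as fixed_words
```

Keeping the data in plain text files means both a human and a CLI tool can append entries without touching the core itself.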

Conclusion

Thus, we have clearly defined our needs, analyzed common errors, and outlined a proposed approach. In the next part, I will present more detailed implementation steps. Stay tuned!
