Vietnamese Spell Checking Problem - Part 1

Daily short news for you
  • Well, I have officially launched the Store page on the 2coffee.dev blog, everyone 🥳

    It is simply a collection of products I have bought, found to be good value for their price, and that suit my needs, and I want to share them with you readers. Feel free to browse it for fun. For now I don't have much time to polish the content, so I will update it gradually. Thank you, everyone.

    » Read more
  • Over the weekend I sat down to set up the store for everyone to browse. I did this once before, though only half-heartedly, and I actually sold a book 😆

    Now I'm doing it again, and there will be more diverse products. I plan to post some products I've bought and used along with a few lines of reviews for everyone to refer to 🤓

    » Read more
  • A short article on how to write blog posts that developers read.

    How to Write Blog Posts that Developers Read

    In summary, get straight to the point and envision the audience you are targeting. Another thing is that the author has over 9 years of writing experience; in the beginning, no one read his work, but persistence helped him reach 300K - 500K readers each year. That's quite an impressive number, isn't it? 🔥

    » Read more

The Problem

As a writer, the hardest thing for me is finding the words to express the ideas I am about to write. Between forming a thought and putting my hands on the keyboard, I sometimes don't know what to type, or how to phrase it to convey what I mean. Even when I know what the idea is, writing sentences fluent enough that anyone can understand them is genuinely challenging.

Typically, I start by outlining what I want to say, then expand those ideas into paragraphs. It doesn't have to be good from the start; I just need to get my thoughts down. Whatever sentence or word comes to mind gets recorded, and only after everything is written do I go back and refine it. It can take several, even dozens of, rewrites before I am satisfied. Funnily enough, when I reread the same article a few days later, I still find it lacking and want to revise it further. But not every article gets that time, because if I kept repeating this process I would never publish anything new; being buried in my own drafts would become an endless loop.

Another problem that goes hand in hand with writing is spell checking. Spell checking is easy, isn't it? Just read the article carefully to catch the mistakes, right? If you think so, congratulations: you have very good command of the language. Unfortunately, I am not that lucky. No matter how many times I reread a piece, there is still a chance an error slipped through somewhere, and if no one points it out, I will never know. Occasionally a few spelling mistakes do surface, thanks to the random rereads I give my articles.

I have often sought help online, looking for ways to spell-check newly written Vietnamese articles. At first I found websites that accept text input for checking, but the results were disappointing, so I stopped using them. Many people suggest the default spell checker of the browser or operating system, but those proved ineffective and only added complexity, so I gave up on them too. I even sought help from large language models (LLMs), but they were overwhelmed by the amount of input text. Day by day I continued to write articles, but the mistakes remained, which left me distressed.

One day, while randomly browsing the internet, I found the project underthesea introduced as a natural language processing toolkit for Vietnamese. This is an open-source project, seemingly developed by Vietnamese programmers. It looked credible, so I took a look to see what it could do!

After a few hours of research, I learned that this library provides functions for processing Vietnamese text. For example, it can split text into separate sentences or individual words, or analyze grammatical structures... At this point, I tried to find out whether the library had any function for spell checking, but unfortunately, it does not. It seems spell checking remains a challenging problem. Just when I thought to stop there, a bold idea suddenly came to mind: if the library can segment text into individual words, and those words are then compared against a Vietnamese dictionary, what would happen? That's right! If a word is not in the Vietnamese dictionary, the likelihood that it is a spelling mistake increases.

For example, the sentence "The 9X boy from Quang Tri started a business with mushrooms," after passing through the word_tokenize function, would be segmented into ["The 9X boy", "from", "Quang Tri", "started", "a", "business", "with", "mushrooms"]. At that point, I just need to compare those words against the dictionary, and that's it.
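
The lookup idea can be sketched in a few lines of Python. The token list below stands in for the output of underthesea's word_tokenize, and the tiny dictionary is a placeholder for a real Vietnamese word list; both are illustrative only.

```python
def find_suspects(tokens, dictionary):
    """Return tokens absent from the dictionary, ignoring case and non-words."""
    suspects = []
    for token in tokens:
        if not any(ch.isalpha() for ch in token):
            continue  # skip pure numbers and punctuation
        if token.lower() not in dictionary:
            suspects.append(token)
    return suspects

# Tiny sample dictionary; the real one would hold tens of thousands of entries
dictionary = {"chàng trai", "khởi nghiệp", "với", "nấm", "quảng trị", "9x"}

# Stand-in for word_tokenize output; "chuwa" is a deliberate typo
tokens = ["Chàng trai", "9X", "Quảng Trị", "khởi nghiệp", "với", "nấm", "chuwa"]
print(find_suspects(tokens, dictionary))  # → ['chuwa']
```

Every token found in the dictionary passes silently; only the typo is reported as a suspect.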

Thinking leads to action; I immediately started experimenting with this grand idea. At the time of writing this article, I have basically done it and proven that it is effective. However, this process was not smooth. I take the opportunity to write this article for archival purposes, both for reference and to introduce it to readers because who knows, someone might come up with a better solution!

Problem Analysis

First, let's take some time to analyze the problem to be solved. The ultimate goal is to identify misspelled words in an article. So the input is the article content, and the output is a list of potentially misspelled words, which I can then verify by hand. But what counts as a spelling mistake?

There are many reasons leading to spelling errors, such as in the following cases.

  • Syllable errors. For example, you want to write the word "chưa" but end up typing "chuwa," or "không" becomes "khôong"... These mistakes can be easily recognized if you read carefully.

  • Consonant errors. For example, "đến trường đi học" might be written as "đến chường đi học," "xuất sắc" becomes "suất xắc"... These errors are somewhat more difficult to detect through regular means.

  • Tone confusion (the hỏi/ngã marks, ?/~). This is probably the most common type of mistake. For example, "chẳng lẽ" might be written as "chẵng lẽ"...

  • Dialectal errors, caused by regional speech characteristics. For example, "biết" might be written as "biếc," "sân" as "sâng"...

  • Incorrect word usage. For example, "chín muồi" might be incorrectly written as "chín mùi," "đi tham quan" becomes "đi thăm quan"... Generally, this is the hardest error to detect because it relates closely to the writer's vocabulary.

  • Additionally, there are formatting errors, for example a space before the comma in "trời đổ những hạt mưa li ti xuống mặt đất ,bỗng...", or failing to capitalize proper nouns...

Analyzing this far is enough to see that the spell-checking problem is not simple at all; many kinds of errors can occur during writing, and identifying them ranges from easy to quite difficult. In my own case, the errors involving incorrect word usage and tone confusion occur most often, and those are all hard to detect. To avoid them, the best approach is still to consult the dictionary for each word written.
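
At least the mechanical cases, such as a space placed before punctuation, can be caught with a simple regular expression. A minimal sketch covering only that one case:

```python
import re

# Flag whitespace sitting directly before a comma or period, e.g. "mặt đất ,bỗng"
BAD_SPACING = re.compile(r"\s+([,.])")

def find_spacing_errors(text):
    """Return (position, fragment) pairs where whitespace precedes punctuation."""
    return [(m.start(), m.group(0)) for m in BAD_SPACING.finditer(text)]

text = "trời đổ những hạt mưa li ti xuống mặt đất ,bỗng..."
print(find_spacing_errors(text))  # one match: the stray " ," fragment
```

Rules like this are cheap to add, but they only cover formatting mistakes; the dictionary lookup is still needed for the harder error types above.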

Therefore, after finding the project underthesea, I had the idea to create a spell-checking tool based on the dictionary without hesitation.

Thoughts on Implementation

Speaking about the implementation, I have envisioned several directions for developing this spell-checking tool from the beginning.

First, underthesea is written in Python, so I need to know Python to use it. Although Python is known as one of the easiest programming languages to learn, I haven't had much exposure to it, so I might need some time to pick up the basics. The undeniable truth is that technology today has advanced far beyond where it was. With the help of generative AI tools, language barriers have been considerably reduced. What programmers should focus on now is thinking rather than language syntax; AI is a powerful right hand that helps with the rest.

Regarding the architecture, the idea is to build a "core" in Python using underthesea, exposing the necessary APIs. The surrounding processing, such as automatic error correction or interaction with data, will be written in a more familiar language like JavaScript. This architecture makes the application faster and easier to deploy. Imagine the "core" as a server, while JavaScript is the client communicating with that Python server.

The Vietnamese dictionary is vast and rich; it is essential to find a dictionary that is as complete and detailed as possible. After browsing GitHub, I found a few repositories with these dictionaries, but upon testing, I realized many words were still missing. While struggling, I discovered that the underthesea library contains a dictionary file with over 74k words. Wow, that’s wonderful. What are we waiting for?

One more limitation: the word_tokenize function, which segments sentences into meaningful words, does not always segment as expected. It can produce fragments that make no sense on their own but would be meaningful if joined with an adjacent word. Fortunately, word_tokenize supports a fixed_words parameter to remedy this: list the phrases that should not be split apart in the fixed_words array, and the tool will segment according to those defined phrases.
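
How fixed_words influences segmentation is internal to underthesea, but the effect can be illustrated with a pure-Python sketch that greedily merges consecutive tokens matching a listed phrase; the phrase list here is a made-up example.

```python
def merge_fixed_words(tokens, fixed_words):
    """Greedily merge consecutive tokens that form a phrase listed in fixed_words."""
    phrases = {tuple(p.split()) for p in fixed_words}
    longest = max((len(p) for p in phrases), default=1)
    merged, i = [], 0
    while i < len(tokens):
        # Try the longest possible phrase first, down to two tokens
        for size in range(min(longest, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + size])
            if candidate in phrases:
                merged.append(" ".join(candidate))
                i += size
                break
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = ["khởi", "nghiệp", "với", "nấm"]
print(merge_fixed_words(tokens, ["khởi nghiệp"]))  # → ['khởi nghiệp', 'với', 'nấm']
```

Without the fixed phrase, "khởi" and "nghiệp" would be looked up separately and both flagged; merged, they match a single dictionary entry.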

Ultimately, the spell-checking "core" is a Python file that receives data through a pipe and outputs a list of potentially misspelled words. For example:

echo "Con ngựa đá con ngựa đá" | python3 index.py

Then it can be combined with the cat command to take the article content as input.

cat articles/bai-viet.md | python3 index.py  
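
A minimal stand-in for such an index.py might look like the following sketch. The real core would call underthesea's word_tokenize; here a naive whitespace split keeps the example self-contained, and the tiny dictionary set is a placeholder for the real word list.

```python
import sys

def check_text(text, dictionary):
    """Return words from text that do not appear in the dictionary."""
    words = text.split()  # placeholder for underthesea's word_tokenize
    return [w for w in words
            if any(c.isalpha() for c in w)            # skip numbers/punctuation
            and w.lower().strip(",.!?") not in dictionary]

if __name__ == "__main__":
    # Placeholder dictionary; the real one would be loaded from a data file
    dictionary = {"con", "ngựa", "đá"}
    for word in check_text(sys.stdin.read(), dictionary):
        print(word)
```

Piped the sentence "Con ngựa đá con ngựa đá", this version prints nothing, since every word is in the (tiny) dictionary; a typo such as "chuwa" would be printed as a suspect.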

This "core" needs to learn new words and meaningful phrases over time, so the solution is to create two files: one containing new words and one containing meaningful phrases. On startup, the "core" loads both files and uses their contents. New entries can be added to these files either manually or through some automatic tool integrated into the CLI article management application of 2coffee.dev.
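
Loading those two files could be as simple as the following sketch; the file names new_words.txt and fixed_phrases.txt are hypothetical, with one entry per line.

```python
from pathlib import Path

def load_entries(path):
    """Read one entry per line into a set, skipping blank lines."""
    p = Path(path)
    if not p.exists():
        return set()
    return {line.strip().lower()
            for line in p.read_text(encoding="utf-8").splitlines()
            if line.strip()}

# Hypothetical data files maintained alongside the core
extra_words = load_entries("new_words.txt")        # merged into the dictionary
fixed_phrases = load_entries("fixed_phrases.txt")  # passed as fixed_words
```

Keeping the data in plain text files means both a human and a CLI tool can append entries without touching the core itself.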

Conclusion

Thus, we have clearly defined our needs, analyzed common errors, and outlined a proposed approach. In the next part, I will present more detailed implementation steps. Stay tuned!
