Practicing Data Processing by Using Commands on the MTTQVN Ledger File

Daily short news for you
  • For over a week now, I haven't posted anything, not because I have nothing to write about, but because I'm looking for ways to deliver more valuable content in this era of rapidly exploding AI.

    As I shared earlier this year, the number of visitors to my blog is gradually declining. Looking at the statistics, the number of users in the first six months of 2025 dropped by 30% compared to the same period last year, and by 15% compared to the last six months of 2024. This reflects a reality: users are gradually leaving. What is the reason?

    I think the biggest reason is that user habits have changed. They primarily discover the blog through search engines, with Google being the largest. Almost half of the users return to the blog without going through the search step. This is a positive signal, but it's still not enough to bring in new users. Not to mention that Google has now launched the AI Search Labs feature, which means AI displays summarized content when users search, further reducing the likelihood that users visit the website. Interestingly, since Search Labs was introduced, English articles have taken over the rankings for the most accessed content.

    My articles are usually very long, sometimes reaching up to 2000 words. Writing such an article takes a lot of time. It's normal for many articles to go unread. I know and accept this because not everyone encounters the issues being discussed. For me, writing is a way to cultivate patience and thoughtfulness. Being able to help someone through my writing is a wonderful thing.

    Therefore, I am thinking of focusing on shorter and medium-length content to be able to write more. Long content will only be used when I want to write in detail or delve deeply into a particular topic. So, I am looking for ways to redesign the blog. Everyone, please stay tuned! 😄

    » Read more
  • CloudFlare has introduced the pay per crawl feature to charge for each time AI "crawls" data from your website. What does that mean 🤔?

    The purpose of SEO is to help search engines see the website. When users search for relevant content, your website appears in the search results. This is almost a win-win situation where Google helps more people discover your site, and in return, Google gets more users.

    Now, the game with AI Agents is different. AI Agents actively seek out information sources, conveniently "crawl" your data, then remix it or do something with it that we can't even know. So this is almost a game that benefits only one side 🤔!?

    CloudFlare's move is to make AI Agents pay for each time they retrieve data from your website. If they don’t pay, then I won’t let them read my data. Something like that. Let’s wait a bit longer and see 🤓.

    » Read more
  • Continuing to update on the lawsuit between the Deno group and Oracle over the JavaScript name: Deno seems to be at a disadvantage, as the court has dismissed the Deno group's complaint. However, in August, Oracle must respond to each point, admitting or denying the allegations the Deno group presented in the lawsuit.

    JavaScript™ Trademark Update

    » Read more

The Problem

Recently, the Central Committee of the Vietnam Fatherland Front (MTTQVN) uploaded 12,028 pages of donation ledgers for supporting people affected by storm No. 3. Right after that, lively discussions around the topic broke out on social media, and many people quickly built websites to look up the ledger information. Just enter something in the search box, click the button, wait a moment, and the matching records are displayed on the screen.

As we all know, the MTTQVN ledger files are in PDF format, so the servers behind these websites must have gone through a pre-processing step. The easiest approach I can think of is to extract the data into lines of structured text, load it into an SQL database, and use queries to search. That would solve the problem.
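
As a rough sketch of that approach (just an illustration, assuming the sqlite3 command-line tool is available and the CSV has a header row; the database file, table, and column names here are placeholders, not what those websites actually use):

$ sqlite3 ledger.db <<'EOF'
.mode csv
.import transactions.csv transactions
SELECT * FROM transactions WHERE content LIKE '%tran thi thuy linh%' LIMIT 10;
EOF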

After trying to search on one of those websites, it seemed that the number of visitors at the time was very high, because the server kept returning a 500 error. Hmm, for someone who loves speed, this was a terrible experience. Naturally, an idea popped into my head: why not build a search tool for this data right on my own computer? Wouldn't that be faster and more accurate?

With that thought, I started searching for the ledger file. It was stored on Google Drive. I was delighted to click the download button, but to my surprise, Google warned that this file had been downloaded too many times and that I had to wait 24 hours to download it!? By then, would I still have the enthusiasm to do it?

But surely someone had already downloaded the ledger file and uploaded it somewhere else. I searched for the file's name and finally managed to download it. However, extracting information from a PDF file is not easy, so it needs to be converted to another format such as CSV, which fits many data processing cases.
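
If you want to try the conversion yourself, one possible starting point (a hedged sketch, assuming the poppler-utils pdftotext tool is installed; ledger.pdf and ledger.txt are placeholder names) is to dump the PDF to plain text first and then reshape it into CSV:

$ pdftotext -layout ledger.pdf ledger.txt
$ head -n 20 ledger.txt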

I tried a few online conversion tools, but they didn't seem accurate enough. I was about to write a piece of code to extract the information myself when a thought struck me: maybe someone has already done it, so I don't have to waste time doing it again. And indeed, after a few searches, I found that toidicodedao had shared the CSV ledger file in a Facebook post. I sent a deep thank you and quickly downloaded it.

Alright, the data is ready. From here on, let's explore it together, not by loading it into a database and running queries, but in a simpler way: by leveraging the power of Linux commands.

However, I hope that you, the reader, already know some useful command-line tools in Linux such as cat, grep, head, tail, sort... as well as the more advanced sed and awk. You don't even need to know how to use them; knowing what they are for is enough, because you can learn how to use them later if you're interested.

Practice

First, let's try searching for a transaction by someone's name. The downloaded file is named transactions.csv. Here, I try searching for the name "tran thi thuy linh" using cat and grep.

$ cat transactions.csv | grep "tran thi thuy linh"

No results are found, simply because grep is case-sensitive by default. To ignore case, add the -i flag to grep:

$ cat transactions.csv | grep -i "tran thi thuy linh"

However, piping cat into grep is unnecessary; grep can read the file directly:

$ grep -i "tran thi thuy linh" transactions.csv

(Screenshot: grep results on the ledger file)
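
As a small aside, if you only want to know how many lines match rather than see them, grep's -c flag counts the matching lines:

$ grep -ic "tran thi thuy linh" transactions.csv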

Next, let's count the number of ledger lines:

$ awk '{count++} END {print count}' transactions.csv

However, this number includes the header line of the CSV file, so subtract 1 to get the actual number of transactions. Alternatively, you can skip the header right in the command:

$ awk 'NR > 1 {count++} END {print count}' transactions.csv
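
An equivalent way to get the same count (just an alternative sketch, skipping the header with tail and counting lines with wc):

$ tail -n +2 transactions.csv | wc -l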

awk is a powerful data processing command. It can be considered a programming language for data, because you can write your own logic inside awk to do whatever you want.
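
For example, here is a small sketch of such logic: listing only transactions of 100 million VND or more (assuming, as the commands below also do, that the second column holds the amount as a plain number):

$ awk -F, 'NR > 1 && $2 >= 100000000 {print $0}' transactions.csv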

Calculate the total donation amount:

$ awk -F, 'NR > 1 {sum += $2} END {print sum}' transactions.csv

Calculate the total amount donated on September 10, 2024:

$ grep "10/09/2024" transactions.csv | awk -F, '{sum += $2} END {print sum}'

One of the things I love about Linux commands is their "pipeline" property: the output of one command becomes the input of another. Chaining commands together creates a seamless processing flow.
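
For example, combining grep and awk in one pipeline to total all donations recorded under a particular name (reusing the same columns assumed above):

$ grep -i "tran thi thuy linh" transactions.csv | awk -F, '{sum += $2} END {print sum}'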

Sort the list of transactions by amount in descending order and keep only the first 100 lines for easier reading.

$ sort -t, -k2 -nr transactions.csv | head -n 100

Or reverse the order to see the smallest amounts first.

$ sort -t, -k2 -n transactions.csv | head -n 100

Add a column with line numbers to the ledger content.

$ awk 'NR==1 {print "\"STT\"," $0} NR>1 {print NR-1 "," $0}' transactions.csv | head -n10

Add a column with the account number at the end, where the account number is extracted from the transaction content (if present).

$ awk -F, '
BEGIN {OFS=","}
# Header line: prepend "STT" and append the new "Account Number" column
NR==1 {print "STT," $0 ",\"Account Number\""; next}
{
    # The three-argument match() is a GNU awk (gawk) extension:
    # it captures the digits following "tu" into m[1]
    match($3, /tu ([0-9]{7,})/, m)
    account_number = (RSTART > 0) ? m[1] : ""
    print NR-1, $0, account_number
}' transactions.csv
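
As a quick sanity check (a hedged follow-up, assuming as above that the third column holds the transaction description), you can count how many transactions actually contain an account number in their content:

$ awk -F, 'NR > 1 && match($3, /tu [0-9]{7,}/) {count++} END {print count + 0}' transactions.csv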

And there are many more examples. Read the documentation, or ask ChatGPT to write commands for your specific needs.

The examples above are what I used to extract a few pieces of information and satisfy my curiosity. What I want to emphasize is the power of Linux data processing commands: by combining them, we can find whatever information we want.
