Practicing Data Processing by Using Commands on the MTTQVN Ledger File

Daily short news for you

Zed is probably the most user-centric developer community on the planet. Recently, they added an option to disable all AI features in Zed. While many others are looking to integrate deeper and do more with AI Agents. Truly a bold move 🤔

You Can Now Disable All AI Features in Zed

» Read more
Today I have tried to walk a full 8k steps in one session to show you all. As expected, the time spent walking reached over 1 hour and the distance was around 6km 🤓

Oh, in a few days it will be the end of the month, which means it will also mark one month since I started the habit of walking every day with the goal of 8k steps. At the beginning of next month, I will summarize and see how it goes.

» Read more
It's been a long time since I've read such a heartfelt article The many, many, many JavaScript runtimes of the last decade. In the article, the author recounts the journey of the development of the JavaScript runtime environment, noting that today we have and are seeing JavaScript present in many areas. Additionally, the author also touches on a lot of peripheral knowledge; just a little reading opens up so many new discoveries 🤓

» Read more

The Problem

Recently, the Central Committee of the Vietnam Fatherland Front (MTTQVN) has uploaded 12,028 pages of ledgers of donations to support people affected by storm No. 3. Right after that, many lively discussions around this topic broke out on social media. And quickly, many people created websites to look up the ledger information. Just enter any content in the search box, click the button, wait a moment, and the found data will be displayed on the screen.

As we all know, the MTTQVN ledger files are in PDF format, so the servers of these websites must have gone through a pre-processing step. The easiest way I can think of is to extract the data into lines of text with structured content, put it into a database like SQL, and use query statements to search. That's how the problem is solved.

After trying to search for information on a website, it seemed like the number of visitors at that time was very high, so the server constantly reported a 500 error. Hmm, for someone who loves speed, this was a terrible experience. Naturally, an idea popped up in my head: I will create a search tool for this data right on my computer, won't it be faster and more accurate?

I thought about it and started searching for the ledger file. It was stored on Google Drive. I was delighted to click the download button, but what a surprise when Google warned that this file had been downloaded too many times, and I had to wait 24 hours to download it!? And even then, would I still have the enthusiasm to do it?

But somehow, someone must have downloaded the ledger file, and they would upload it somewhere. I searched for the name of the ledger file, and finally, I downloaded it. However, extracting information from a PDF file is not easy, so it needs to be converted to another format like CSV. CSV is perfect in many data processing cases.

I used a few online conversion tools, but they didn't seem accurate enough. I was about to write a piece of code to extract the information when I thought: Maybe someone has done it, so I don't have to waste time doing it again. And indeed, after a few search steps, toidicodedao shared the CSV ledger file in a Facebook post. I sent a deep thank you and quickly downloaded it.

Alright, the data is complete. From now on, let's start exploring the data together. But not by putting it into a database and executing query statements, but by using a simpler way: leveraging the power of Linux commands.

However, I hope that you, the reader, already know some useful command-line tools in Linux like cat, grep, head, tail, sort... and some advanced commands like sed and awk. You don't even need to know how to use them, just knowing what they are used for is enough. Because later, if you're interested, you'll learn how to use them.

Practice

First, let's try searching for a transaction by someone's name. The downloaded file is named transactions.csv. Here, I try searching for the name "tran thi thuy linh" using cat and grep.

$ cat transactions.csv | grep "tran thi thuy linh"

No results are found, simply because grep distinguishes between uppercase and lowercase letters. To ignore the case, add the -i flag to grep:

$ cat transactions.csv | grep -i "tran thi thuy linh"

However, this is not the best way; instead, use grep directly.

$ grep -i "tran thi thuy linh" transactions.csv

Next, let's count the number of ledger lines:

$ awk '{count++} END {print count}' transactions.csv

However, this number includes the header line of the CSV file, so subtract 1 to get the actual number of transactions. Alternatively, you can process it with a command.

$ awk 'NR > 1 {count++} END {print count}' transactions.csv

awk is a powerful data processing command. It can be considered a programming language for data. Because you can write logic within awk to do what you want.

Count the total amount of donations:

$ awk -F, 'NR > 1 {sum += $2} END {print sum}' transactions.csv

Count the total amount of donations on September 10, 2024:

$ grep "10/09/2024" transactions.csv | awk -F, '{sum += $2} END {print sum}'

One of the things I love about Linux commands is their "pipeline" property, meaning the output of one command becomes the input of another. Connecting commands creates a seamless processing process.

Sort the list of transactions by amount in descending order and return only the first 100 lines for easy observation.

$ sort -t, -k2 -nr transactions.csv | head -n 100

Reverse the results.

$ sort -t, -k2 -n transactions.csv | head -n 100

Add a column with line numbers to the ledger content.

$ awk 'NR==1 {print "\"STT\"," $0} NR>1 {print NR-1 "," $0}' transactions.csv | head -n10

Add a column with the account number at the end, where the account number is extracted from the transaction content (if present).

$ awk -F, '
BEGIN {OFS=","}
NR==1 {print "STT," $0 ",\"Account Number\""; next}
{
    match($3, /tu ([0-9]{7,})/, m)
    account_number = (RSTART > 0) ? m[1] : ""
    print NR-1, $0, account_number
}' transactions.csv

And there are many more examples. Please read the documentation or ask ChatGPT to write commands for your needs.

The above examples are what I used to extract a few pieces of information to satisfy my curiosity. I want to emphasize the power of Linux data processing commands. We can combine these commands to find the information we want.

Premium

The secret stack of Blog

As a developer, are you curious about the technology secrets or the technical debts of this blog? All secrets will be revealed in the article below. What are you waiting for, click now!

View all

Practicing Data Processing by Using Commands on the MTTQVN Ledger File

The Problem

Practice

Upgrade to Premium

Premium

Premium Plus

Premium