Practicing Data Processing by Using Commands on the MTTQVN Ledger File

Daily short news for you
  • For a long time, I have been thinking about how to increase the blog's brand presence and grow its user base. After much contemplation, it seemed the only way was to share on social media or hope people would seek it out, until...

    Wearing this shirt means no more worries about traffic jams, the more crowded it gets, the more fun it is because hundreds of eyes are watching 🤓

    (It really works, you know 🤭)

  • The development cycle of many projects is quite interesting. Summarized in 3 steps: see something complex -> simplify it -> add features until it becomes complex again... -> back to a new loop.

    Why is that? Let me give you 2 examples to illustrate.

    Markdown was created with the aim of producing a plain-text format that is "easy to write, easy to read, and easy to convert into something like HTML." At the time, no one had the patience to write while also adding markup for how the text would display on the web. Yet now, people keep "stuffing" markdown, or creating variations of it, with so many new formats that... they can't even remember all the syntax.

    React is another example. Since the days of PHP, people have wanted something that clearly separates the user interface from the core application logic, two distinct parts that are easier to read and write. As a result, UI/UX libraries have developed very robustly and provide excellent user interaction, while the application logic lives on a separate server. The Front-end and Back-end duo emerged from this, with the REST API as the indispensable waiter between them. Yet now React doesn't look much different from PHP, and Vue, Svelte... are all converging back to the same point.

    However, the loop is not bad; on the contrary, this loop is more about evolution than "regression." Sometimes, it creates something good from something old, and people rely on that goodness to continue the loop. In other words, it’s about distilling the essence little by little 😁

  • Alongside the official projects, I occasionally see "side" projects aimed at optimizing or improving the language in some aspects. For example, nature-lang/nature is a project focused on enhancing Go, introducing some changes to make using Go more user-friendly.

    Looking back, it resembles JavaScript quite a bit 😆


The Problem

Recently, the Central Committee of the Vietnam Fatherland Front (MTTQVN) uploaded 12,028 pages of ledgers recording donations to support people affected by Storm No. 3 (Yagi). Right after that, lively discussions on the topic broke out across social media, and very quickly people built websites to look up the ledger information: enter anything in the search box, click the button, wait a moment, and the matching records appear on screen.

As we all know, the MTTQVN ledger files are in PDF format, so the servers behind these websites must have gone through a pre-processing step. The easiest approach I can think of is to extract the data into lines of text with structured content, load it into an SQL database, and search with query statements. Problem solved.

When I tried searching on one of these websites, the number of visitors at the time seemed so high that the server constantly returned a 500 error. Hmm, for someone who loves speed, this was a terrible experience. Naturally, an idea popped into my head: why not build a search tool for this data right on my own computer? Wouldn't that be faster and more accurate?

With that thought, I started searching for the ledger file. It was stored on Google Drive. I was delighted to click the download button, but to my surprise Google warned that the file had been downloaded too many times and that I would have to wait 24 hours to download it!? And by then, would I still have the enthusiasm to do this?

Surely, though, someone must have already downloaded the ledger file and re-uploaded it somewhere. I searched for the file's name and finally managed to download it. However, extracting information from a PDF file is not easy, so it needed to be converted to another format like CSV, which is a much better fit for this kind of line-by-line data processing.

I tried a few online conversion tools, but they didn't seem accurate enough. I was about to write some code to extract the information myself when I thought: maybe someone has already done it, so I don't have to waste time doing it again. And indeed, after a few searches, I found that toidicodedao had shared the CSV ledger file in a Facebook post. I sent a heartfelt thank you and quickly downloaded it.

Alright, the data is ready. From here on, let's explore it together. Not by putting it into a database and executing query statements, but in a simpler way: by leveraging the power of Linux commands.

However, I do hope that you, the reader, already know of some useful command-line tools in Linux like cat, grep, head, tail, sort... and some more advanced ones like sed and awk. You don't even need to know how to use them; knowing what they are for is enough, because you can always learn the details later if you're interested.

Practice

First, let's try searching for a transaction by someone's name. The downloaded file is named transactions.csv. Here, I try searching for the name "tran thi thuy linh" using cat and grep.

$ cat transactions.csv | grep "tran thi thuy linh"

No results are found, simply because grep distinguishes between uppercase and lowercase letters. To ignore the case, add the -i flag to grep:

$ cat transactions.csv | grep -i "tran thi thuy linh"

However, this is not the best way; piping cat into grep is unnecessary, since grep can read the file directly:

$ grep -i "tran thi thuy linh" transactions.csv

[Image: grep results on the ledger file]
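
If you only want to know how many transactions match rather than see them all, grep's -c flag counts the matching lines:

$ grep -ic "tran thi thuy linh" transactions.csv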

Next, let's count the number of ledger lines:

$ awk '{count++} END {print count}' transactions.csv

However, this number includes the header line of the CSV file, so subtract 1 to get the actual number of transactions. Alternatively, you can skip the header right inside the command:

$ awk 'NR > 1 {count++} END {print count}' transactions.csv
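
If all you need is a line count, wc -l is even shorter; stripping the header with tail first gives the transaction count directly:

$ tail -n +2 transactions.csv | wc -l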

awk is a powerful data processing command; it can be considered a programming language for data, because you can write logic inside awk to do whatever you want.
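
As a small sketch of that: counting how many transactions are at or above one million VND, assuming (as in the examples below) that the amount is in the second column:

$ awk -F, 'NR > 1 && $2 >= 1000000 {count++} END {print count}' transactions.csv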

Calculate the total amount donated:

$ awk -F, 'NR > 1 {sum += $2} END {print sum}' transactions.csv
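
Along the same lines, the average donation per transaction:

$ awk -F, 'NR > 1 {sum += $2; count++} END {print sum / count}' transactions.csv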

Calculate the total amount donated on September 10, 2024:

$ grep "10/09/2024" transactions.csv | awk -F, '{sum += $2} END {print sum}'

One of the things I love about Linux commands is their "pipeline" property: the output of one command becomes the input of the next, so chaining commands creates a seamless processing flow.
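
To see that in action, here is a longer pipeline sketch that totals donations per day and ranks the days, again assuming the date and amount are in the first and second columns:

$ awk -F, 'NR > 1 {sum[$1] += $2} END {for (d in sum) print d "," sum[d]}' transactions.csv | sort -t, -k2 -nr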

Sort the list of transactions by amount in descending order and return only the first 100 lines for easy observation.

$ sort -t, -k2 -nr transactions.csv | head -n 100
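
One caveat: the header line gets sorted along with the data and may show up in the output. Stripping it with tail first avoids that:

$ tail -n +2 transactions.csv | sort -t, -k2 -nr | head -n 100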

Or reverse the order, sorting ascending to see the smallest amounts:

$ sort -t, -k2 -n transactions.csv | head -n 100

Add an ordinal-number column ("STT") with the line number to the ledger content.

$ awk 'NR==1 {print "\"STT\"," $0} NR>1 {print NR-1 "," $0}' transactions.csv | head -n10
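
If you don't need the quoted "STT" header, nl can do rough numbering too (though it numbers the header line as well); -s sets the separator and -w the number width:

$ nl -s, -w1 transactions.csv | head -n10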

Add a column at the end containing the account number, where the account number is extracted from the transaction content (if present).

$ awk -F, '
BEGIN {OFS=","}
# Header line: prepend the STT column and append the new one
NR==1 {print "STT," $0 ",\"Account Number\""; next}
{
    # The three-argument match() is a GNU awk (gawk) extension;
    # it captures the digits following "tu" into m[1]
    match($3, /tu ([0-9]{7,})/, m)
    account_number = (RSTART > 0) ? m[1] : ""
    print NR-1, $0, account_number
}' transactions.csv

And there are many more examples like these. Read the documentation, or ask ChatGPT to write commands for your needs.

The above examples are the ones I used to extract a few pieces of information and satisfy my curiosity. What I want to emphasize is the power of Linux's data processing commands: by combining them, we can find almost any information we want.
