Recently, the Central Committee of the Vietnam Fatherland Front (MTTQVN) uploaded a 12,028-page ledger of donations to support people affected by storm No. 3. Right afterwards, lively discussions around the topic broke out on social media, and people quickly built websites to look up the ledger. Just type something into the search box, click the button, wait a moment, and the matching records appear on the screen.
As we all know, the MTTQVN ledger files are in PDF format, so the servers behind these websites must have done some pre-processing. The simplest approach I can think of is to extract the data into lines of structured text, load it into an SQL database, and then search with query statements. Problem solved.
After trying to look something up on one of those websites, it seemed the number of visitors at that moment was so high that the server kept returning a 500 error. Hmm, for someone who loves speed, that was a terrible experience. Naturally, an idea popped into my head: why not build a search tool for this data right on my own computer? Wouldn't that be faster and more accurate?
So I went looking for the ledger file. It was hosted on Google Drive. I happily clicked the download button, only for Google to warn me that the file had been downloaded too many times and that I would have to wait 24 hours!? And by then, would I still have the enthusiasm to do this?
Still, someone must have already downloaded the ledger file and re-uploaded it somewhere. I searched for the file name and finally got my copy. However, extracting information from a PDF is not easy, so it is better to convert it to another format such as CSV, which is a great fit for many data-processing tasks.
I tried a few online conversion tools, but they did not seem accurate enough. I was about to write some code to extract the data myself when I thought: maybe someone has already done it, so I don't have to waste time repeating the work. And indeed, after a few searches, I found that toidicodedao had shared the CSV version of the ledger in a Facebook post. I sent a heartfelt thank-you and quickly downloaded it.
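For the curious, had I written that extraction code myself, it might have started with something like the sketch below: pdftotext (from poppler-utils) dumps the PDF into plain text, which you then inspect before deciding how to split it into CSV columns. Note that ledger.pdf is a hypothetical file name here, and the actual parsing rules would depend entirely on the PDF's layout.
$ # ledger.pdf is a hypothetical name; replace it with the real file
$ pdftotext -layout ledger.pdf ledger.txt
$ # look at the non-empty lines first to decide how to split them into CSV columns
$ awk 'NF' ledger.txt | head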
Alright, the data is ready. From here on, let's explore it together, not by loading it into a database and running query statements, but in a simpler way: by leveraging the power of Linux commands.
However, I hope that you, the reader, already know some useful command-line tools in Linux like cat, grep, head, tail, sort… and some more advanced commands like sed and awk. You don't even need to know how to use them; just knowing what they are for is enough, because later, if you're interested, you'll learn how to use them.
First, let's try searching for a transaction by someone's name. The downloaded file is named transactions.csv. Here, I try searching for the name "tran thi thuy linh" using cat and grep.
$ cat transactions.csv | grep "tran thi thuy linh"
No results are found, simply because grep distinguishes between uppercase and lowercase letters. To ignore case, add the -i flag to grep:
$ cat transactions.csv | grep -i "tran thi thuy linh"
However, this is not the best way; the cat is unnecessary, because grep can read the file directly:
$ grep -i "tran thi thuy linh" transactions.csv
Next, let's count the number of ledger lines:
$ awk '{count++} END {print count}' transactions.csv
However, this number includes the header line of the CSV file, so subtract 1 to get the actual number of transactions. Alternatively, you can skip the header right inside the command:
$ awk 'NR > 1 {count++} END {print count}' transactions.csv
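An equivalent pipeline drops the header with tail and lets wc do the counting:
$ tail -n +2 transactions.csv | wc -l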
awk is a powerful data-processing command. It can even be considered a small programming language for data, because you can write your own logic inside awk to do whatever you want.
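For example, a one-liner can carry a condition, such as counting how many donations were 1,000,000 or more (assuming, as in the other commands in this post, that the amount sits in the second column):
$ # assumes the amount is in column 2, like the other examples here
$ awk -F, 'NR > 1 && $2 >= 1000000 {count++} END {print count}' transactions.csv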
Calculate the total amount of donations:
$ awk -F, 'NR > 1 {sum += $2} END {print sum}' transactions.csv
Calculate the total amount of donations on September 10, 2024:
$ grep "10/09/2024" transactions.csv | awk -F, '{sum += $2} END {print sum}'
One of the things I love about Linux commands is their "pipeline" nature: the output of one command becomes the input of the next. Chaining commands together creates a seamless processing flow.
Sort the list of transactions by amount in descending order and return only the first 100 lines for easy observation.
$ sort -t, -k2 -nr transactions.csv | head -n 100
Reverse the results.
$ sort -t, -k2 -n transactions.csv | head -n 100
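A small note: the CSV header line takes part in these sorts too, since sort has no notion of a header (a numeric sort treats it as 0). To leave it out, strip it with tail first, for example with the descending version:
$ tail -n +2 transactions.csv | sort -t, -k2 -nr | head -n 100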
Add a column with line numbers to the ledger content.
$ awk 'NR==1 {print "\"STT\"," $0} NR>1 {print NR-1 "," $0}' transactions.csv | head -n10
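If you only need quick, throwaway line numbers and don't mind the header being numbered as line 1, nl does a similar job with fewer keystrokes:
$ nl -w1 -s, transactions.csv | head -n10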
Add a column with the account number at the end, where the account number is extracted from the transaction content (if present).
$ awk -F, '
BEGIN {OFS=","}
# print the header row with the two extra columns, then move on to the data
NR==1 {print "STT," $0 ",\"Account Number\""; next}
{
    # the three-argument match() is a GNU awk (gawk) extension: it captures
    # the digits that follow "tu " in the transaction content into m[1]
    match($3, /tu ([0-9]{7,})/, m)
    account_number = (RSTART > 0) ? m[1] : ""
    print NR-1, $0, account_number
}' transactions.csv
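Building on that, a short pipeline can show which account numbers appear most often. It keeps the same assumptions as above: the transaction content is in column 3, the account number follows "tu ", and the three-argument match() requires gawk:
$ # gawk-specific match(); same "tu <digits>" assumption as the previous command
$ awk -F, 'NR > 1 && match($3, /tu ([0-9]{7,})/, m) {print m[1]}' transactions.csv | sort | uniq -c | sort -nr | head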
And there are many more examples. Please read the documentation or ask ChatGPT to write commands for your needs.
The examples above are the ones I used to pull out a few pieces of information and satisfy my curiosity. What I want to emphasize is the power of Linux data-processing commands: by combining them, we can dig out the information we want.