Batch Processing

The Problem

Writing about certain topics can be quite challenging. Not because they are difficult, but because it's hard to present them in a structured, easy-to-understand way. Ever since I came across the saying "You only truly understand a topic if you can explain it to others," I've treated it as a commandment in my mission to convey content through writing. Many topics have sat unfinished in my drafts simply because I couldn't reach that bar. But it's okay; I'm sure that, little by little, I'll "truly understand the topic" eventually.

Batch processing is one such topic: a bit broad and somewhat abstract. But I'm sure that once you grasp what it is, you'll handle information more efficiently in any system you work on. So I hope that through this article, you'll understand the benefits of batch processing and how it can improve your system's performance.

Let's start with a simple story.

Loops are a fundamental concept in programming languages. Any list can be looped through to "iterate" over each item. Sometimes you don't even need a list; you simply loop a certain number of times to repeat a task. For example, calculating the sum of the numbers from 1 to 99.

let sum = 0;
for (let i = 1; i <= 99; i++) {
    sum = sum + i;
}

This is a classic problem, and I'm sure anyone who has just started programming can solve it. Perhaps that's why sequential processing has become so familiar and sometimes even a habit when solving problems.

Looping while i is less than or equal to 99 means performing the task 99 times. That's nothing for a modern CPU, which can perform billions of operations per second. But 99 is just an example, and sum is just a simple addition. Think about real systems: how complex is the logic inside your loops?

Here's a code snippet that calculates the sum without looping.

let n = 99;
let sum = (n * (n + 1)) / 2;

Everything is solved in a single statement. A batch of 99 additions is condensed into one calculation, which is clearly more efficient than looping 99 times. In essence, batch processing reduces how often you loop, or increases how much work is done at once, to improve application performance.

Let's go back to a more "realistic" problem. Suppose you receive a user's id. Your task is to retrieve the user's articles and comments, and to count their total number of comments. Fortunately, the data you need is already fetched by separate functions, which are basically three queries against three different tables.

async function getArticles(userId) {
    ...
}

async function getComments(userId) {
    ...
}

async function countComments(userId) {
    ...
}

Normally, you would write:

const userId = body.userId;
const articles = await getArticles(userId);
const comments = await getComments(userId);
const numComments = await countComments(userId);

This is a bit expensive: numComments has to wait for comments, and comments has to wait for articles, even though the three functions are completely independent. You might notice that Promise has an all method to execute multiple Promises concurrently:

const userData = await Promise.all([
    getArticles(userId),
    getComments(userId),
    countComments(userId),
]);

Now, userData[0] contains the articles, userData[1] contains the comments, and userData[2] contains the comment count.

The indexes 0, 1, 2… are quite vague and carry no information. Instead, I prefer to use Bluebird to make the result clearer.

const userData = await Bluebird.Promise.props({
    articles: getArticles(userId),
    comments: getComments(userId),
    numComments: countComments(userId),
});

Now, userData.articles contains the articles, userData.comments contains the comments, and userData.numComments contains the comment count.
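As a side note, plain array destructuring can also give the results names without pulling in an extra library, as long as the order matches:

const [articles, comments, numComments] = await Promise.all([
    getArticles(userId),
    getComments(userId),
    countComments(userId),
]);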

That's how you process a batch of Promises!

Background Jobs

Background jobs are tasks processed in the background, partly for performance and partly because not everything needs to happen immediately. For example, jobs that aggregate data at the end of each day are usually run in the background.

Suppose your job aggregates some statistics every day at midnight, inserting or updating records in the database. Following the usual approach, you would retrieve the list of users, count the comments of each user, and then create a record in the database:

for (const user of users) {
    const numComments = await countComments(user.id);
    await insertCountComments(user.id, numComments);
}

With each insertCountComments call, you add one record to the database. If you have 1 million users, that's 1 million separate insert operations. Instead, collect the records and insert them in a single batch, known as a "bulk insert", which is proven to be more efficient than inserting rows one by one.

const records = [];
for (const user of users) {
    const numComments = await countComments(user.id);
    records.push({ userId: user.id, numComments });
}

await insertCountComments(records);

That's how you insert records into the database in bulk!
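For illustration, here's what insertCountComments might look like as a bulk insert. This is only a sketch: it assumes node-postgres (pg) and a user_comment_counts table, neither of which comes from the original example.

const { Pool } = require("pg");
const pool = new Pool();

// One parameterized INSERT with a VALUES row per record,
// instead of one round-trip per record.
async function insertCountComments(records) {
    if (records.length === 0) return;
    const values = [];
    const placeholders = records.map((record, i) => {
        values.push(record.userId, record.numComments);
        return `($${2 * i + 1}, $${2 * i + 2})`;
    });
    await pool.query(
        `INSERT INTO user_comment_counts (user_id, num_comments)
         VALUES ${placeholders.join(", ")}`,
        values
    );
}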

Message Queues

When working with message queues, a Producer continually sends messages to a queue, and the queue pushes each message to the connected Consumers. A production-consumption cycle forms, like a modern assembly line.

What happens when the messages are nearly identical in content and structure? It means the processing for each message is nearly the same. For each message, we "poke" the database to fetch the information it needs; but if we group the messages together, we can significantly reduce the number of database queries.
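As a sketch of that idea, here's how a group of messages could share a single lookup; getUsersByIds and processMessage are hypothetical helpers, and the userId field on each message is an assumption:

// Handle a group of messages with one database query instead of N.
async function handleBatch(messages) {
    const ids = [...new Set(messages.map((m) => m.userId))];
    const users = await getUsersByIds(ids); // one query for the whole batch
    const usersById = new Map(users.map((u) => [u.id, u]));
    for (const message of messages) {
        await processMessage(message, usersById.get(message.userId));
    }
}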

Depending on the tool you're using, it may or may not support sending messages in batches. RabbitMQ, for example, allows publishing a batch of messages at once, which can increase throughput by tens of times compared to publishing them one by one.
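Exactly how batching looks depends on the client library. As one possible sketch with Node's amqplib, you can publish a run of messages on a confirm channel and then wait for a single confirmation covering the whole batch:

const amqp = require("amqplib");

// Publish many messages, then wait once for broker confirmation.
async function publishBatch(messages) {
    const connection = await amqp.connect("amqp://localhost");
    const channel = await connection.createConfirmChannel();
    await channel.assertQueue("my-queue");
    for (const message of messages) {
        channel.sendToQueue("my-queue", Buffer.from(JSON.stringify(message)));
    }
    await channel.waitForConfirms(); // one wait for the entire batch
    await connection.close();
}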

When working with Cloudflare Queues, Cloudflare's take on a message queue, you can configure batching so that messages accumulate until a certain count is reached before they are pushed to the Consumer. For example, messages can keep arriving one by one until there are 100 of them, and then all 100 are delivered to the Consumer as a single batch.

Once the batch of messages is pushed to the Consumer, you can apply batch processing again if the conditions permit.
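Here's a minimal sketch of what receiving such a batch looks like in a Worker; the userId field on the message body is an assumption:

// A Cloudflare Workers queue consumer receives messages in batches.
export default {
    async queue(batch, env) {
        // batch.messages holds up to max_batch_size messages at once,
        // so the whole group can be processed together.
        const ids = batch.messages.map((message) => message.body.userId);
        // ...e.g. look all of them up with a single query...
        for (const message of batch.messages) {
            message.ack(); // acknowledge each message once handled
        }
    },
};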

Finally, there's another case where batch processing can be useful!

I/O Operations

I/O operations are asynchronous by nature and can fail at any time, since they depend on external factors like network speed and hardware performance. If you're not careful, you can run into plenty of errors during processing.

Instead of repeatedly calling a function that performs I/O, try reducing the call frequency to reduce the load. Take logging to a file: to write a single line, you have to open the file, write, and close it. Instead, collect a batch of lines and perform the open-write-close cycle once for the whole batch.
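A minimal sketch of that idea in Node.js, assuming a simple in-memory buffer flushed on an interval (the file name and interval are arbitrary):

const fs = require("fs/promises");

// Buffer log lines in memory instead of writing each one immediately.
const buffer = [];

function log(line) {
    buffer.push(line);
}

async function flush() {
    if (buffer.length === 0) return;
    const lines = buffer.splice(0, buffer.length); // drain the buffer
    await fs.appendFile("app.log", lines.join("\n") + "\n");
}

// One open-write-close every second covers every line logged since.
setInterval(flush, 1000);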

Conclusion

The above scenarios are where I commonly apply batch processing. However, there are many more cases where batch processing can improve system performance. Do you know of any other ways to apply batch processing? Please leave a comment below!
