How I Optimized Our KPI System

The Problem

We have a daily rake task that calculates KPIs for our Rails app. The problem is that it’s getting slower and slower as we gain more and more users. Part of the task is calculating statistics about user states, which are mainly retrieved from a table called user_activities.

Things went wild once user_activities reached 40M rows, and it’s growing faster than ever. The task couldn’t be finished within a day! That meant several tasks would be running simultaneously, the server’s load was getting high, and the CPU was sweating. In fact, it took a few days to finish one daily task, and what’s worse, the main app could easily be brought down. So we had to stop the daily task, and then I started to investigate.

Adjust the Architecture

The project we’re working on used to be a tiny app, and it was OK to put everything together. But it’s growing (and we’re happy to see that), so it was time to review the architecture. We came to the conclusion that KPI calculation shouldn’t be part of the app; it should be an independent OLAP subsystem. My colleague Raven helped to separate it from the main app.

As you can see from the diagram, other info that the OLAP system needs can be retrieved from the slave DB. As for user_activities, it isn’t saved in the master DB anymore: the main app pushes it to Redis via Resque, and the OLAP system reads it from Redis, does its calculations, and saves the results to its own DB. So there’s nothing KPI-related left in the main app.
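
Here’s a minimal sketch of that hand-off, assuming Resque’s conventional job shape (the class name, queue, payload fields, and the OlapUserActivity model are illustrative, not our actual code):

<code># Enqueued by the main app instead of writing to the master DB.
class PushUserActivity
  @queue = :user_activities

  # Runs in an OLAP-side worker: it pulls the payload from Redis
  # and persists it to the OLAP system's own DB.
  # OlapUserActivity is a hypothetical OLAP-side model.
  def self.perform(user_id, kind, created_at)
    OlapUserActivity.create!(user_id: user_id, kind: kind, created_at: created_at)
  end
end

# In the main app, wherever an activity used to be saved:
Resque.enqueue(PushUserActivity, user.id, 'login', Time.now.utc.iso8601)
</code>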

Though the main app was updated, the OLAP system still wasn’t fully working, because we hadn’t done any optimization at all. It was time to find the bottlenecks and some solutions.

Investigation & Experiments

It was obviously a DB problem, because the task mostly retrieves data from the DB and does some calculation. By using request-log-analyzer, we were able to detect slow queries like this:

<code>SELECT ua.* FROM user_activities AS ua
LEFT OUTER JOIN users AS u ON u.id = ua.user_id
WHERE (u.origin_locale = 'en-GB') AND (ua.created_at BETWEEN 't0' AND 't1');
</code>

I found several issues that could degrade performance, such as:

  • joins between huge tables
  • fuzzy matching (LIKE 'sendgrid:%:delivered')
  • lack of indexes (or unusable ones)

I spent lots of time learning about MySQL indexing by doing experiments. I gained a better understanding of MySQL, and even though some of the experiment results were surprising to me, it still didn’t help much to fix the problem.
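
For example, one thing we experimented with was a composite index covering both the join key and the range filter. A minimal sketch as a Rails migration (the migration name and column choice are illustrative):

<code>class AddIndexToUserActivities < ActiveRecord::Migration
  def change
    # MySQL can use the leftmost column (user_id) for the join,
    # then created_at for the BETWEEN range scan.
    add_index :user_activities, [:user_id, :created_at]
  end
end
</code>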

We shouldn’t expect good retrieval performance on a table that has tens of millions of rows.

Change the Calculation Algorithm

Bodhi pointed out that it was an algorithmic problem, and I started thinking about whether it was possible to sidestep the huge table.

Part of the KPI work is to calculate the number of users in each state.

Before, we did the calculation like this:

<code># Load every user and ask each one whether it is active.
count = 0
users = User.all
# User#active? queries the huge user_activities table,
# so this issues one expensive query per user.
users.each { |u| count += 1 if u.active? }
</code>

This was slow because we had to ask every user if they were active, and User#active? involved querying the huge user_activities table.

Every user has a state, so why not record it in a user_states table? Then we can get the counts easily. We’re not Facebook; we don’t have tens of millions of users, so the table is easy to maintain.
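
With such a table in place, the counting becomes a single cheap query. A minimal sketch, assuming a UserState model backed by user_states with a state column (the names are illustrative):

<code># One small GROUP BY over user_states instead of millions of
# per-user queries against user_activities.
UserState.group(:state).count
# => { "active" => 123456, "deleted" => 7890, ... } (illustrative output)
</code>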

So I wrote an after_create callback for UserActivity. It works like this: if a login activity is created, mark the user as active; if a deletion activity is created, move the user to a deleted state. I also had to write another rake task that applies the remaining rules to keep the user_states table consistent. We still rely on UserActivity to generate user states, but we completely bypass querying the user_activities table.
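
A minimal sketch of that callback; the kind attribute and the two rules shown are illustrative, and the real mapping rules were more involved:

<code>class UserActivity < ActiveRecord::Base
  belongs_to :user
  after_create :update_user_state

  private

  # Keep one row per user in user_states: create it on first sight,
  # then map the new activity to a state.
  def update_user_state
    record = UserState.where(user_id: user_id).first_or_create
    case kind
    when 'login'    then record.update_attributes(state: 'active')
    when 'deletion' then record.update_attributes(state: 'deleted')
    end
  end
end
</code>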

Testing

It’s far more complicated than I’ve described, so I had to test it again and again. The original code was well tested, which eased the pain while I was writing more tests. Before deploying to the production server, we set up a test server and compared its results with the original production system to answer the following questions:

  • Logically correct?
  • How long does it take?
  • Logged well?
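
The correctness check boiled down to comparing the two systems’ numbers for the same day. A hypothetical sketch of that comparison (legacy_counts_for and the Kpi model are illustrative names, not our actual code):

<code># Compare the new OLAP system's daily figures against the numbers
# the original production task produced for the same date.
date = Date.yesterday
old_counts = legacy_counts_for(date)
new_counts = Kpi.where(date: date).pluck(:name, :value).to_h
puts old_counts == new_counts ? 'match' : "mismatch for #{date}"
</code>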

Of course it satisfied the requirements, otherwise I wouldn’t be here writing this :)

Result

So what’s the result? The daily task, which used to take a few days, can now be finished in 10 minutes.

What I’ve learned

As a junior developer, I’d never done this kind of work before. Usually I fix bugs or build new features; I lacked experience with performance optimization.

There are several things I learned from this task:

  • make a plan (even when it is unfamiliar territory)
  • find out the bottleneck
  • optimize algorithm before optimizing lower level code
  • don’t guess, test