Hello everyone,
Guess what? Here's the much-awaited (I hope so) technical post.
At LeetCode we run weekly contests, during which we see the week's peak traffic. I mentioned in an earlier blog post how one contest went wrong.
Here’s the crux of it:
The servers couldn't handle the traffic. MySQL latency skyrocketed, so all the uWSGI processes became I/O blocked, waiting on responses from MySQL. Since uWSGI stopped responding, the load balancer's failover kicked in and cut off traffic from one of the Kubernetes clusters, which made things even worse. It was essentially a deadlock, which we had to clear manually.
We have New Relic APM configured, which lets us monitor application performance and look for bottlenecks, but in the production environment there were just too many variables to narrow down the issue. Moreover, since the contest runs weekly, it isn't efficient to wait and watch how our changes affect performance once traffic rises many-fold.
So we needed a load testing framework that could help us find the root cause. After checking out a couple of options, I decided to give Locust a try, since it seemed simple to start with and scalable too. Overall, I really loved the simplicity Locust provided. I modified our application code to support privileged users who can bypass certain security measures, so they could be used for load testing. Then I defined the user behaviour using Locust's APIs and ran a test. I got quite good throughput from Locust on a single instance, but not enough to simulate contest-level traffic. So it was time to scale Locust up.
I packaged my locustfile in a Docker container, wrote Helm charts, and deployed it to a new DigitalOcean Kubernetes cluster; Locust's simplicity made this pretty quick. After scaling up that cluster I was able to pinpoint the bottleneck: we had to tune certain MySQL parameters, clean up the caching logic in the application code, and scale up our application servers. That's it, and we performed much better during the next contest.
You just have to define the user behaviour. The example from locust.io looks something like this (reproduced from memory of the pre-1.0 docs, so treat it as a sketch rather than the exact snippet):
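```python
from locust import HttpLocust, TaskSet, task

class UserBehavior(TaskSet):
    def on_start(self):
        # Called once when a simulated user starts, before any task runs.
        self.client.post("/login", {"username": "ellen_key", "password": "education"})

    @task(2)  # weight: picked twice as often as profile()
    def index(self):
        self.client.get("/")

    @task(1)
    def profile(self):
        self.client.get("/profile")

class WebsiteUser(HttpLocust):
    task_set = UserBehavior
    min_wait = 5000  # wait 5-9 seconds between tasks
    max_wait = 9000
```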
From a developer's point of view, Locust's basic architecture is a master node coordinating a set of slave nodes, which generate the actual load. But even with a single node it achieved quite good throughput. That caught my attention, and I decided to look into its internals.
The way it achieves that kind of performance is that it is entirely event-driven, built on top of gevent, a high-performance coroutine-based networking library.
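If you haven't come across gevent, here's a minimal, self-contained sketch of the trick that makes this possible (the URL and the count of 100 are placeholders of mine, not anything from Locust):

```python
import gevent
from gevent import monkey

# Patch blocking stdlib calls (sockets, sleep, ...) so that they yield
# control to other greenlets instead of blocking the whole process.
monkey.patch_all()

import urllib.request  # cooperative after monkey-patching

def simulated_user(user_id):
    # An ordinary blocking-looking function; gevent switches away from it
    # whenever it waits on network I/O.
    body = urllib.request.urlopen("https://example.com/").read()
    print(f"user {user_id}: got {len(body)} bytes")

# Hundreds of these greenlets happily share a single OS thread.
jobs = [gevent.spawn(simulated_user, i) for i in range(100)]
gevent.joinall(jobs)
```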
I went through its code; here's an abstract summary of my findings. It uses ZeroMQ for task distribution among the nodes, following ZeroMQ's Dealer-Router pattern, with MessagePack for application-level serialisation/deserialisation of the messages. Based on the hatch amount and rate, the master sends a message to all the slaves telling them to start hatching users. The distribution doesn't take the hardware specification of the slave nodes into account, though; every connected slave simply gets an equal share, roughly:
```python
# condensed from the master runner; roughly what the distribution loop does
for client in (self.clients.ready + self.clients.running + self.clients.hatching):
    self.server.send_to_client(Message("hatch", data, client.id))
```
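To make the transport concrete, here's a minimal sketch of the Dealer-Router pattern with MessagePack framing, using the same building blocks (pyzmq and msgpack); the port, identity, and message fields are placeholders of mine, not Locust's actual wire protocol:

```python
import msgpack
import zmq

ctx = zmq.Context()

# Master side: a ROUTER socket can address each connected slave by identity.
master = ctx.socket(zmq.ROUTER)
master.bind("tcp://*:5557")

# Slave side: a DEALER socket connects with a stable identity.
slave = ctx.socket(zmq.DEALER)
slave.setsockopt(zmq.IDENTITY, b"slave-1")
slave.connect("tcp://localhost:5557")

# The slave announces itself so the master learns its identity.
slave.send(msgpack.packb({"type": "client_ready"}))
identity, raw = master.recv_multipart()
print("master saw:", identity, msgpack.unpackb(raw))

# The master tells that specific slave to start hatching users.
payload = msgpack.packb({"type": "hatch", "num_clients": 50, "hatch_rate": 5})
master.send_multipart([identity, payload])
print("slave got:", msgpack.unpackb(slave.recv()))
```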
The slaves then receive that configuration, start hatching new users, and execute the task sets defined in locustfile.py. Every spawned user runs independently of the others in its own greenlet (if you aren't familiar with greenlets, think of them as micro-threads that run pseudo-concurrently on just a couple of OS threads). The logic to execute a task set is pretty straightforward. On every event, such as a user hatching or a request completing or failing, the slaves report their status back to the master, which tracks everything and aggregates it into the stats we see. Fairly simple and straightforward, right?
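As a rough illustration, a stripped-down version of that per-user loop could look like this (the task functions and reporting callback are hypothetical stand-ins of mine, not Locust's internals):

```python
import random
import gevent

def run_user(tasks, report):
    # Pick a task, run it, report the outcome, then wait before the next one.
    while True:
        task = random.choice(tasks)
        try:
            task()
            report("success", task.__name__)
        except Exception as exc:
            report("failure", task.__name__, exc)
        gevent.sleep(random.uniform(1.0, 3.0))  # think time between tasks

def view_problem():
    gevent.sleep(0.05)  # stand-in for a fast HTTP request

def submit_solution():
    gevent.sleep(0.2)  # stand-in for a slower HTTP request

def report(outcome, name, exc=None):
    print(outcome, name, exc or "")

# Every simulated user is just a greenlet running this loop.
users = [gevent.spawn(run_user, [view_problem, submit_solution], report)
         for _ in range(100)]
gevent.joinall(users, timeout=3)  # let the simulation run briefly
```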
That’s it for now, thanks for reading!