LeetCode Outage on Nov 19 - Post Analysis
It’s been a long time since I posted something on my blog. Lately I have been pretty busy with interesting and/or funny things happening in my life. I have draft posts about those which I may make public one day.
Last week, LeetCode experienced an issue that took the site completely down for 2 hours 35 minutes, followed by degraded service for another 47 minutes. All of us at LeetCode apologize for the inconvenience you faced. At LeetCode, we take pride in building highly available systems that provide a delightful user experience, and we are aware of the trust you place in LeetCode. That being said, we would like to share what we learned from the post analysis our engineers performed and the steps we have taken to make our system more resilient and fail-proof.
Outage start - 1:58 PM, Nov 19 PST
Partial recovery - 4:33 PM, Nov 19 PST
Full recovery - 5:20 PM, Nov 19 PST
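As a quick sanity check, the durations quoted above can be recomputed from these three timestamps. This is only a small sketch: the helper names `parse` and `fmt` are made up for illustration, and only the time of day is parsed because all three events fall on the same date.

```python
from datetime import datetime


def parse(clock_time):
    # Parse a 12-hour clock time such as "1:58 PM"; the date part is
    # irrelevant here because all three events happened on Nov 19.
    return datetime.strptime(clock_time, "%I:%M %p")


def fmt(delta):
    # Render a timedelta as hours and minutes.
    minutes = int(delta.total_seconds() // 60)
    return f"{minutes // 60}h {minutes % 60}m"


outage_start = parse("1:58 PM")
partial_recovery = parse("4:33 PM")
full_recovery = parse("5:20 PM")

full_outage = partial_recovery - outage_start    # site completely down
degraded = full_recovery - partial_recovery      # site up but degraded

print("Complete outage:", fmt(full_outage))
print("Degraded service:", fmt(degraded))
```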
LeetCode’s architecture looks somewhat like this -
Our architecture is designed to be scalable as well as fail-safe. Despite the layers of redundancy, there are still cases where our architecture may fail. On Nov 17, Kubernetes’s internal cilium-operator service was accidentally removed from the cluster. It is a non-critical service responsible for garbage collection of the network policies, but because its recovery went wrong, that garbage collection stopped. At 1:58 PM on Nov 19 PST, we hit the limit of the network policy map, which broke all networking and halted communication between all the components of the cluster, leading to this downtime.
1:58 PM, Nov 19 PST
Internal monitoring systems started generating alerts. We discovered that the site was down. By 2:04 PM, the whole DevOps team was triaging the systems to figure out why the application servers were not serving any requests.
2:09 PM, Nov 19 PST
We discovered that the cause was DNS server containers being stuck in the PENDING state in the cluster. We found that they were stuck with errors related to Cilium (k8s’s networking layer).
2:21 PM, Nov 19 PST
We felt that it was a major issue and that debugging the current cluster state would take a lot of time. We decided to fire up a new Kubernetes cluster, since the user-facing services in the cluster are all stateless.
2:57 PM, Nov 19 PST
Finally, provisioning of the new cluster finished, but suddenly the old cluster started serving requests. We were confused and began investigating why it had started serving requests again.
3:07 PM, Nov 19 PST
The old cluster failed again. We went back to working on the newly created cluster.
3:18 PM, Nov 19 PST
We tried to set up production services on the new k8s cluster, but due to a difference in k8s cluster versions, our automatic provisioning scripts were failing. We decided to fall back to restoring our old architecture, which didn’t use Kubernetes but wasn’t very scalable.
4:19 PM, Nov 19 PST
We restored our old architecture and the site was partially up, but due to the large amount of traffic, latency was quite high and the UX was not up to the mark. We went back to working on the creation of the Kubernetes cluster.
5:20 PM, Nov 19 PST
Provisioning of the new cluster finished and all the user-facing services were set up in the cluster. We switched the traffic to the new cluster and the site was back to normal.
Built on top of the BPF (Berkeley Packet Filter) interface, Cilium implements the networking stack of k8s. As a basic overview, it assigns an IP to each container/service/endpoint (all components of k8s) and transparently manages communication between different components at different layers. It maintains a policy map which contains all the rules regarding address translation, packet dropping policies and more. On the Kubernetes cluster, we have cilium and cilium-operator services running. The former is responsible for implementing this network functionality and the latter is responsible for clearing entries from that policy map as nodes/pods/k8s components get added or removed. So, cilium-operator is just responsible for garbage collection.
When we accidentally deleted cilium-operator from the cluster, we had to restore it from another Kubernetes cluster, since this service is set up in the cluster by the cloud provider. After the restore, the cilium-operator pod’s status changed to “Running” as well as “Ready”, and since we weren’t very familiar with its internal workings, things seemed fine to us.
Since the restored cilium-operator service came from a newer version of Kubernetes, its specification differed, due to which it wasn’t cleaning up that map. As time passed, new nodes were added and new pods were deployed, until we hit the upper limit of the configured policy map, i.e. 16384 entries. Once the limit was hit, Cilium stopped adding new entries to it, which caused the outage.
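To make the failure mode concrete, here is a toy simulation of a fixed-capacity map that fills up when nothing cleans out stale entries. It is only a sketch under stated assumptions: `PolicyMap`, `simulate`, the one-entry-per-churn-event model and the `live_window` parameter are all invented for illustration and do not reflect how Cilium actually stores or collects its policy entries. Only the capacity of 16384 comes from the incident itself.

```python
class PolicyMap:
    """Toy fixed-capacity map (hypothetical, not Cilium's real data structure)."""

    def __init__(self, capacity=16384):
        self.capacity = capacity
        self.entries = {}

    def add(self, key, rule):
        # Reject new keys once the map is full, mirroring "stopped adding
        # new entries" in the description above.
        if key not in self.entries and len(self.entries) >= self.capacity:
            return False
        self.entries[key] = rule
        return True

    def garbage_collect(self, live_keys):
        # Drop entries whose owner (pod, node, ...) no longer exists.
        stale = [key for key in self.entries if key not in live_keys]
        for key in stale:
            del self.entries[key]
        return len(stale)


def simulate(churn_events, gc_enabled, capacity=16384, live_window=100):
    """Add one entry per churn event.

    Returns the index of the first rejected insert, or None if every
    insert succeeded. Assumes only the most recent `live_window`
    entries are still in use at any time.
    """
    policy_map = PolicyMap(capacity)
    for i in range(churn_events):
        if not policy_map.add(i, "allow"):
            return i
        if gc_enabled:
            live = set(range(max(0, i - live_window + 1), i + 1))
            policy_map.garbage_collect(live)
    return None


print("Without GC, first failure at event:", simulate(20000, gc_enabled=False))
print("With GC, first failure at event:", simulate(20000, gc_enabled=True))
```

In this model, the run with garbage collection never fills the map, while the run without it fails as soon as the number of inserted entries reaches the capacity, no matter how few of those entries are still in use.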
- Set up better RBAC so that no one can tamper with Kubernetes internal components.
- Add one more layer of redundancy by setting up a backup Kubernetes cluster in a different data center, which would be running all the time. In case of any production issue, we can just scale it up and modify the Gateway configuration to send traffic to it.
We know how much you rely on LeetCode for anything related to your career. We feel happy when you move one step closer to your goal through LeetCode. We are really passionate about the availability of our services. We have been continuously learning from our mistakes and improving to deliver the best UX to our beloved users.