Why Facebook Went Down & How Processes Are Critical to Prevent Catastrophic Product Failure
Facebook’s Six-Hour Outage From Hell
On October 4th 2021 at around 4 pm GMT the internet giant Facebook, as well as its affiliate apps WhatsApp and Instagram, came to a complete standstill. The seismic digital event was felt across the world as users and businesses began to notice messages weren’t sending, images weren’t posting and their updates were falling on deaf ears.
As everyone ran to Twitter, LinkedIn, Reddit and TikTok to vent their frustrations and create glorious memes, the entirety of Facebook sat in pitch darkness.
So, what actually happened?
In a statement, Facebook attributed the outage to a faulty configuration change on its backbone routers. These routers coordinate network traffic between Facebook’s data centres; essentially, they are the foundation on which all of Facebook is built. Consequently, the fault had a “cascading effect on the way our data centres communicate, bringing our services to a halt.”
To get a bit more technical, the real failure was the withdrawal of Facebook’s Border Gateway Protocol (BGP) routes. Essentially, BGP is the protocol that the large networks making up the internet use to tell each other which addresses they can reach. They constantly update and share this routing information, which is what turns the internet into a network of networks. In this instance, Facebook’s BGP routes were withdrawn, so the rest of the internet no longer knew how to reach Facebook, and that’s where the failure began.
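To make route withdrawal a little more concrete, here is a toy Python sketch. It is not real BGP and it is not Facebook's configuration: the `RoutingTable` class, the network name and the documentation prefix 192.0.2.0/24 are all made up for illustration. It only shows the basic idea that once a prefix is withdrawn, other networks have no route left to the addresses inside it.

```python
import ipaddress


class RoutingTable:
    """Toy model of the routes one network has learned from its peers."""

    def __init__(self):
        # Maps an advertised prefix to the (hypothetical) network that announced it.
        self.routes = {}

    def advertise(self, prefix, origin):
        self.routes[ipaddress.ip_network(prefix)] = origin

    def withdraw(self, prefix):
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address):
        """Return the origin of the most specific matching prefix, or None."""
        ip = ipaddress.ip_address(address)
        matches = [net for net in self.routes if ip in net]
        if not matches:
            return None
        best = max(matches, key=lambda net: net.prefixlen)
        return self.routes[best]


table = RoutingTable()
table.advertise("192.0.2.0/24", "ExampleSocialNetwork")
before = table.lookup("192.0.2.10")  # a route exists, traffic can be delivered

table.withdraw("192.0.2.0/24")
after = table.lookup("192.0.2.10")   # no route left, the address is unreachable
```

The servers behind the address can be perfectly healthy in both cases. What changes is whether anyone else knows how to reach them.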
N.B. In general, when making changes to your services, determining a risk factor is essential. From there, you can establish required confirmations based upon these factors to avoid any issues.
In Facebook’s case, if there was a risk that updating the BGP config could remove the entire business from the internet, then the configuration update should have been reviewed by at least five senior network engineers, with an impact assessment from each.
The aftermath saw not only Facebook’s social media apps fail but also its internal tools, like ‘Workplace’. This included company emails, remote testing capabilities and diagnostic tools.
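As a rough sketch of what risk-based confirmations could look like in practice, the Python below blocks a change until it has enough distinct approvals for its risk level. The risk levels, the `REQUIRED_APPROVALS` thresholds and the `ChangeRequest` shape are our own assumptions for illustration (the five approvals for a critical change simply mirror the suggestion above); they do not describe Facebook's actual process.

```python
from dataclasses import dataclass

# Hypothetical policy: how many distinct reviewers each risk level needs.
REQUIRED_APPROVALS = {"low": 1, "medium": 2, "critical": 5}


@dataclass
class ChangeRequest:
    description: str
    risk: str              # "low", "medium" or "critical"
    approvals: tuple = ()  # names of reviewers who have signed off


def can_deploy(change):
    """Allow deployment only once enough distinct reviewers have approved."""
    needed = REQUIRED_APPROVALS[change.risk]
    return len(set(change.approvals)) >= needed


copy_tweak = ChangeRequest("Tweak button copy", "low", ("ana",))
bgp_update = ChangeRequest("Update BGP config", "critical", ("ana", "ben"))

copy_tweak_ok = can_deploy(copy_tweak)  # one approval is enough for low risk
bgp_update_ok = can_deploy(bgp_update)  # two approvals are not enough for critical
```

Counting distinct names rather than raw approvals is a deliberate choice here, so the same reviewer approving twice does not satisfy the policy.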
Perhaps most interestingly, staff couldn’t get into the server farms to manually reconfigure the servers because the access controls on the doors were controlled by services running on the Facebook domain - the domain that had just vanished. So, the tech giant was taken back to the stone age, having to manually repair the issue at the source in the server room, according to a reporter at the NY Times.
Once they could get in of course ( ͡° ͜ʖ ͡°)
Processes are critical to preventing catastrophic failure
In terms of owning a digital product, the first lesson we can learn from this is a very simple one. If your product doesn’t work, your users will flock straight to your competitor.
The evidence? Just take a look at this image from Cloudflare below. The graph shows the jump in queries for alternative social media platforms like Twitter and TikTok following the outage.
Though we may never know the exact details that led to this particular outage, one thing is for sure: the processes leading up to deploying new releases and feature updates can make or break your product.
What processes can help avoid critical delivery failures?
Having a clear disaster recovery plan
Though disaster may never strike, failing to prepare is preparing to fail. That is why it is crucial to have a strong IT disaster recovery plan. The plan should document how long it should take to stabilise, should the worst happen. This includes setting disaster recovery priorities, so you know where to focus your efforts in time-sensitive scenarios.
It is also vital to have this strategy in place to revive hardware, applications and data in line with the recovery time established in the plan.
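One lightweight way to keep a plan like this honest is to write the priorities and recovery times down as data and check them after every recovery drill. The sketch below is only an illustration: the `Service` fields, the example services and all of the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Service:
    name: str
    priority: int           # 1 = restore first
    rto_minutes: int        # recovery time agreed in the plan
    estimated_minutes: int  # how long the last recovery drill actually took


def recovery_order(services):
    """List service names in the order they should be restored."""
    return [s.name for s in sorted(services, key=lambda s: s.priority)]


def missing_target(services):
    """List services whose last drill took longer than the plan allows."""
    return [s.name for s in services if s.estimated_minutes > s.rto_minutes]


# Hypothetical plan for a small digital product.
plan = [
    Service("marketing site", priority=3, rto_minutes=240, estimated_minutes=60),
    Service("payments API", priority=1, rto_minutes=30, estimated_minutes=45),
    Service("customer database", priority=2, rto_minutes=60, estimated_minutes=50),
]

order = recovery_order(plan)
overdue = missing_target(plan)
```

Here `order` tells the team where to focus first, and `overdue` flags the services whose recovery currently takes longer than the plan promises.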
Creating and maintaining SOPs
Standard Operating Procedures, or SOPs, are at the absolute core of robust software development. An SOP is a concise guide detailing how a technical team delivers a piece of software.
Creating SOPs can be time-consuming but tends to be worthwhile, as it allows you to assess exactly what is required from each operation, who is involved and what best practice looks like. From here, you can create a guide that is accessible to every team member.
Testing & Regression Strategy
Test. Test. Test.
The best way to prevent failures is to catch them before they arise. A clear testing process must be established and adhered to, so that issues are caught before, during and after deploying digital products.
Processes to include could be pull requests, peer review and QA.
For example, pull requests allow developers to document every change they make to a codebase and ensure that they can undergo peer review. This double-pronged approach ensures that nothing slips through the cracks and you can avoid any Facebook-sized headaches.
Although poor QA won’t have caused Facebook’s downfall yesterday, it is an essential process in delivering digital products that meet users’ expectations.
QA helps spot problems with a product from a UX point of view, rather than a technical one. For example, the code itself may be correct when launching a new timer on your app, but only the QA who checks the change will spot the finer detail: the numbers on the timer display are too small for anyone to read.
Don’t put all your eggs in one basket
Finally, having a set of distributed services, rather than a single point of failure, is crucial. According to Ian, our technical director at KOMODO:
“The best thing about the internet is the opportunity for decentralisation.”
What happened at Facebook, based on its current set-up, was a ticking time bomb. With everything running internally on the Facebook domain, all it took was one fault to cripple the entire network. Listen to Ian. Embrace decentralisation.
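A simple way to look for this kind of trap in your own set-up is to map what depends on what, then ask whether the tools you would need in a recovery rely on the very thing that might fail. The sketch below does that with a small graph walk. The dependency map is hypothetical and only loosely modelled on the story above; it is not a description of Facebook's real architecture.

```python
def depends_on(graph, start, target):
    """Return True if `start` depends on `target`, directly or indirectly."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for dep in graph.get(node, ()):
            if dep == target:
                return True
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return False


# Hypothetical map: each service lists the things it needs in order to work.
dependencies = {
    "door access control": ["internal domain"],
    "diagnostic tools": ["internal domain"],
    "company email": ["internal domain"],
    "out-of-band console": [],
}

# Tools the team would reach for during an outage of the internal domain.
recovery_tools = ["door access control", "diagnostic tools", "out-of-band console"]
at_risk = [
    tool for tool in recovery_tools
    if depends_on(dependencies, tool, "internal domain")
]
```

Anything that ends up in `at_risk` will go down together with the service it is supposed to help you recover, so it needs an independent fallback.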
Review and lessons learned
There’s no doubt Mark Zuckerberg and his CTO Mike Schroepfer will have a lot of questions to answer today from their users. Of course, people will be looking to place the blame somewhere. However, if one person in your business has the power to make a mistake that does this much damage, the business failed, not the individual.
Furthermore, the impact of their products going down worldwide goes beyond people being unable to post updates. The millions of businesses that use these products to operate and advertise will have seen their profits hit.
While your digital product may not have millions of dependent users, it does demonstrate the responsibility tech companies have to build and maintain their processes to deliver the best possible user experience.