Facebook down! Just what caused the October outage?
The temporary shutdown of a social media site is no big deal, right? It is when that site is ingrained in the fabric of many other things that we use every day. Geoff Meads explains the great Facebook shutdown of October 2021.
On October 4, 2021, Facebook and its related services went down for approximately five and a half hours. The Facebook conglomerate includes sister services WhatsApp and Instagram plus Facebook’s native messaging, commerce and ‘Login with Facebook’ technology. In short, the outage was a big deal for a lot of people.
While we can all probably live a few hours without seeing what our favourite celebrities are having for dinner, messaging services like WhatsApp and Facebook Messenger are regularly used to organise all sorts of activities, so real-time dependability is hugely important.
In addition, when the popular ‘Login with Facebook’ service stops working, so do a huge number of websites that use it. That’s around 160,000 websites and apps at the time of writing.
So, what exactly happened? Get ready for some new acronyms…
UTC
The timeline in this story will be described using the Coordinated Universal Time (UTC) standard, so let’s start there.
UTC is used to coordinate online services around the world without the need to consider time zones. It is the same time standard as both Greenwich Mean Time (GMT) and Zulu time (used in military circles) and is not subject to any daylight saving changes.
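To make that concrete, here is a minimal Python sketch (not from the original article) of how a service might record a timezone-independent UTC timestamp for its logs:

```python
# A minimal sketch: logging an event with a timezone-independent UTC timestamp.
from datetime import datetime, timezone

# A timezone-aware "now" in UTC -- the same instant everywhere in the world.
event_time = datetime.now(timezone.utc)
print(event_time.isoformat())  # e.g. 2021-10-04T15:40:00+00:00
```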
DNS
We have discussed the Domain Name System (DNS) in previous articles. The short explanation is that DNS is a giant phone book that associates domain names (like www.facebook.com) with IP addresses. That IP address is where the service behind a domain name (Facebook’s website in this case) can be found using layer 3 IP routing.
Copies of the DNS register can be found in thousands of places around the Internet. When changes are issued, it takes a while for all copies to be updated. If an incorrect change is issued (a DNS record suddenly points to the wrong IP address, or to no IP address at all) then users will no longer be able to connect to the associated domain unless they use the correct IP address directly. Simply put, when DNS goes down, so do the services that rely on it.
But surely Facebook has more than one IP address? Yes, it does! In fact, the outage was far more involved than the loss of a single record for a single domain name.
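You can see this for yourself with a minimal sketch using nothing but Python’s standard library; the hostname and port are illustrative, and the addresses returned will vary by region and over time:

```python
# A minimal sketch: asking the system's DNS resolver for www.facebook.com.
import socket

# getaddrinfo performs the DNS lookup and can return several records,
# typically a mix of IPv4 and IPv6 addresses.
results = socket.getaddrinfo("www.facebook.com", 443, proto=socket.IPPROTO_TCP)
for family, _, _, _, sockaddr in results:
    print(family.name, sockaddr[0])
```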
ASN
Large online companies don’t just have single services connected to the Internet but control a whole chunk of the Internet for themselves. They can be considered sub-networks within the larger network we call the Internet.
These chunks have an ID called an Autonomous System Number or ‘ASN’. An ASN advertises itself to other Internet systems by publishing a set of routes to and through its constituent parts. It was a change to one of these advertisements on October 4 that resulted in the outage.
BGP
The actual detail of these advertisements adheres to yet another protocol: the Border Gateway Protocol (BGP). BGP defines a standard format for advertising routes to and through systems, and an ASN uses it to publish its configuration so that incoming traffic can reach, and navigate through, the network behind it.
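As a purely conceptual illustration (not a working BGP implementation), here is a Python sketch of the kind of information a route advertisement carries. AS32934 is Facebook’s real ASN; the prefix, path and next-hop values below are illustrative placeholders from documentation ranges, not real routing data.

```python
# A conceptual sketch only -- it models the core fields a BGP route
# advertisement carries: which IP prefix is reachable, and via which
# chain of autonomous systems traffic should travel.
from dataclasses import dataclass

@dataclass
class RouteAdvertisement:
    prefix: str        # the block of IP addresses being advertised
    origin_asn: int    # the autonomous system that owns the prefix
    as_path: list      # chain of ASNs traffic would traverse to get there
    next_hop: str      # router to forward matching traffic towards

# Illustrative values: AS32934 is Facebook's ASN; 64500 and 203.0.113.1
# are reserved documentation/example values, not real routing data.
route = RouteAdvertisement(
    prefix="157.240.0.0/16",
    origin_asn=32934,
    as_path=[64500, 32934],
    next_hop="203.0.113.1",
)
print(route)
```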
Timeline
According to Cloudflare (a huge Content Delivery Network, or ‘CDN’), the Facebook ASN issued an update to its routing via BGP at approximately 15:40 UTC, containing a large number of routing changes. This update included incorrect information about how to reach Facebook services, including their DNS records.
By 15:51 UTC Facebook’s DNS records were unavailable, and DNS requests for Facebook services simply returned a ‘SERVFAIL’ error, meaning the request had failed.
The result was that any browser, app or messaging service trying to use Facebook services couldn’t connect. Even though the systems themselves were working, there was no way of getting to them because DNS was no longer responding. Imagine the contacts app on your phone stops working: all the people stored in the app still exist, but you have no idea what their numbers are, so you can’t call them.
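From a client’s point of view, the failure looked something like this minimal Python sketch (an assumption about how a typical app might experience it, not Facebook’s actual client code): the name lookup itself fails, so no connection can even be attempted.

```python
# A minimal sketch of what client code experienced during the outage:
# the DNS lookup fails before any connection can be attempted.
import socket

try:
    socket.getaddrinfo("www.facebook.com", 443)
except socket.gaierror as err:
    # During the outage this branch would fire -- the name could not be
    # resolved, even though Facebook's own servers were still running.
    print("DNS lookup failed:", err)
```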
Shortly after, modest panic ensued. Hearing on other platforms such as Twitter that Facebook’s services were down, users tried to connect to Facebook, Instagram and WhatsApp as if to check. They did this many times… Again according to Cloudflare, DNS requests for Facebook services rose to more than 30x the usual number during the outage.
Users then turned to services like Twitter to announce their displeasure and ask others what was happening. The effect was a massive upswing in DNS requests to other messaging services, nearly doubling the usual number.
At around 21:00 UTC the Facebook ASN issued a corrective update, again via BGP, and the changes started to propagate around the Internet. As a result, Facebook’s DNS servers became reachable again and Facebook started to come back online. By 21:20 UTC Facebook itself was mostly back, and the other services, such as Instagram and WhatsApp, soon followed.
Quality Control
“How on earth can this happen?” I hear you cry. “Surely there is some kind of quality control for these things?” An interesting question, and yes, there is…
In fact, as clarified in a post by Facebook themselves the day after the outage, changes like this go through an automated quality-control audit before they are released. But guess what? The automated audit system had a bug in it…
What can we learn from this as integrators?
Well, the first takeaway here must be: even the big guys get it wrong sometimes! No matter how carefully we plan and build our systems, we are always at the mercy of the actual code that runs them.
The next takeaway is that we need a truly deep understanding of the systems we work with. The fact that Facebook diagnosed and fixed an issue in such a massively complex system in a relatively short time is impressive. While smart home systems are nowhere near as involved, they can get quite complex at times. A robust and up-to-date set of documentation, plus clear and well-rehearsed troubleshooting procedures, are a must for the modern integrator.
We now live in a world where things are just expected to work, and tolerance of mistakes is low. If we accept that, no matter how well we plan, sometimes things will still go wrong, then we need to plan our response well in advance and be ready to make things right quickly.