It’s the front door of the internet for at least three billion of us, and one of the companies that underpins large swathes of the online world. But an embarrassingly long six-hour Facebook outage on 4 October highlighted just how significant a role the company plays in our day-to-day digital lives – and left many wondering what could possibly have gone so wrong.
What caused the outage?
In a statement, Facebook’s vice-president of infrastructure, Santosh Janardhan, said that “configuration changes on the backbone routers that coordinate network traffic between our data centres caused issues that interrupted this communication”. In essence, a likely innocuous error by a Facebook employee managed to make it impossible to access Facebook services.
To understand why a small change had such an outsized impact, you need to look at how the internet is structured. “The clue is in the name,” says Alan Woodward, professor of cybersecurity at the University of Surrey. “It’s a network of networks. The internet is a collection of interconnected networks.” Because of their ubiquity and power in driving large parts of how we communicate, some of the biggest digital companies in the world – including Facebook, Google and Microsoft – have responsibility for running the top tier of the internet.
At that level, it is a series of nodes: when a user visits one of these nodes, their request to visit a website is interrogated and cross-checked against a routing table – which is a list of every batch of internet protocol (IP) addresses issued.
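To make the idea of cross-checking a request against a routing table concrete, here is a minimal sketch in Python. It is an illustration only, not how any real router is implemented: the node names are invented, and the address blocks are ones reserved for documentation. When more than one block matches, the sketch assumes the most specific one wins.

```python
import ipaddress

# Hypothetical routing table: each block of addresses maps to the node
# that handles it. The blocks are reserved documentation ranges and the
# node names are made up for this example.
ROUTING_TABLE = {
    ipaddress.ip_network("192.0.2.0/24"): "node-a",
    ipaddress.ip_network("198.51.100.0/24"): "node-b",
    ipaddress.ip_network("198.51.100.128/25"): "node-c",
}


def next_hop(address, table=ROUTING_TABLE):
    """Return the node responsible for an address, or None if no block matches."""
    ip = ipaddress.ip_address(address)
    matches = [network for network in table if ip in network]
    if not matches:
        # No block in the table covers this address.
        return None
    # Prefer the most specific block (the longest prefix).
    best = max(matches, key=lambda network: network.prefixlen)
    return table[best]
```

An address that falls outside every listed block gets no answer at all, which is the situation the rest of this story turns on.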
What is an IP address, and how does it work?
IP addresses are a bit like the longitude and latitude of the planet. They tell you with pinpoint accuracy where a particular website or service is hosted. But we tend not to use IP addresses in everyday web browsing, because they are long, complicated numbers. Instead, we use domain names – the digital equivalent of house numbers and street names. Type facebook.com into a web browser and it’s translated into an IP address by the domain name system (DNS). The first few numbers of an IP address reveal which of the big nodes to go to, while the remainder give more explicit directions.
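The two steps described above, turning a name into an address and then splitting that address into a "which node" part and a "more explicit directions" part, can be sketched roughly as follows. The lookup table, the name example.com, its address and the 24-bit split are all assumptions chosen for illustration; a real lookup goes out over the network rather than consulting a local dictionary.

```python
import ipaddress

# Hypothetical name records. 192.0.2.0/24 is reserved for documentation,
# so this address does not belong to any real website.
NAME_RECORDS = {"example.com": "192.0.2.44"}


def resolve(name, records=NAME_RECORDS):
    """Translate a name into an IP address string, or None if it is unknown."""
    normalised = name.strip().lower().rstrip(".")
    return records.get(normalised)


def split_address(address, prefix_len=24):
    """Split an address into its network block and the host number within it.

    The prefix length of 24 bits is an assumption for this example;
    real blocks come in many sizes.
    """
    interface = ipaddress.ip_interface(f"{address}/{prefix_len}")
    network = interface.network
    host_part = int(interface.ip) - int(network.network_address)
    return str(network), host_part
```

Under these assumptions, `resolve("example.com")` gives `"192.0.2.44"`, and `split_address("192.0.2.44")` gives the block `"192.0.2.0/24"` plus the host number `44` inside it.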
Working out the route to the right node is done through a protocol called the Border Gateway Protocol (BGP). But it relies on the routing tables stored by these big providers being correct. And in Facebook’s case, the configuration was screwed up. “People tried talking to Facebook, but the system was looking around saying, ‘There’s no such thing as Facebook,’” says Woodward. “Some registrars would sell you facebook.com because as far as they were concerned, it didn’t exist.”
Facebook had managed to delete itself from its own routing tables, meaning anyone who typed in facebook.com – or any website that connects to Facebook using facebook.com – was lost. It’s like traversing a huge metropolis of complicated, interconnected roads without a map.
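What it means for a network to vanish from the routing tables can be shown with a toy model. This is a deliberately simplified sketch, not an implementation of BGP: the `RouteTable` class, its method names and the network name are invented for the example, and the address block is one reserved for documentation.

```python
import ipaddress


class RouteTable:
    """Toy model of a routing table that networks can join and leave."""

    def __init__(self):
        self._routes = {}

    def announce(self, prefix, origin):
        """Record that `origin` can be reached for every address in `prefix`."""
        self._routes[ipaddress.ip_network(prefix)] = origin

    def withdraw(self, prefix):
        """Remove a previously announced block; do nothing if it is absent."""
        self._routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address):
        """Return the origin for an address, or None if nobody announces it."""
        ip = ipaddress.ip_address(address)
        matches = [network for network in self._routes if ip in network]
        if not matches:
            return None
        best = max(matches, key=lambda network: network.prefixlen)
        return self._routes[best]


# A hypothetical network announces its block, then withdraws it by mistake.
table = RouteTable()
table.announce("203.0.113.0/24", "example-network")
reachable_before = table.lookup("203.0.113.7")
table.withdraw("203.0.113.0/24")
reachable_after = table.lookup("203.0.113.7")
```

In this sketch `reachable_before` is `"example-network"` and `reachable_after` is `None`: the servers are still running, but nothing on the outside knows how to get to them.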
What was the underlying issue?
One of the problems that the outage demonstrates is the fallibility of the internet and the infrastructure that powers it, as well as the trouble with concentrating that power in just a few hands: the consequences can be catastrophic if something goes wrong.
Although we tend to believe that the internet just operates independently, with computers whirring and buzzing and smoothly handling everything we throw at them, like all technology it has a human parent that has decided how it will work. Technology follows the instructions we give it – and sometimes we give it incorrect instructions. Computers can’t think for themselves and question whether you really want to change things in a way that could break the internet.
Why did it take so long to fix?
A big issue was that Facebook employees reportedly struggled to access their own buildings and networks, because those systems also run on Facebook servers – which didn’t exist in the eyes of the infrastructure running the internet. “It looks like they had to physically go where the server was and log in manually there and set it back to where it was before,” says Woodward. This reveals how Facebook is itself a hostage to fortune.
What does this tell us about Facebook’s dominance?
“You get these cascading effects,” continues Woodward. “It shows how they [Facebook] hold a disproportionate sway over how the whole thing operates and functions, such that when they malfunction, it’s not even like an ISP [internet service provider, such as BT, TalkTalk or Sky] going down where the customers go down – it affects the rest of the internet.”
For example, there were issues for services that rely on Facebook for part of their operations. In an attempt to make it easier for users to sign up to systems and not have to remember reams of usernames and passwords, many platforms offer the ability to log in using a Google or Facebook account. That’s brilliant and convenient when it works – but it does mean that if a Facebook account is used to log in to a third-party service, and Facebook goes down, access to that service is lost. “A lot of new online services will use one of the main social media platforms as their sign-on,” explains Woodward. “If Facebook disappeared, who’s liable for that?”
What can be done to make things better?
One fix would be double-checking work before it goes live and potentially brings down large parts of the internet. “We’re working to understand more about what happened today so we can continue to make our infrastructure more resilient,” wrote Facebook’s Janardhan. Trying not to put all your eggs into one digital basket owned by Facebook is another, smaller solution that individuals can try.