Shortly before noon on Tuesday, and for almost an hour, many of the world's major websites, including Amazon, Twitch, The New York Times, HBO Max, Hulu, Spotify, Reddit and even EL PAÍS, began to experience operational problems. In some cases, they became inaccessible to many users. The reason: one of the system's secondary links, a company called Fastly, suffered an error in its systems that set off a chain reaction and brought down the companies it serves.
The incident was resolved just an hour later, and panic gave way to humor and memes. But beyond the anecdote, the event once again exposed the weakness of the network of networks on which communications, the economy and the functioning of modern societies depend. That weakness matters all the more at a time when a large share of companies (43% in Spain, according to the INE) rely on teleworking.
“This event highlights the fragility of the system on which the internet is based,” says Igor Unane, technical manager of S21. The crux, says this telecommunications systems engineer, “lies in the concentration of a structure in which” a handful of large vendors are monopolizing control. Jordi Serra, a professor at the UOC, adds: “The system is weak because sometimes it depends on a single point in this content cloud.” The key to solving the problem is to spread the load, so that a single link cannot cause a general failure. The obstacle: the costs.
There are more than 1.8 billion web pages worldwide, according to data from Internet Live Stats. These pages need services hosted in the cloud, that is, on expensive external servers distributed all over the planet. The article you are reading now is hosted in the cloud, as are many of the services that millions of people use daily, such as Gmail, Spotify or WhatsApp, and the devices we have at home, such as Alexa or Google Home. Six out of ten websites or services worldwide depend on just three providers: Amazon Web Services, Microsoft Azure and Google Cloud. Alongside these three giants, on a second tier, are other firms known as content delivery networks (CDNs). The best known are Cloudflare, Akamai and Fastly, the source of Tuesday’s failure.
A CDN is essentially a network of servers in different data centers around the world that temporarily store copies of their clients’ pages. The idea is to prevent the geographical distance of a service or its central servers, or high user demand, from making a page slow to load or even crashing the system.
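The caching idea described above can be illustrated with a minimal Python sketch. It is a toy model under stated assumptions, not how Fastly or any real CDN is implemented: the `EdgeCache` class, its `ttl_seconds` parameter and the `fetch_from_origin` callback are hypothetical names introduced only for illustration.

```python
import time


class EdgeCache:
    """Toy model of one CDN edge server holding temporary copies of pages.

    All names here are illustrative assumptions, not a real CDN API.
    """

    def __init__(self, fetch_from_origin, ttl_seconds=60.0, clock=time.monotonic):
        self._fetch = fetch_from_origin  # callable that reaches the client's central server
        self._ttl = ttl_seconds          # how long a stored copy is considered fresh
        self._clock = clock              # injectable clock, which makes testing easier
        self._store = {}                 # url -> (body, expires_at)

    def get(self, url):
        """Return (body, status), where status is "HIT" or "MISS"."""
        now = self._clock()
        entry = self._store.get(url)
        if entry is not None and entry[1] > now:
            # A fresh copy is stored nearby: the origin server is not contacted.
            return entry[0], "HIT"
        # No copy, or the copy has expired: fetch from the origin and store it.
        body = self._fetch(url)
        self._store[url] = (body, now + self._ttl)
        return body, "MISS"
```

In this sketch, the first request for a page travels to the origin server and every later request within the freshness window is answered from the nearby copy. That is why a CDN reduces load and latency, and also why so many sites depend on it being available.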
Fastly posted an incident notice at 11:58 a.m. (Spanish peninsular time) that stated: “We are currently investigating the possible impact on the performance of our CDN services.” At 12:44 p.m., the company said that the problem had been identified and that a fix was being applied. Nine minutes before 3:00 p.m., it declared the incident closed. The reasons for the outage are not entirely clear. The affected company explained on Tuesday that it had “identified a service configuration” that caused “interruptions in the points of presence worldwide”, so it had proceeded to disable that configuration. The company did, however, quickly rule out a cyberattack.
In any case, whatever the reason, the doubts centered on the vulnerability of the network as a whole. The key to solving the problem would be to spread the load, so that a single link cannot cause a generalized failure. But that runs into a problem: the costs. Another is that companies like Fastly do not have many competitors. To begin with, this business requires heavy investment in infrastructure, which limits competition in the sector. In addition, it is not profitable for companies to contract several suppliers. “These situations are unavoidable when we depend on a single provider,” explains Unane, a telecommunications systems engineer. “It’s like hiring two companies to install phone and fiber at home: why pay double for the one day you might be left without internet?”
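What it would mean for a single link not to cause a generalized failure can be sketched in a few lines of Python. This is only an illustration under assumptions: `fetch_with_failover` and the provider names in the usage lines are hypothetical, the providers are simulated functions rather than real CDN clients, and the sketch ignores the contracting and infrastructure costs the experts point to.

```python
def fetch_with_failover(url, providers):
    """Try each (name, fetch) pair in order and return the first success.

    `providers` is a list of (name, callable) pairs; each callable takes a URL
    and either returns the page body or raises ConnectionError.
    """
    errors = []
    for name, fetch in providers:
        try:
            return name, fetch(url)
        except ConnectionError as exc:
            # This provider is down: record the error and try the next one.
            errors.append(f"{name}: {exc}")
    # Every provider failed, so the outage is still total.
    raise ConnectionError("all providers failed: " + "; ".join(errors))


def failing_provider(url):
    """Simulated provider that is currently suffering an outage."""
    raise ConnectionError("service unavailable")


def healthy_provider(url):
    """Simulated provider that answers normally."""
    return f"<html>content of {url}</html>"


served_by, body = fetch_with_failover(
    "/front-page",
    [("primary-cdn", failing_provider), ("backup-cdn", healthy_provider)],
)
```

Here the request is answered by the backup when the primary fails. The trade-off is the one Unane describes: the second provider has to be paid for even on the days it is never used.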
It is not the first time that a similar outage has occurred. In November, Amazon Web Services’ servers suffered a failure that ended up stopping cloud-dependent home cleaning robots from working. In 2017, the same company suffered an even bigger problem, which lasted five hours and spread chaos across the network. Amazon, in addition to apologizing, explained at the time that it was all due to an employee’s error: the employee made a typo in a command and the servers stopped working. “Unfortunately, one of the command’s inputs was entered incorrectly and a large number of servers went down,” the company said at the time.
Another large company suffered a major outage last December. Most of Google’s services (Google, Gmail, Google Docs, YouTube and its cloud) were down for an hour due to an internal storage problem. The failure affected millions of people around the world who had adopted its tools for remote work.