Approaches to Rebuild Consumer Trust after Service Outage

It was a depressing start to the week for businesses that relied solely on Facebook, WhatsApp and Instagram when the global outage struck at around 11.40 a.m. Eastern time on 4 October 2021. Many businesses dependent on these platforms were left helpless, with no backup plans and no other means of reaching their customers.

Users who relied on their Facebook accounts to authenticate to services such as e-commerce sites, Internet of Things (IoT) devices and mobile games also found their daily activities disrupted, as the outage locked them out of their accounts.

Many ordinary users were merely inconvenienced and switched to alternatives such as Telegram, Signal, short message service (SMS) and iMessage to stay in contact with colleagues, friends and loved ones. Many business owners affected by the outage, however, are now reconsidering how they connect with potential customers.1

Facebook attributes the outage to a faulty command issued during routine maintenance on its backbone routers.2

 

Was this the first time Facebook had experienced a service outage?

No. Facebook had a significant outage in 2019, caused by a server configuration change, that lasted more than 24 hours.3 Its worst outage dates back to 2008, when a bug knocked the service offline for almost 24 hours and affected about 80 million users.4

 

Is Facebook the only service provider to have suffered such an outage?

No. On 8 June 2021, for example, a bug brought down Fastly, one of the world’s leading content delivery networks (CDNs), taking thousands of websites, including Amazon, PayPal and Reddit, offline for almost an hour.5

Akamai, too, went down for about an hour on 22 July 2021 when a software configuration update triggered a bug in its domain name system (DNS) service.6 The outage affected major websites and online services such as PlayStation, Amazon Web Services (AWS), Google and Salesforce.

The common thread in these incidents is a single point of failure.7 In each case, a human error led to a loss of availability for the many people who rely on these services, directly or indirectly, in their daily activities.

 

What can be done to prevent or mitigate such incidents?

To reduce the risk of an incident, we should keep in mind the three fundamental tenets of information security: confidentiality (C), integrity (I) and availability (A), collectively known as the CIA triad. Although all three matter to every organisation, priorities differ with the nature of the business, so an organisation may focus more heavily on one or two of the tenets. However, because the three are interdependent, the triad falls apart should any one of them give way.

We will not discuss all three areas of the CIA triad here. Since the key concern in the Facebook outage was users’ dependency on its services, the focus will be on availability.

In the CIA triad, availability is the assurance that users can access the system and network whenever they need to. Human error, natural disasters and distributed denial of service (DDoS) attacks are some of the threats to availability that we should remain alert to. We should even be sceptical of apparent human errors and question whether they are genuine mistakes or the work of an adversary passing themselves off as an authorised user. Hence, we should take precautions by implementing controls such as access control, configuration management, a business continuity plan (BCP) and a disaster recovery plan (DRP).

A DRP sets out the processes an organisation follows to recover from critical system and network failures, so that the actions taken during a disruption are consistent. It should clearly state the steps to take during a major disruption to protect the organisation from disaster. During planning, organisations should consider their business functions, identify their critical systems and the time required to recover each of them, and maintain a backup site if possible. The DRP should be tested and reviewed periodically, especially after major changes to the organisation, such as a reorganisation of business functions or unplanned adjustments to daily operations during a national or global pandemic.
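
As a simple, hypothetical illustration of identifying critical systems and their recovery times, the Python sketch below captures a DRP inventory and sorts it into a recovery priority list; the system names and recovery time objectives (RTOs) are made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class CriticalSystem:
    """One entry in a (hypothetical) DRP inventory."""
    name: str
    rto_hours: float   # recovery time objective: maximum tolerable downtime
    backup_site: bool  # whether a backup site exists for this system

# Illustrative entries only -- real values come from the organisation's own analysis.
inventory = [
    CriticalSystem("customer-facing web portal", rto_hours=1, backup_site=True),
    CriticalSystem("payment gateway", rto_hours=2, backup_site=True),
    CriticalSystem("internal HR system", rto_hours=24, backup_site=False),
]

# Recover the systems with the tightest recovery targets first.
for system in sorted(inventory, key=lambda s: s.rto_hours):
    print(f"{system.name}: recover within {system.rto_hours}h, "
          f"backup site available: {system.backup_site}")
```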

Reviewing and testing the DRP can range from a checklist review, tabletop exercise or structured walkthrough test, where members of the incident response (IR) team meet to talk through the steps of the DRP, to a simulation test in which the IR team exercises the plan without initiating actual recovery procedures. A full interruption test, where the operational site is shut down so that the DRP is executed in full, is strongly discouraged: it is dangerous and, if not properly managed, could itself result in critical business, system or network failure.

A BCP outlines the processes that keep critical business functions operating through an unplanned disruption. All endpoints (workstations, laptops, network infrastructure and so on), application software and manpower resources need to be accounted for. Unplanned events may be natural disasters (e.g., earthquakes, typhoons, tsunamis) or man-made (e.g., war, arson, sabotage, human error). Like the DRP, the BCP needs to be tested and reviewed periodically.

Organisations should also take access controls into consideration when planning their BCP and DRP. Adversaries will seize any opportunity to infiltrate an organisation, especially in times of chaos when security is at its most lax. Access controls should therefore encompass both physical and network security controls.

Organisations should also practise good configuration management. Configuration management is not only about keeping system baselines up to date; it also includes proper documentation control and the ability to detect any changes, planned or unplanned, made to a system.
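
As a minimal sketch of what such detection could look like, the Python example below hashes a set of configuration files and compares them against a previously recorded baseline, flagging any file that has changed since the baseline was approved. The file paths and baseline location are hypothetical.

```python
import hashlib
import json
from pathlib import Path

BASELINE_FILE = Path("config_baseline.json")      # hypothetical baseline store
MONITORED_FILES = [Path("/etc/myapp/app.conf"),   # hypothetical config files
                   Path("/etc/myapp/routes.conf")]

def fingerprint(path: Path) -> str:
    """Return the SHA-256 digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def record_baseline() -> None:
    """Capture the current state of all monitored files as the approved baseline."""
    baseline = {str(p): fingerprint(p) for p in MONITORED_FILES}
    BASELINE_FILE.write_text(json.dumps(baseline, indent=2))

def detect_drift() -> list[str]:
    """Compare current file hashes against the baseline and report any changes."""
    baseline = json.loads(BASELINE_FILE.read_text())
    return [str(p) for p in MONITORED_FILES
            if baseline.get(str(p)) != fingerprint(p)]

if __name__ == "__main__":
    # record_baseline() must have been run once when the configuration was approved
    changed = detect_drift()
    if changed:
        print("Configuration drift detected in:", ", ".join(changed))
    else:
        print("All monitored configuration files match the baseline.")
```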

However, employing all the controls mentioned above does not eliminate the possibility of a single point of failure. It merely reduces the likelihood of one occurring and shortens the downtime should an incident occur.

 

Are there other ways to pre-empt a possible failure? What detection capabilities could we put in place?

We should monitor our system logs for tell-tale signs of impending failure. If senior management and stakeholders are supportive, there is also room to experiment with machine learning (ML) or artificial intelligence (AI) driven predictive models, in which past data is used to estimate the timeframes in which failures may occur. That said, such models should only support the decision-making process; without extensive design, testing and validation, relying entirely on an experimental failure detection system would be risky.
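
As a simple, hypothetical illustration of log-based early warning (well short of a full ML model), the Python sketch below counts ERROR entries in a sliding time window and raises a flag when the rate exceeds a threshold. The log format, window size and threshold are assumptions made for illustration.

```python
from collections import deque
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)   # assumed sliding window
THRESHOLD = 50                  # assumed maximum tolerable errors per window

def watch_error_rate(log_lines):
    """Yield a warning whenever the number of ERROR lines seen within the
    sliding window exceeds THRESHOLD. Assumes lines shaped like
    '2021-10-04T15:40:00 ERROR something went wrong'."""
    recent_errors = deque()
    for line in log_lines:
        try:
            timestamp_str, level, *_ = line.split(maxsplit=2)
            timestamp = datetime.fromisoformat(timestamp_str)
        except ValueError:
            continue  # skip lines that do not match the assumed format
        if level == "ERROR":
            recent_errors.append(timestamp)
        # discard errors that have fallen outside the sliding window
        while recent_errors and timestamp - recent_errors[0] > WINDOW:
            recent_errors.popleft()
        if len(recent_errors) > THRESHOLD:
            yield (f"{timestamp}: {len(recent_errors)} errors in the last "
                   f"{WINDOW} -- possible failure building up")
```

Fed from a tailed log file, a generator like this could surface early warnings to whichever team is designated to monitor them, and the same windowed counts could later serve as input to any predictive model the organisation chooses to explore.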

 

Who, then, should be responsible for monitoring the logs? Should it be our Security Operations Centre (SOC) analyst?

A SOC analyst8 wears many hats. They monitor all activity that flows through the SOC, which can include analysing and reacting to undisclosed software and hardware vulnerabilities, user activity, network alerts and signals from security tools, to name a few. Requiring SOC analysts to watch system logs on top of their existing responsibilities could inadvertently cause alert fatigue, given the sheer number of alerts and events they already have to handle.

 

Should we then pass the job of monitoring the system logs to the system administrators?

This is a possible solution, but the organisation will need proper processes in place so that the system administrators know how to triage any events that are triggered. System owners may not be the first to receive alerts if they are not actively monitoring the system logs, and in such a scenario the response to an incident could be delayed. The DevOps team may therefore be better placed to monitor these alerts, since they are involved in every stage of change management. Should an alert arise while they are configuring or upgrading a system, they will have first-hand knowledge of the change and can try to arrest the situation before it worsens, hopefully resulting in little or no downtime for the organisation.
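
To make the idea concrete, the hypothetical Python sketch below routes an alert to the DevOps engineer who owns an open change window on the affected system, and to the normal monitoring queue otherwise; all names and record structures are invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ChangeWindow:
    """A scheduled configuration or upgrade activity (hypothetical record)."""
    system: str
    start: datetime
    end: datetime
    devops_owner: str  # engineer performing the change

def route_alert(alert_system: str, alert_time: datetime,
                change_windows: list[ChangeWindow]) -> str:
    """Page the DevOps engineer if the alert fires mid-change, else queue it normally."""
    for window in change_windows:
        if window.system == alert_system and window.start <= alert_time <= window.end:
            return f"Page {window.devops_owner} (change in progress on {alert_system})"
    return "Send to standard monitoring queue for triage"

# Hypothetical usage
windows = [ChangeWindow("edge-router-01",
                        datetime(2021, 10, 4, 15, 0),
                        datetime(2021, 10, 4, 17, 0),
                        devops_owner="on-call DevOps engineer")]
print(route_alert("edge-router-01", datetime(2021, 10, 4, 15, 40), windows))
```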

 

Conclusion

The key to success for every project is finding the right balance between security and functionality. Even with “security by design” strongly encouraged, organisations should regularly review security policies such as access controls, change management, the BCP and the DRP, as these play an important role in managing security incidents.

We know that Facebook has taken extensive steps to harden its infrastructure,9 and that this hardening contributed to the long recovery time once the misconfiguration took hold. Perhaps it is time for the company to review its configuration management, BCP and DRP to prevent such prolonged outages from recurring, especially given the current COVID-19 situation and the possibility that a backup site may not be feasible. How far it goes will ultimately depend on its risk appetite, unless it can introduce some fault tolerance mechanism to remove the single point of failure.

 

*Facebook, Inc. changed its company name to Meta Platforms, Inc. on 28 October 2021.10

 

References:

1https://time.com/6105022/instagram-outage-small-business-impact/

2https://engineering.fb.com/2021/10/05/networking-traffic/outage-details/

3https://www.independent.co.uk/life-style/gadgets-and-tech/when-facebook-back-why-not-working-b1932150.html

4https://www.cnbc.com/2021/10/05/facebook-says-sorry-for-mass-outage-and-reveals-why-it-happened.html

5https://www.euronews.com/next/2021/06/09/fastly-outage-what-caused-it-and-do-internet-cdns-have-too-much-power

6https://www.bleepingcomputer.com/news/security/akamai-dns-global-outage-takes-down-major-websites-online-services/

7https://www.techopedia.com/definition/4351/single-point-of-failure-spof

8https://www.csoonline.com/article/3537510/soc-analyst-job-description-salary-and-certification.html

9https://www.independent.co.uk/life-style/gadgets-and-tech/facebook-down-global-outage-cause-b1933119.html

10https://about.fb.com/news/2021/10/facebook-company-is-now-meta/