Learning Postmortem Analysis through Metrics and KPIs: A Case Study of a Banking App Outage
Introduction
A postmortem analysis is the process of investigating and documenting the cause of a failure or outage. It is an essential tool for improving the reliability and performance of systems.
Metrics and KPIs (key performance indicators) supply the data for that investigation. They quantify the impact of the failure and help the team identify its root cause and decide on corrective actions.
A postmortem is an important step in the lifecycle of an always-on service. The findings from your postmortem should feed right back into your planning process. This ensures that the critical remediation work identified in the postmortem finds a place in upcoming work and is balanced against other upcoming work and priorities.
In this post, we'll discuss how postmortems and incident reports help with troubleshooting web stacks, then walk through a case study that demonstrates the value of thorough investigation and documentation.
The Significance of Incident Reports and Postmortems in Software Debugging
Identify Root Causes: Postmortems and incident reports document what failed and why, so teams can trace an incident to its underlying cause and fix that cause rather than its symptoms.
Implement Preventive Measures: Once teams have identified the cause of the issue, they can establish protective measures to prevent similar occurrences in the future, thereby improving the system's overall reliability and stability.
Enhance Communication: Postmortems and incident reports give a clear, detailed account of what happened during the incident, so every team member works from the same understanding of the issue and its resolution. That shared understanding, together with a culture of transparency, helps the team collaborate on preventing similar issues and makes the software development process more effective overall.
Metrics and KPIs for Postmortem Analysis
Number of customers affected
Duration of the outage
Impact of the outage on customer satisfaction
Cost of the outage
Root cause of the outage
Corrective actions
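The first four items quantify the impact of the outage, while the root cause and corrective actions are the findings the analysis produces from that data. Tracking all of them gives you a better understanding of the outage and a basis for preventing future ones.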
Case Study: Banking App Outage
On February 15th, 2023, between 9:00 a.m. and 12:00 p.m. (WAT), a major banking app experienced an outage that lasted roughly three hours. During the outage, customers could not access their accounts, make payments, or transfer funds. The bank's IT team conducted a postmortem analysis to investigate the cause. They gathered data on a variety of metrics and KPIs, including:
The number of customers affected by the outage
The duration of the outage
The impact of the outage on customer satisfaction
The cost of the outage to the bank
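As a rough illustration, the metrics above could be computed along the lines of the following Python sketch. Only the outage window (9:00 a.m. to 12:00 p.m. WAT on February 15th, 2023) comes from the case study. The customer counts and the cost per minute are hypothetical placeholders, because this post does not report the bank's actual figures, and the OutageMetrics class and its field names are assumptions made for the example. Customer satisfaction is left out because it would depend on survey data.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# West Africa Time is UTC+1; a fixed offset avoids needing a tz database.
WAT = timezone(timedelta(hours=1), name="WAT")


@dataclass
class OutageMetrics:
    """Hypothetical container for the impact metrics of one outage."""
    start: datetime
    end: datetime
    customers_affected: int   # placeholder input, not from the case study
    total_customers: int      # placeholder input, not from the case study
    cost_per_minute: float    # placeholder input, not from the case study

    @property
    def duration(self) -> timedelta:
        return self.end - self.start

    @property
    def affected_share(self) -> float:
        return self.customers_affected / self.total_customers

    @property
    def estimated_cost(self) -> float:
        minutes = self.duration.total_seconds() / 60
        return minutes * self.cost_per_minute


outage = OutageMetrics(
    start=datetime(2023, 2, 15, 9, 0, tzinfo=WAT),
    end=datetime(2023, 2, 15, 12, 0, tzinfo=WAT),
    customers_affected=40_000,
    total_customers=100_000,
    cost_per_minute=50.0,
)

print(outage.duration)                  # 3:00:00
print(f"{outage.affected_share:.0%}")   # 40%
print(outage.estimated_cost)            # 9000.0
```

Keeping the raw inputs separate from the derived values makes it easy to recompute the figures if the outage window or the cost estimate is revised during the postmortem.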
Root Cause: Through their analysis, the IT team identified a database failure as the immediate cause of the outage.
The engineering team was able to resolve the issue, but the investigation was lengthy and followed several misleading paths before it reached the underlying problem.
When the issue was detected, the team began investigating the database, suspecting that it may have been the root cause. After reviewing the database logs and monitoring tools, the team found that the database was experiencing high load and was unable to handle the volume of requests coming from the banking app.
The team first tried increasing the database's capacity and optimizing its configuration, but the outage persisted. Only after exploring other potential causes, including network issues and server failures, did the team identify the actual problem: a misconfiguration in the database's replication setup.
The misconfiguration caused the replication process to fail and the database to become unavailable. The team corrected the configuration and restored database replication, which resolved the outage.
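This post does not name the database engine or the exact setting that was wrong, so the following Python sketch is engine-agnostic and purely illustrative. It shows one way a replication health check might flag a stopped or lagging replica early in an investigation. The ReplicaStatus fields, the check_replica function, and the 30-second lag threshold are all assumptions. In practice, the values would come from whatever status view or command the database provides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReplicaStatus:
    """Hypothetical snapshot of one replica, filled from the database's own status output."""
    name: str
    replication_running: bool
    lag_seconds: Optional[float]  # None when the lag cannot be measured


def check_replica(status: ReplicaStatus, max_lag_seconds: float = 30.0) -> list[str]:
    """Return a list of human-readable problems; an empty list means healthy."""
    problems: list[str] = []
    if not status.replication_running:
        problems.append(f"{status.name}: replication is not running")
    if status.lag_seconds is None:
        problems.append(f"{status.name}: replication lag is unknown")
    elif status.lag_seconds > max_lag_seconds:
        problems.append(
            f"{status.name}: lag of {status.lag_seconds:.0f}s "
            f"exceeds {max_lag_seconds:.0f}s"
        )
    return problems


# Example snapshots (made-up values for illustration only).
replicas = [
    ReplicaStatus("replica-1", replication_running=True, lag_seconds=2.0),
    ReplicaStatus("replica-2", replication_running=False, lag_seconds=None),
]

for replica in replicas:
    for problem in check_replica(replica):
        print(problem)
# replica-2: replication is not running
# replica-2: replication lag is unknown
```

Returning a list of problems instead of a single boolean keeps every finding visible, which can help when an investigation is chasing several possible causes at once.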
After conducting the postmortem analysis, the engineering team identified several areas that needed to be addressed to prevent similar issues from arising in the future. These areas include:
Database Monitoring: The team lacked adequate monitoring to detect database issues before they escalated. It implemented a new monitoring system to identify issues proactively, before they affect users.
Disaster Recovery: The team had no comprehensive disaster recovery plan for database failures. It established a new plan and tests it regularly to confirm that it will be effective in the event of a failure.
Load Testing: The team had not load tested the system adequately, which left the database prone to being overwhelmed during peak usage periods. It implemented a new load testing process to verify that the system can handle the expected volume of traffic.
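The load testing item above could look something like the following Python sketch, which uses only the standard library. It fires a batch of concurrent calls and reports the error rate and the 95th-percentile latency. This post does not describe the team's actual tooling, so run_load_test and stub_request are hypothetical names, and the stub stands in for a real request to the banking app so that the example runs without a network.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import quantiles
from typing import Callable


def run_load_test(
    send_request: Callable[[int], None],
    total_requests: int,
    concurrency: int,
) -> dict:
    """Call send_request total_requests times using a pool of worker threads."""

    def timed_call(request_id: int):
        started = time.perf_counter()
        try:
            send_request(request_id)
            ok = True
        except Exception:
            ok = False  # any exception counts as a failed request
        return ok, time.perf_counter() - started

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed_call, range(total_requests)))

    latencies = [elapsed for _, elapsed in results]
    failures = sum(1 for ok, _ in results if not ok)
    return {
        "requests": total_requests,
        "error_rate": failures / total_requests,
        # quantiles(n=20) returns 19 cut points; the last is the 95th percentile.
        "p95_seconds": quantiles(latencies, n=20)[-1],
    }


def stub_request(request_id: int) -> None:
    """Stand-in for a real call to the app (hypothetical): every 10th call fails."""
    time.sleep(0.001)
    if request_id % 10 == 0:
        raise RuntimeError("simulated failure")


report = run_load_test(stub_request, total_requests=200, concurrency=20)
print(report["error_rate"])  # 0.1
print(f"p95 latency: {report['p95_seconds']:.4f}s")
```

In a real test, the stub would be replaced by a function that sends a request to a staging environment, and the request count and concurrency would be chosen to match expected peak traffic.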
Conclusion
This case study emphasizes the importance of thorough investigation, documentation, and postmortems in web stack debugging, particularly in the banking industry. By identifying the root cause of the problem and establishing measures to prevent similar issues, the engineering team restored the banking app's availability and reduced the risk that users will lose access to their accounts and transactions again.