Starting at 3:30 pm UTC on October 21st, part of Square’s transaction-reporting infrastructure began falling behind when processing incoming transactions. While this did not affect the execution of the transactions themselves, reporting on these transactions was either missing or out of date for several hours afterwards. This affected functionality such as the Transactions applets, as well as real-time systems like the Kitchen Display System. The outage lasted up to six hours for certain systems and reporting subsets.
In this post mortem recap, we’ll communicate the root cause of this disruption, document the steps that we took to diagnose and resolve the disruption, and share our analysis and actions to avoid service interruptions like this in the future.
15:30: Square’s transaction-reporting infrastructure started falling behind in processing incoming transactions. This disrupted real-time systems like the Kitchen Display System and several other applications that depend on this data, such as the Transactions applets. The slowdown in turn disrupted workflows like issuing refunds, attaching customers to transactions, and sending receipts.
15:35: Internal system alerts notified the team of the transaction processing slowdown – from there, an incident was reported, and we determined that the likely cause was a cloud migration that took place earlier in the week.
16:35: The team contacted our cloud provider, and database engineers joined the investigation to begin searching for the root cause.
18:30: Engineers found the databases to be healthy and determined that the number of jobs that were running to process these transactions could be increased for parallelism. We increased the number of jobs and started to see an increase in transactions processed.
20:30: We continued to increase the number of jobs to process all of the backlogged transactions. Teams saw the dashboard recover and all orders processed by 22:30.
22:30: At this time, we marked the incident as stable.
23:30: Within an hour, payroll and team management systems caught up and fully recovered.
This incident revealed areas of improvement for both our technical infrastructure and our engineering processes, several of which have already been implemented.
Earlier in the week, we migrated our reporting infrastructure to the cloud to improve the scalability and reliability of our systems. Significant effort and diligence went into this migration to make sure there were no disruptions to seller workflows, although we were not able to stress-test our systems for peak traffic.
The root cause of this issue was that the latency to the new infrastructure was higher than it had been previously. While this is not a problem during normal traffic, weekend peak traffic exposed it: with each transaction taking longer to process, the existing number of jobs could not keep up with the incoming volume. Since this is a new environment, it took us some time to figure out what was causing the system to fall behind. We initially posited that the databases might be underprovisioned, which was not the case. We later realized that scaling up the number of jobs to process these transactions could help. After modestly increasing the job count, we saw a rise in throughput, and from there, we scaled up job processing incrementally and cleared all backlogged transactions within an hour.
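The relationship between latency, job count, and throughput described above can be illustrated with a back-of-the-envelope model. This is only a sketch: it assumes each job processes one transaction at a time, and the function names and all numbers are hypothetical, not taken from Square's systems.

```python
import math


def max_throughput(jobs: int, latency_ms: int) -> float:
    """Upper bound on transactions per second, assuming each job
    handles one transaction at a time (hypothetical simplification)."""
    return jobs * 1000 / latency_ms


def jobs_needed(arrival_rate: int, latency_ms: int) -> int:
    """Smallest job count whose throughput matches the arrival rate."""
    return math.ceil(arrival_rate * latency_ms / 1000)


def drain_time_s(backlog: int, jobs: int, latency_ms: int, arrival_rate: int) -> float:
    """Seconds needed to clear a backlog while new transactions keep arriving.
    Returns infinity when the jobs cannot even keep pace with arrivals."""
    surplus = max_throughput(jobs, latency_ms) - arrival_rate
    if surplus <= 0:
        return math.inf
    return backlog / surplus


# Illustrative numbers only: 20 jobs, latency rising from 20 ms to 50 ms,
# and a peak of 600 transactions per second.
before = max_throughput(20, 20)   # 1000.0 tx/s: comfortably above the peak
after = max_throughput(20, 50)    # 400.0 tx/s: below the peak, so a backlog grows
required = jobs_needed(600, 50)   # 30 jobs just to keep pace at the higher latency
```

In this model, throughput scales linearly with the job count and inversely with latency, so a latency increase that is harmless at normal traffic can push capacity below the arrival rate at peak. Adding jobs restores capacity without changing the latency itself.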
Additionally, we noticed that testing changes such as increasing the job count and scaling up databases took longer than it should have, which delayed recovery. We identified several improvements: first, cleaning up conflicting configuration files that had the job counts configured in them, and second, improving tooling so that we can perform quick deploys in the cloud. We have also made a permanent change to scale up the number of jobs that process incoming transactions, to ensure we can handle increased traffic throughout the holiday season. Finally, we have taken action to make our alerts more sensitive and to devise a plan to stress-test our systems periodically.