Square Status: Spain

Multiple Services Disruption
Incident Report for Square Spain
Postmortem

Incident Summary

Starting on October 21 at 15:30 UTC, part of Square’s transaction-reporting infrastructure began falling behind in processing incoming transactions. While this did not affect the execution of the transactions themselves, reporting on these transactions was either missing or out of date for several hours afterwards. This affected functionality such as the Transactions applets, as well as real-time systems like the Kitchen Display System. The outage lasted up to six hours for certain systems and reporting subsets.

In this postmortem, we’ll communicate the root cause of this disruption, document the steps we took to diagnose and resolve it, and share our analysis and the actions we are taking to avoid service interruptions like this in the future.

Timeline (UTC)

15:30: Square’s transaction-reporting infrastructure started falling behind in processing incoming transactions. This disrupted real-time systems like the Kitchen Display System and several other applications that depend on this data, such as the Transactions applets. The slowdown in turn disrupted workflows like issuing refunds, attaching customers to transactions, and sending receipts.

15:35: Internal system alerts notified the team of the transaction-processing slowdown. An incident was reported, and we determined that the likely cause was a cloud migration that took place earlier in the week.

16:35: The team contacted our cloud provider, and database engineers joined the investigation to begin searching for the root cause.

18:30: Engineers found the databases to be healthy and determined that the number of jobs running to process these transactions could be increased for greater parallelism. We increased the number of jobs and started to see an increase in transactions processed.

20:30: We continued to increase the number of jobs to process all of the backlogged transactions. Teams saw the dashboard recover, and all orders were processed by 22:30.

22:30: At this time, we marked the incident as stable. 

23:30: Within an hour, payroll and team management systems caught up and fully recovered.

Analysis

This incident revealed areas of improvement for both our technical infrastructure and our engineering processes, several of which have already been implemented.

Earlier in the week, we migrated our reporting infrastructure to the cloud to improve the scalability and reliability of our systems. Significant effort and diligence went into this migration to make sure there were no disruptions to seller workflows, although we were not able to stress-test our systems for peak traffic. 

The root cause of this issue was that latency to the new infrastructure was higher than it had been previously. While this is not a problem under normal traffic, weekend peak traffic exposed the issue: with higher latency per transaction, the system could not keep up with the incoming volume. Since this is a new environment, it took us some time to figure out what was causing the system to fall behind. We initially suspected that the databases might be underprovisioned, which was not the case. We later realized that scaling up the number of jobs processing these transactions could help. After a modest increase in the job count produced a rise in throughput, we scaled up job processing incrementally and cleared all backlogged transactions within an hour.
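
As a rough illustration of why higher latency only hurt at peak traffic, and why adding jobs helped, the sketch below models each job as handling one transaction at a time, so throughput is bounded by job count divided by per-transaction latency. The function names and every number are hypothetical and chosen for easy arithmetic; they are not Square's actual figures or implementation.

```python
def throughput_per_second(job_count: int, latency_ms: int) -> float:
    """Upper bound on throughput when each job handles one transaction at a time."""
    return job_count * 1000 / latency_ms


def drain_time_seconds(backlog: int, arrivals_per_second: float,
                       job_count: int, latency_ms: int) -> float:
    """Seconds needed to clear a backlog; infinite if arrivals outpace processing."""
    surplus = throughput_per_second(job_count, latency_ms) - arrivals_per_second
    if surplus <= 0:
        return float("inf")
    return backlog / surplus


# All numbers below are hypothetical, chosen only to make the arithmetic easy.
PEAK_ARRIVALS = 500  # transactions per second at weekend peak

before = throughput_per_second(job_count=20, latency_ms=10)  # old environment
after = throughput_per_second(job_count=20, latency_ms=50)   # higher latency, same jobs
scaled = throughput_per_second(job_count=50, latency_ms=50)  # higher latency, more jobs

print(before, after, scaled)
print(drain_time_seconds(900_000, PEAK_ARRIVALS, job_count=50, latency_ms=50))
```

With these assumed values, the unchanged job count falls below the peak arrival rate once latency rises, so the backlog grows without bound. Raising the job count restores a surplus, and the backlog then drains at a rate equal to that surplus.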

We also found that testing changes such as increasing the job count and scaling up databases took longer than expected, which delayed recovery. We identified several improvements: first, cleaning up conflicting configuration files that each set the job count, and second, improving tooling so that we can deploy quickly in the cloud. We have also permanently scaled up the number of jobs that process incoming transactions to ensure we can handle increased traffic throughout the holiday season. Finally, we have taken action to make our alerts more sensitive and to devise a plan to stress-test our systems periodically.
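
One common way to make this kind of alert more sensitive is to alert on processing lag, meaning the age of the oldest unprocessed transaction, rather than on outright failure. The sketch below is a minimal example of that idea. The function names and the five-minute threshold are assumptions for illustration and do not describe Square's actual alerting.

```python
from datetime import datetime, timedelta, timezone


def processing_lag(oldest_unprocessed_at: datetime, now: datetime) -> timedelta:
    """Age of the oldest unprocessed transaction, never negative."""
    return max(now - oldest_unprocessed_at, timedelta(0))


def should_alert(lag: timedelta, threshold: timedelta) -> bool:
    """Fire once the pipeline is at least `threshold` behind."""
    return lag >= threshold


# Hypothetical threshold; a lower value makes the alert more sensitive.
LAG_THRESHOLD = timedelta(minutes=5)

oldest = datetime(2023, 10, 21, 15, 30, tzinfo=timezone.utc)
now = datetime(2023, 10, 21, 15, 37, tzinfo=timezone.utc)

lag = processing_lag(oldest, now)
print(lag, should_alert(lag, LAG_THRESHOLD))
```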

Posted Nov 13, 2023 - 20:57 CET

Resolved
We have confirmed with our engineering team that this disruption has been resolved and Reporting is working as intended. We appreciate your patience today as we worked to resolve this.
Posted Oct 22, 2023 - 04:39 CEST
Update
Reports and services now contain the latest information and are working as expected. However, we continue to monitor the progress of Team Member Tips and Commission Reporting. For Square Staff and Square Payroll, at this time, tips and commissions imported today may not reflect the correct value until this disruption is fully resolved. We will post another update as soon as it becomes available. Thank you for your patience and understanding.
Posted Oct 22, 2023 - 02:57 CEST
Update
Reports and services now contain the latest information and are working as expected. However, we continue to monitor the progress of Team Member Tips and Commission Reporting. We will post another update as soon as it becomes available. Thank you for your patience and understanding.
Posted Oct 22, 2023 - 02:10 CEST
Update
At this time, most Reports and Services are up-to-date. However, the reporting of Cash and Other Tender Type Transactions is still in progress. Team Member Tip and Commission reports are also progressing. We will continue providing updates as they occur. Thank you for your patience and understanding.
Posted Oct 22, 2023 - 01:45 CEST
Monitoring
We continue to monitor the progress of the delayed reporting and services to ensure all services are fully functional again. We will provide updates as they become available.

As a reminder, Payment Acceptance is not affected by this disruption and Sellers may continue processing payments as the delayed reporting update progresses. Thank you for sticking with us while we resolve this.
Posted Oct 21, 2023 - 23:46 CEST
Update
Our engineers have confirmed the disruption with Square KDS has been resolved. We are still monitoring the progress of the delayed reporting and services. As a reminder, Payment Acceptance is not affected by this disruption and Sellers may continue processing payments as the delayed reporting update progresses.

Thank you for your continued patience and understanding.
Posted Oct 21, 2023 - 23:23 CEST
Update
Our engineers have identified the root cause of this disruption and are actively working on a fix. We are now monitoring the progress of the delayed reporting and services. As a reminder, Payment Acceptance is not affected by this disruption and Sellers may continue processing payments as the delayed reporting update progresses. Thank you for your continued patience and understanding.
Posted Oct 21, 2023 - 23:07 CEST
Update
Our engineers are all hands on deck addressing this situation. To share more clarity: at this time, the delay extends to 45 minutes. This means that if you complete a payment transaction, it will likely appear in your reports after 45 minutes. We understand the importance of your reports being accurate, and we are working hard to get this situation addressed. Thank you for your continued patience.
Posted Oct 21, 2023 - 21:44 CEST
Update
We wanted to share an update on the currently affected services. At this time, all reporting, Customer Directory, Transaction Reporting, Balance Reporting, Refund Requests, adding customers to sales or Appointments, and Square KDS orders are delayed. However, it is completely safe to process transactions, as these are not affected by this disruption. We continue to follow the progress of a fix and will keep providing updates here. Thank you for your patience.
Posted Oct 21, 2023 - 20:06 CEST
Update
Our engineers continue to work on addressing the current disruption. At this time, they are working on reducing the delay in reports and services while also working on a fix for the disruption. As a reminder, payment processing is not affected by this disruption, only the reporting of sales afterward. Thank you for your continued patience.
Posted Oct 21, 2023 - 20:06 CEST
Update
Engineers have confirmed that the current disruption is only affecting services and reporting. At this time, it is completely safe to continue processing payments. However, the reporting of these transactions is currently delayed. We will continue providing updates as they occur.
Posted Oct 21, 2023 - 19:35 CEST
Identified
Our engineers have identified a disruption that is causing the reporting of Sales, Inventory and other reports to be delayed. We know how important this is for you and your business and our team is actively working on a solution. Thank you for your patience and understanding. We will continue providing updates as they occur.
Posted Oct 21, 2023 - 18:47 CEST
Investigating
We are currently experiencing a disruption that is impacting some Square services. We understand how important it is for your business for all of our services to be up and running, and our Engineering team is actively working on a fix. Thank you for your patience with us as we work to resolve this issue.
Posted Oct 21, 2023 - 18:44 CEST