Client Dashboard Intermittent Degradation
Resolved
Jul 29 at 08:20pm HDT
Summary
On July 29, Layer engineers discovered an ongoing issue that intermittently caused customer dashboards to hang indefinitely while loading data. The Layer team identified that the issue had been occurring intermittently for 2.5 weeks, affecting a total of 695 end users. A full fix was identified and pushed 90 minutes after identification, and the affected customer platform was notified.
Impact
- For periods of 30 seconds to 7 minutes, customer dashboards would remain in a loading state.
- The Profit and Loss report would not load.
- Other functionality, including Bank Transaction categorization, continued to function normally.
- A total of 695 users were impacted over 2.5 weeks.
- Only one Layer customer platform was affected.
Timeline
2025-03: A low-impact exception indicating a full data processing queue is first thrown in the Layer system. At the time, it occurs in a non-critical system and has no user impact.
2025-07-13: A legacy system migration completes, and additional load is migrated to a new processing queue. The queue begins to be intermittently overwhelmed, at first for just seconds at a time. The full data processing queue exception is thrown, but the Layer engineering team believes it to have the same cause as the previous low-impact exception, and no action is taken.
2025-07-29: Layer engineers notice that profit and loss queries are not loading for a customer while testing an unrelated change. Immediate investigation reveals the issue is widespread for end users with specific internal settings.
Resolution
- Investigation traced the error back to the sensitive resource queue.
- Further investigation revealed that the queue was full because it shared a resource with multiple other work streams.
- Queue resources were separated and normal performance resumed.
- Client experience was manually verified to be working again, and logging was put in place to gain visibility into any recurrence.
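The failure mode and the fix described above can be illustrated with a minimal sketch. It is not Layer's implementation: the queue sizes, the work stream names (`migrated_batch`, `dashboard_query`), and the `enqueue` helper are all hypothetical, and Python's standard `queue.Queue` stands in for whatever queueing system is actually in use. It shows how one bounded queue shared across work streams can reject a latency-sensitive request, while separate queues keep that request unaffected.

```python
import queue


def enqueue(q, item):
    """Try to add an item without blocking; return False if the queue is full."""
    try:
        q.put_nowait(item)
        return True
    except queue.Full:
        return False


# Shared resource (hypothetical): one bounded queue serves every work stream.
shared = queue.Queue(maxsize=4)
for i in range(10):
    # Bulk load, e.g. migrated from a legacy system, fills the queue.
    enqueue(shared, ("migrated_batch", i))
# The latency-sensitive dashboard request now finds the queue full.
shared_accepted = enqueue(shared, ("dashboard_query", "profit_and_loss"))

# Separated resources (hypothetical): each work stream gets its own queue.
queues = {
    "migrated_batch": queue.Queue(maxsize=4),
    "dashboard_query": queue.Queue(maxsize=4),
}
for i in range(10):
    enqueue(queues["migrated_batch"], ("migrated_batch", i))
# The bulk queue is full, but the dashboard queue is unaffected.
isolated_accepted = enqueue(
    queues["dashboard_query"], ("dashboard_query", "profit_and_loss")
)

print(shared_accepted, isolated_accepted)  # prints: False True
```

Separating the queues trades some total capacity for isolation: a burst in one work stream can no longer starve the others.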
Next Steps
- [Done] Implement safeguards against sharing resources across performance-sensitive queues.
- Establish structured triage of ongoing exceptions, including those initially believed to be unimportant.
- Implement specific alerting for performance degradations of customer requests.
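As one possible shape for the alerting item above, the sketch below tracks recent customer request durations in a sliding window and signals when too many of them are slow. The `LatencyMonitor` class, the 10-second threshold, the window size, and the 20% ratio are all assumptions chosen for illustration, not values from Layer's systems.

```python
from collections import deque


class LatencyMonitor:
    """Sliding-window check on the share of slow requests (hypothetical)."""

    def __init__(self, slow_seconds=10.0, window=20, max_slow_ratio=0.2):
        self.slow_seconds = slow_seconds
        self.max_slow_ratio = max_slow_ratio
        # Each entry records whether one recent request was slow.
        self.samples = deque(maxlen=window)

    def record(self, duration_seconds):
        """Record the duration of one completed (or timed-out) request."""
        self.samples.append(duration_seconds >= self.slow_seconds)

    def should_alert(self):
        """Return True once the window is full and too many requests are slow."""
        if len(self.samples) < self.samples.maxlen:
            return False
        return sum(self.samples) / len(self.samples) > self.max_slow_ratio


monitor = LatencyMonitor()
for _ in range(20):
    monitor.record(0.4)  # normal dashboard loads
healthy = monitor.should_alert()

for _ in range(5):
    monitor.record(30.0)  # a burst of hanging loads
degraded = monitor.should_alert()

print(healthy, degraded)  # prints: False True
```

A ratio over a window, rather than a single slow request, is used here so that short intermittent stalls still accumulate into a signal without every isolated slow request paging someone.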
Affected services