Degraded

Incident Report: January 14, 2024 Degraded Services

Jan 14 at 05:27am HST
Affected services
api.layerfi.com

Resolved
Jan 14 at 10:48am HST

Summary

On January 14, 2024, from 1:09 AM to 12:48 PM PT, Layer’s API services were degraded. Requests failed intermittently throughout this period, and more than 50% of requests failed between 7:00 AM and 12:48 PM PT.

Impact

  1. End-user SMB dashboards intermittently failed to load. SMBs would see loading states and error states. Refreshing would succeed in many but not all cases.
  2. Third-party data syncing was delayed, but no data loss occurred.
  3. SMS delivery was significantly reduced, with only 15–20% of expected messages sent. All scheduled and queued messages were sent after recovery.
  4. API operations to update businesses failed. Updates to businesses, including adding Plaid items and processor tokens, may need to be retried.
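
For integrators who need to retry failed business updates, one common approach is exponential backoff with jitter. The sketch below is illustrative only: `retry_with_backoff` and the `operation` callable are hypothetical names, not part of Layer's API or SDKs, and it assumes the update being retried is safe to repeat.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=0.5,
                       max_delay=30.0, retryable=(Exception,),
                       sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is reached.

    `operation` is a hypothetical zero-argument callable that wraps the
    API update to retry. Narrow `retryable` to transient error types
    (timeouts, 5xx responses) in real use.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retryable:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            # Exponential backoff capped at max_delay, with full jitter
            # so that many clients do not retry at the same instant.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))
```

Spreading retries out this way avoids adding a synchronized burst of traffic to a service that is still recovering.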

Root Cause

At approximately 1 AM PT this morning, a customer began bulk loading historical data to our /payouts endpoint, which had no rate limiting. Because these operations were not batched, rate limited, or optimized for bulk loading, they caused contention for our server’s database connections.
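
To illustrate what batching and pacing a bulk load can look like on the client side, here is a minimal sketch. It is an assumption-laden example, not Layer's recommended integration: `send_batch` is a hypothetical callback that submits one group of records however the API requires, and the batch size and pause are arbitrary placeholder values.

```python
import time
from itertools import islice


def chunked(records, size):
    """Yield successive lists of at most `size` records."""
    it = iter(records)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def paced_bulk_load(records, send_batch, batch_size=100,
                    pause_seconds=1.0, sleep=time.sleep):
    """Send records in fixed-size batches with a pause after each one.

    `send_batch` is a hypothetical callable supplied by the caller.
    Returns the total number of records sent.
    """
    sent = 0
    for batch in chunked(records, batch_size):
        send_batch(batch)
        sent += len(batch)
        sleep(pause_seconds)  # spread load instead of sending all at once
    return sent
```

Pausing between batches bounds how much work reaches the server at any moment, which is the property the unthrottled bulk load lacked.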

Resolution

Layer engineers became aware of the issue early this morning. Unfortunately, the issue caused multiple secondary problems that we mistakenly believed to be the root cause. The root cause was identified at 12:30 PM PT, rate limits were added to the sensitive endpoint, and full functionality was restored at 12:48 PM PT.

Follow-Up Actions

We’re implementing the following safeguards over the next 24 hours:

  1. Rate Limiting: Layer is standardizing rate limit safeguards across all endpoints. We are setting limits that will not affect existing usage patterns and are intended to guard against exceptional scenarios like bulk loading large datasets. Documentation on all rate limits will be added to Layer’s API documentation shortly.
  2. Profiling and Stress Testing: We are beginning a round of stress testing to simulate high loads on multiple request flows. We will be dedicating time to improving performance-sensitive workflows.
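
As a rough illustration of how a per-endpoint rate limit can tolerate normal traffic while rejecting sustained bursts, the sketch below implements a token bucket. This is one common technique and is not necessarily what Layer deployed; the `TokenBucket` class and its `rate` and `capacity` parameters are hypothetical.

```python
import time


class TokenBucket:
    """Allow up to `capacity` requests in a burst, refilled at `rate` per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.clock = clock          # injectable for testing
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self, cost=1.0):
        """Return True and consume tokens if the request may proceed."""
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A server would typically keep one bucket per customer and endpoint and respond with HTTP 429 when `allow()` returns False. Setting `capacity` well above normal burst sizes leaves existing usage patterns unaffected.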

Created
Jan 14 at 05:27am HST

Requests are intermittently failing.