Summary
Between June 8 and June 9, 2025, we experienced multiple service disruptions, manifesting as increased latency and request failures across our HTTP APIs. These incidents were traced to the behavior of the PostgreSQL autovacuum process and were addressed through operational mitigations and follow-up investigation.
Incident Timeline
June 8, 2025:
June 9, 2025:
Root Cause
An unexpected PostgreSQL autovacuum behavior (specifically, autovacuum acquiring an ACCESS EXCLUSIVE lock, which it rarely needs to do) caused a series of latency spikes and request failures across our system. The trigger was a data retention job rolled out in early May 2025, which deleted large volumes of old data from a key table backing our authorization logic.
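For illustration, the sketch below shows the general shape of a batched retention job that can produce this pattern: each batch is deleted in a short transaction, but the deletions still leave behind pages that autovacuum must eventually clean up. The table and column names (auth_events, created_at), the DSN, and the batch size are assumptions for the example, not the actual job.

```python
# Hypothetical sketch of a batched data-retention job (illustrative names only).
# Small batches keep each transaction short, but the deleted rows still leave
# behind dead tuples and, eventually, empty heap pages for autovacuum.
import psycopg2

BATCH_SIZE = 10_000  # assumed batch size

def purge_expired_rows(dsn: str) -> int:
    """Delete rows older than 30 days in batches; return total rows deleted."""
    total = 0
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            while True:
                cur.execute(
                    """
                    DELETE FROM auth_events
                    WHERE ctid IN (
                        SELECT ctid FROM auth_events
                        WHERE created_at < now() - interval '30 days'
                        LIMIT %s
                    )
                    """,
                    (BATCH_SIZE,),
                )
                conn.commit()
                if cur.rowcount == 0:
                    break
                total += cur.rowcount
    return total
```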
At first, the system appeared stable. However, 30 days after the initial clean-up job was completed, PostgreSQL's autovacuum initiated heap truncation, an optimization that reclaims empty pages at the end of a table. The delay occurred because a page can only be truncated once it is completely empty, and under our 30-day retention policy the trailing pages did not fully empty out until a month later.
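This buildup can be observed through PostgreSQL's pg_stat_user_tables view, which reports dead-tuple counts and the last (auto)vacuum time per table. A minimal sketch, again assuming a hypothetical auth_events table:

```python
# Inspect dead-tuple counts and vacuum timestamps for the affected table.
# The table name is an assumption for illustration.
import psycopg2

def autovacuum_stats(dsn: str, table: str = "auth_events"):
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(
                """
                SELECT n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
                FROM pg_stat_user_tables
                WHERE relname = %s
                """,
                (table,),
            )
            return cur.fetchone()
```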
Crucially, heap truncation requires an ACCESS EXCLUSIVE lock, the strictest lock mode in PostgreSQL, which blocks all reads and writes to the table. This was unexpected: autovacuum is normally non-blocking.
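During an event like this, the lock is visible in pg_locks; joining to pg_stat_activity confirms whether the holder is an autovacuum worker (backend_type 'autovacuum worker'). A sketch of such a diagnostic query, with the table name assumed:

```python
# List sessions holding or waiting on an ACCESS EXCLUSIVE lock on the table.
# An autovacuum-driven truncation shows up with backend_type 'autovacuum worker'.
# The table name is illustrative.
import psycopg2

LOCK_QUERY = """
SELECT l.pid, l.mode, l.granted, a.backend_type, a.query
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE l.relation = %s::regclass
  AND l.mode = 'AccessExclusiveLock'
"""

def access_exclusive_holders(dsn: str, table: str = "auth_events"):
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(LOCK_QUERY, (table,))
            return cur.fetchall()
```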
By default, PostgreSQL enables heap truncation, but it rarely occurs and is generally harmless. It only becomes problematic on very busy tables, like ours, where even a brief exclusive lock can significantly disrupt live traffic.
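For reference, PostgreSQL 12 and later allow truncation to be disabled per table via the vacuum_truncate storage parameter, or skipped for a single run with VACUUM (TRUNCATE false). The snippet below is a sketch of that option rather than a statement of the exact mitigation applied here; the table name is again assumed.

```python
# Disable autovacuum heap truncation for a single hot table (PostgreSQL 12+).
# This trades some unreclaimed disk space at the end of the table for never
# taking the ACCESS EXCLUSIVE lock that truncation requires.
import psycopg2

def disable_heap_truncation(dsn: str, table: str = "auth_events") -> None:
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            # Table name must come from a trusted source; identifiers cannot
            # be passed as query parameters.
            cur.execute(f"ALTER TABLE {table} SET (vacuum_truncate = off)")
    # A one-off manual vacuum can also skip truncation explicitly:
    #   VACUUM (TRUNCATE FALSE) auth_events;
```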
Mitigations
Conclusion
These incidents were caused by an unintended interaction between large-scale data retention and PostgreSQL's autovacuum behavior on critical tables. Remediation steps are now in place to prevent recurrence, and process adjustments are underway to better assess such risks before future retention rollouts.