Summary
Between June 8 and June 9, 2025, we experienced multiple service disruptions, manifesting as increased latency and request failures across our HTTP APIs. These incidents were traced to the behavior of the PostgreSQL autovacuum process and were addressed through operational mitigations and follow-up investigation.
Incident Timeline
June 8, 2025:
June 9, 2025:
Root Cause
An unexpected PostgreSQL autovacuum behavior (specifically, autovacuum acquiring an ACCESS EXCLUSIVE lock, which it rarely needs to do) caused a series of latency spikes and request failures across our system. The trigger was a data retention job rolled out in early May 2025, which deleted large volumes of old data from a key table backing our authorization logic.
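For illustration, the sketch below shows the general shape of a batched retention job that can produce this pattern: each batch is deleted in a short transaction, but the deletions still leave behind pages that autovacuum must eventually clean up. The table and column names (auth_events, created_at), the DSN, and the batch size are assumptions for the example, not the actual job.

```python
# Hypothetical sketch of a batched data-retention job (illustrative names only).
# Small batches keep each transaction short, but the deleted rows still leave
# behind dead tuples and, eventually, empty heap pages for autovacuum.
import psycopg2

BATCH_SIZE = 10_000  # assumed batch size

def purge_expired_rows(dsn: str) -> int:
    """Delete rows older than 30 days in batches; return total rows deleted."""
    total = 0
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            while True:
                cur.execute(
                    """
                    DELETE FROM auth_events
                    WHERE ctid IN (
                        SELECT ctid FROM auth_events
                        WHERE created_at < now() - interval '30 days'
                        LIMIT %s
                    )
                    """,
                    (BATCH_SIZE,),
                )
                conn.commit()
                if cur.rowcount == 0:
                    break
                total += cur.rowcount
    return total
```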
At first, the system appeared stable. However, 30 days after the initial clean-up job was completed, PostgreSQL's autovacuum initiated heap truncation, an optimization that reclaims empty pages at the end of a table. The delay occurred because a page can only be truncated once it is completely empty, and under our 30-day retention policy the trailing pages did not fully empty out until a month later.
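This buildup can be observed through PostgreSQL's pg_stat_user_tables view, which reports dead-tuple counts and the last (auto)vacuum time per table. A minimal sketch, again assuming a hypothetical auth_events table:

```python
# Inspect dead-tuple counts and vacuum timestamps for the affected table.
# The table name is an assumption for illustration.
import psycopg2

def autovacuum_stats(dsn: str, table: str = "auth_events"):
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(
                """
                SELECT n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
                FROM pg_stat_user_tables
                WHERE relname = %s
                """,
                (table,),
            )
            return cur.fetchone()
```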
Crucially, heap truncation requires an ACCESS EXCLUSIVE lock, the strictest lock mode in PostgreSQL, which blocks all reads and writes to the table. This was unexpected: autovacuum is normally non-blocking.
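During an event like this, the lock is visible in pg_locks; joining to pg_stat_activity confirms whether the holder is an autovacuum worker (backend_type 'autovacuum worker'). A sketch of such a diagnostic query, with the table name assumed:

```python
# List sessions holding or waiting on an ACCESS EXCLUSIVE lock on the table.
# An autovacuum-driven truncation shows up with backend_type 'autovacuum worker'.
# The table name is illustrative.
import psycopg2

LOCK_QUERY = """
SELECT l.pid, l.mode, l.granted, a.backend_type, a.query
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE l.relation = %s::regclass
  AND l.mode = 'AccessExclusiveLock'
"""

def access_exclusive_holders(dsn: str, table: str = "auth_events"):
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(LOCK_QUERY, (table,))
            return cur.fetchall()
```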
By default, PostgreSQL enables heap truncation, but it rarely occurs and is generally harmless. It only becomes problematic on very busy tables, like ours, where even a brief exclusive lock can significantly disrupt live traffic.
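For reference, PostgreSQL 12 and later allow truncation to be disabled per table via the vacuum_truncate storage parameter, or skipped for a single run with VACUUM (TRUNCATE false). The snippet below is a sketch of that option rather than a statement of the exact mitigation applied here; the table name is again assumed.

```python
# Disable autovacuum heap truncation for a single hot table (PostgreSQL 12+).
# This trades some unreclaimed disk space at the end of the table for never
# taking the ACCESS EXCLUSIVE lock that truncation requires.
import psycopg2

def disable_heap_truncation(dsn: str, table: str = "auth_events") -> None:
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            # Table name must come from a trusted source; identifiers cannot
            # be passed as query parameters.
            cur.execute(f"ALTER TABLE {table} SET (vacuum_truncate = off)")
    # A one-off manual vacuum can also skip truncation explicitly:
    #   VACUUM (TRUNCATE FALSE) auth_events;
```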
Mitigations
Conclusion
These incidents were caused by an unintended interaction between large-scale data retention and PostgreSQL's autovacuum behavior on critical tables. Remediation steps are now in place to prevent recurrence, and process adjustments are underway to better assess such risks before future retention rollouts.