May 30th, 2025 – Postmortem

May 31 at 08:06pm CEST

Affected services

gateway.latitude.so/health

Resolved
May 31 at 08:06pm CEST

Incident Summary

On 30 May 2025 at 18:22 CEST, the Gateway proxy suffered an outage that lasted two hours and twenty-two minutes, until about 20:44 CEST. During this interval, all customer traffic routed through the proxy failed, resulting in a complete loss of service.

Root Cause

The incident was triggered by an unexpected surge in incoming requests, which drove a corresponding spike in outbound calls to external LLM providers, primarily OpenAI. The autoscaler immediately began launching additional Gateway instances, but the increased egress traffic had exhausted the outbound bandwidth of the NAT Gateway that serves the cluster. Each new instance had to download its Docker image through the same congested link, so image pulls that normally complete in under a minute took more than twenty minutes. Because new capacity came online so slowly, the existing Gateway instances became overwhelmed and unhealthy, and the service went down completely.
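
To illustrate why a saturated link stretches image pulls so dramatically, the following back-of-the-envelope sketch estimates pull time from the bandwidth left over after other traffic. The image size, link capacity, and utilisation figures are hypothetical values chosen for illustration, not measurements from the incident.

```python
def pull_time_seconds(image_size_mb: float, link_mbps: float, competing_share: float) -> float:
    """Estimate how long an image pull takes on a shared link.

    competing_share is the fraction of the link (0.0 up to, but not
    including, 1.0) already consumed by other traffic.
    """
    if not 0.0 <= competing_share < 1.0:
        raise ValueError("competing_share must be in [0.0, 1.0)")
    available_mbps = link_mbps * (1.0 - competing_share)
    # Convert megabytes to megabits (x8), then divide by the free bandwidth.
    return image_size_mb * 8 / available_mbps


if __name__ == "__main__":
    # Hypothetical figures: a 500 MB image on a 100 Mbit/s link.
    idle = pull_time_seconds(500, 100, 0.0)
    congested = pull_time_seconds(500, 100, 0.97)
    print(f"idle link: {idle:.0f} s, 97% saturated link: {congested / 60:.1f} min")
```

Under these assumed numbers, a pull that takes well under a minute on an idle link takes over twenty minutes once other traffic consumes 97% of the link, which is the same order of slowdown described above.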

Resolution

Once engineers confirmed that NAT egress saturation was the bottleneck, they first detached the Gateway service from incoming traffic to prevent further failures. They then replaced the overloaded NAT Gateway with a higher-capacity unit and started a minimal set of fresh Gateway instances behind the new network layer. After each instance passed its health checks, traffic was gradually reintroduced. Full functionality returned two hours and twenty-two minutes after the initial failure.
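
The gradual reintroduction of traffic can be sketched as a stepped ramp gated on health checks. This is a minimal illustration and not the tooling used during the incident: the step percentages and the `all_healthy` and `set_traffic_percent` callbacks are hypothetical stand-ins for whatever health-check and load-balancer controls are actually available.

```python
from typing import Callable, Iterable, List


def ramp_traffic(
    steps: Iterable[int],
    all_healthy: Callable[[], bool],
    set_traffic_percent: Callable[[int], None],
) -> List[int]:
    """Raise the traffic share step by step, pausing if instances are unhealthy.

    Returns the list of percentages that were actually applied. The ramp
    stops at the first failed health check and holds at the last applied step.
    """
    applied: List[int] = []
    for percent in steps:
        if not all_healthy():
            break  # Hold at the previous step rather than adding more load.
        set_traffic_percent(percent)
        applied.append(percent)
    return applied


if __name__ == "__main__":
    # Hypothetical ramp: 10% -> 25% -> 50% -> 100%, with always-healthy instances.
    ramp_traffic([10, 25, 50, 100], lambda: True, lambda p: print(f"traffic at {p}%"))
```

Checking health before each increase keeps a recovering fleet from being pushed straight back into overload.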

Preventive Measures

The NAT Gateway is now part of an autoscaling group so network capacity can expand automatically with demand. New monitoring thresholds generate alerts whenever NAT egress approaches saturation, giving operators early warning before user impact occurs. In addition, the baseline number of Gateway instances has been increased to provide greater headroom during sudden traffic spikes.
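
As a rough sketch of how a saturation alert of this kind might be evaluated, the function below classifies NAT egress utilisation against two thresholds. The 70% and 90% thresholds and the function name are assumptions made for illustration. The actual thresholds and monitoring system are not specified here.

```python
def nat_egress_alert(
    current_mbps: float,
    capacity_mbps: float,
    warn_at: float = 0.70,      # hypothetical warning threshold
    critical_at: float = 0.90,  # hypothetical critical threshold
) -> str:
    """Classify NAT egress utilisation as 'ok', 'warning', or 'critical'."""
    if capacity_mbps <= 0:
        raise ValueError("capacity_mbps must be positive")
    utilisation = current_mbps / capacity_mbps
    if utilisation >= critical_at:
        return "critical"
    if utilisation >= warn_at:
        return "warning"
    return "ok"


if __name__ == "__main__":
    # Hypothetical readings against a 1000 Mbit/s egress capacity.
    for reading in (500, 750, 950):
        print(reading, nat_egress_alert(reading, 1000))
```

A warning level set well below full saturation is what gives operators time to react before users are affected.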

We apologise for the disruption and remain committed to strengthening our infrastructure to prevent similar incidents in the future.