Incidents | Latitude
Incidents reported on status page for Latitude
https://status.latitude.so/

ECS task failures and service disruption due to ECR image expiry
https://status.latitude.so/incident/866549
Tue, 07 Apr 2026 19:58:00 -0000
https://status.latitude.so/incident/866549#08111b72664f985e5860d2ed1519053a487188d350ab01857f15bd6c68eec10a

## Executive summary

A prolonged gap in production image deployments meant our ECR lifecycle policy (seven-day retention) expired container images that were still referenced by active ECS task definitions. When ECS later attempted to replace tasks (routine churn) or scale services, new tasks failed to start because the required image digests/tags were no longer available in ECR. This resulted in service disruption and customer-facing downtime.

Detection was delayed because third-party monitoring alerts were automatically paused with little notice, and the incident was first identified via user reports. Service was restored by rebuilding and pushing a valid image to the affected repositories and redeploying the impacted services. We have removed automated ECR image expiry from the production application stacks to prevent recurrence of this failure mode.

---

## Impact

The Latitude web app was unavailable from 7:51pm to 9:47pm CEST. All other Latitude services, including the Latitude API and background jobs, were unaffected by this issue.

---

## Root cause

ECR lifecycle policies were configured to expire images based on age (seven days since push). While the product team was focused on an application rewrite, the normal release cadence paused and no new images were published within that window. The lifecycle rules removed the only (and still-referenced) images in the repository. ECS then created new tasks (for example, due to autoscaling, deployments, or routine replacement).
Those tasks referenced image URIs that pointed at content ECR had deleted. The scheduler could not satisfy the task definition's image requirement, so tasks failed to start, reducing healthy capacity below what the load balancer or service required and causing downtime or severe degradation.

---

## Contributing factors

1. Tight retention vs. deploy cadence: A fixed seven-day expiry assumed continuous or frequent pushes. Under normal operations this is fine, but a deliberate pause in releases violated that implicit assumption.
2. Coupling of scale/replace to image availability: Any event that starts new tasks depends on artifact availability in the registry, not only on cluster health.
3. Limited visibility: There was no alerting (or insufficient alerting) on impending image expiry or task pull failures, which could have provided an early signal before user impact.
4. Third-party monitoring gap: Alerts from an external monitoring provider were automatically paused with little advance notice, removing a primary detection path at the worst possible time and delaying awareness until customers reported the outage.
5. Operational gap: There was no documented runbook for long-lived branches or paused deploys that called out registry retention as a risk.
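The rule class at the heart of this incident is an age-based ECR lifecycle policy of roughly the following shape (a sketch, not our exact production policy): any image older than seven days since push is expired, regardless of whether an ECS task definition still references it.

```json
{
  "rules": [
    {
      "rulePriority": 1,
      "description": "Expire images 7 days after push (illustrative)",
      "selection": {
        "tagStatus": "any",
        "countType": "sinceImagePushed",
        "countUnit": "days",
        "countNumber": 7
      },
      "action": { "type": "expire" }
    }
  ]
}
```

ECR evaluates lifecycle rules against push timestamps only; it has no knowledge of which images ECS task definitions still reference, which is why removing the policy entirely (for example via `aws ecr delete-lifecycle-policy --repository-name <repo>`) was the chosen remediation.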
app.latitude.so/api/health recovered
https://status.latitude.so/
Tue, 07 Apr 2026 19:46:55 +0000
https://status.latitude.so/#5e869d135e9a4dabb2bb8b165d2f054a041de34e04bf843f235460e13b495ea8

app.latitude.so/api/health went down
https://status.latitude.so/
Tue, 07 Apr 2026 19:41:20 +0000
https://status.latitude.so/#5e869d135e9a4dabb2bb8b165d2f054a041de34e04bf843f235460e13b495ea8

May 30th, 2025 – Postmortem
https://status.latitude.so/incident/594759
Sat, 31 May 2025 18:06:00 -0000
https://status.latitude.so/incident/594759#4cb588adf7d9330510f4921f14adc35d7781bacc4981ea0117cb773eb763ffda

## Incident Summary

On 30 May 2025 at 18:22 CEST, the Gateway proxy experienced an outage that lasted two hours and twenty-two minutes. During this interval all customer traffic routed through the proxy failed, resulting in a complete loss of service.

## Root Cause

The incident was triggered by an unexpected surge in incoming requests that drove a corresponding spike in outbound calls to external LLM providers, primarily OpenAI. Although the autoscaler immediately began launching additional Gateway instances, the increased egress traffic exhausted the outbound bandwidth of the NAT Gateway that serves the cluster. Each new instance had to download its Docker image through the same congested link, causing image pulls that normally complete in under a minute to stretch beyond twenty minutes. Because capacity scaled up so slowly, the existing Gateway instances became overwhelmed and unhealthy, which led to a total shutdown of the service.

## Resolution

Once engineers confirmed that NAT egress saturation was the bottleneck, they first detached the Gateway service from incoming traffic to prevent further failures. They then replaced the overloaded NAT Gateway with a higher-capacity unit and started a minimal set of fresh Gateway instances behind the new network layer.
After each instance passed its health checks, traffic was gradually reintroduced. Full functionality returned two hours and twenty-two minutes after the initial failure.

## Preventive Measures

The NAT Gateway is now part of an autoscaling group so network capacity can expand automatically with demand. New monitoring thresholds generate alerts whenever NAT egress approaches saturation, giving operators early warning before user impact occurs. In addition, the baseline number of Gateway instances has been increased to provide greater headroom during sudden traffic spikes.

We apologise for the disruption and remain committed to strengthening our infrastructure to prevent similar incidents in the future.

Cache & Database upgrade
https://status.latitude.so/incident/582904
Sat, 31 May 2025 09:15:00 -0000
https://status.latitude.so/incident/582904#acd736369ad51ed96139f510bde5f8df47275db7900010720f9ebc63a14a03d2

Maintenance completed

Cache & Database upgrade
https://status.latitude.so/incident/582904
Sat, 31 May 2025 08:00:00 -0000
https://status.latitude.so/incident/582904#6e30c55ac28bbe3d282542cd50bc2a759d109994710915bea5be4091650e66fa

We are going to upgrade our main database and cache cluster instances with higher CPU and memory in order to accommodate capacity from new customers. This change will cause a few minutes of downtime.
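The feedback loop described in the May 30th postmortem — application egress starving image pulls, which in turn slows the very scale-out meant to relieve the load — can be sketched with a back-of-the-envelope model. Every number and parameter below is hypothetical, chosen only to show how pull time explodes as application traffic approaches the NAT's egress capacity:

```python
def pull_time_seconds(image_gb: float,
                      nat_capacity_gbps: float,
                      app_egress_gbps: float,
                      concurrent_pulls: int) -> float:
    """Rough time for one Docker image pull when pulls share whatever
    NAT egress capacity the application traffic leaves over.

    All figures are illustrative, not measured values from the incident.
    """
    # Bandwidth left over for image pulls after application egress,
    # floored at a small positive value so a saturated link
    # doesn't divide by zero.
    spare_gbps = max(nat_capacity_gbps - app_egress_gbps, 0.01)
    # Assume the spare bandwidth is split evenly across concurrent pulls.
    per_pull_gbps = spare_gbps / concurrent_pulls
    return image_gb * 8 / per_pull_gbps  # GB -> Gbit, then Gbit / (Gbit/s)


# Quiet period: a 1 GB image with ample spare capacity pulls in seconds.
print(pull_time_seconds(1, 5, 1, 4))     # 8.0 seconds
# Saturated NAT during a traffic surge: the same image takes many minutes,
# which is the failure mode the postmortem describes.
print(pull_time_seconds(1, 5, 4.9, 10))  # roughly 800 seconds (~13 minutes)
```

The model also makes the preventive measures legible: raising `nat_capacity_gbps` (the higher-capacity NAT, autoscaled network layer) and raising the instance baseline (fewer simultaneous cold pulls under surge) both directly shrink pull time.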