ECS task failures and service disruption due to ECR image expiry

Apr 7, 2026 at 7:58pm UTC
Affected services
app.latitude.so/api/health

Resolved
Apr 7, 2026 at 7:58pm UTC

Executive summary

A prolonged gap in production image deployments meant our ECR lifecycle policy (seven-day retention) expired container images that were still referenced by active ECS task definitions. When ECS later attempted to replace tasks (routine churn) or scale services, new tasks failed to start because the required image digests/tags were no longer available in ECR. This resulted in service disruption and customer-facing downtime.

Detection was delayed because third-party monitoring alerts were automatically paused with little notice, and the incident was first identified via user reports.

Service was restored by rebuilding and pushing a valid image to the affected repositories and redeploying the impacted services. We have removed automated ECR image expiry from the production application stacks to prevent recurrence of this failure mode.


Impact

The Latitude web app was unavailable from 7:51pm to 9:47pm CEST. All other Latitude services, including the Latitude API and background jobs, were unaffected by this issue.


Root cause

ECR lifecycle policies were configured to expire images based on age (seven days since push). While the product team was focused on an application rewrite, the normal release cadence paused and no new images were published within that window. As a result, the lifecycle rules deleted images that were still referenced by active ECS task definitions, including the only images remaining in the repository.
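For illustration, an age-based rule of this shape produces the behavior described above (the values and description here are assumed; the exact policy was not preserved in this report). Note that ECR evaluates only image age, with no awareness of whether a task definition still references the image:

```json
{
  "rules": [
    {
      "rulePriority": 1,
      "description": "Expire images older than 7 days",
      "selection": {
        "tagStatus": "any",
        "countType": "sinceImagePushed",
        "countUnit": "days",
        "countNumber": 7
      },
      "action": { "type": "expire" }
    }
  ]
}
```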

ECS then created new tasks (for example, due to autoscaling, deployments, or routine replacement). Those tasks referenced image URIs pointing at content ECR had deleted. Image pulls failed, so the tasks could not start, reducing healthy capacity below what the load balancer or service required and causing downtime or severe degradation.
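The mismatch at the heart of this failure can be expressed as a simple set difference between the image URIs that task definitions reference and the images actually present in the registry. This is a minimal sketch; all names are illustrative, and in production the two sets would come from the ECS and ECR APIs rather than literals:

```python
def missing_images(referenced_uris, available_uris):
    """Return image URIs that task definitions reference but the registry no
    longer holds; any such URI means new tasks will fail to start."""
    return sorted(set(referenced_uris) - set(available_uris))

# Example: the repository held a single image, and the lifecycle policy
# expired it while it was still referenced by an active task definition.
referenced = ["123456789012.dkr.ecr.eu-west-1.amazonaws.com/web-app:v42"]
available = []  # everything older than seven days was expired

print(missing_images(referenced, available))
```

A periodic job running this comparison would have flagged the problem before ECS next tried to place a task.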


Contributing factors

  1. Tight retention vs. deploy cadence: A fixed seven-day expiry assumed continuous or frequent pushes. Under normal operations this is fine, but a deliberate pause in releases violated that implicit assumption.
  2. Coupling of scale/replace to image availability: Any event that starts new tasks depends on artifact availability in the registry, not only on cluster health.
  3. Limited visibility: There was no alerting on impending image expiry or on task pull failures, either of which could have served as an early signal before user impact.
  4. Third-party monitoring gap: Alerts from an external monitoring provider were automatically paused with little advance notice, removing a primary detection path at the worst time and delaying awareness until customers reported the outage.
  5. Operational gap: No documented runbook for long-lived branches / paused deploys that called out registry retention as a risk.
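The visibility gap in factor 3 could be closed with a scheduled check that warns before an age-based policy deletes anything. A minimal sketch, assuming image push timestamps are known (e.g., from the registry); the retention and warning thresholds and all names are illustrative:

```python
from datetime import datetime, timedelta, timezone

def images_near_expiry(pushed_at_by_tag, retention_days=7, warn_days=2, now=None):
    """Return tags whose images an age-based lifecycle policy will expire
    within `warn_days`, so an alert can fire before the registry loses them."""
    now = now or datetime.now(timezone.utc)
    warn_threshold = timedelta(days=retention_days - warn_days)
    return sorted(tag for tag, pushed in pushed_at_by_tag.items()
                  if now - pushed >= warn_threshold)

# Example: an image pushed six days ago is within two days of a seven-day
# expiry and should trigger a warning; a one-day-old image should not.
now = datetime(2026, 4, 7, 19, 58, tzinfo=timezone.utc)
pushed = {"web-app:v42": now - timedelta(days=6),
          "web-app:v43": now - timedelta(days=1)}
print(images_near_expiry(pushed, now=now))  # → ['web-app:v42']
```

Pairing a check like this with an alert on failed image pulls would cover both the leading and lagging signals this incident lacked.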