Production API Down and Dashboard Unavailable
Incident Report for Raydiant
Postmortem

Total downtime: 4 hr 10 min (3:00am - 7:10am PST / 6:00am - 10:10am EST), with a core Dashboard outage of 2 hr 10 min

Timeline

All times in EST.

6:00: Pingdom reports API and Dashboard as down in #alerts

6:15: PagerDuty alerts oncall engineer

6:25: PagerDuty escalates to management

6:45: Management acknowledges the incident in PagerDuty to prevent further escalation

7:00: CS escalates to lead engineers via text and Slack

7:15: Lead engineer confirms all API instances are unreachable by the load balancer and is unable to fetch Elastic Beanstalk logs

7:20: Lead engineer confirms instances are accessible via SSH

7:25: Lead engineer is unable to restart app servers for production API: "Unable to execute command on instances"

7:30: Lead engineer creates a clone of the production environment while continuing to investigate the outage

7:45: Lead engineer observes that the Dashboard receives the error "Server Unavailable: Back-end server is at capacity"

8:00: Cloned environment is set up; DNS entry for api.raydiant.com is switched to point to the clone (see the DNS cutover sketch after the timeline)

8:05: Engineering and CS confirm the Dashboard is working, users can log in, and the API is reachable by devices

8:30: Lead engineer observes file uploads are not working and devices are unable to connect to PubSub with "403: Unexpected Server Response"

8:45: Lead engineer rebuilds the production environment and upgrades the instances from m1.medium to m1.large

9:05: Production environment is rebuilt; lead engineer observes devices are able to connect to PubSub, file uploads are working, and customers can log into the Dashboard again
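
For reference, a DNS cutover like the one at 8:00 can be scripted rather than clicked through in the console. Below is a minimal sketch using boto3's Route 53 API; the hosted zone ID and the clone environment's CNAME are hypothetical placeholders, and the real record setup in our account may differ.

```python
import boto3

route53 = boto3.client("route53")

# Hypothetical values -- substitute the real hosted zone ID and clone CNAME.
HOSTED_ZONE_ID = "Z0000000EXAMPLE"
CLONE_CNAME = "mira-api-production-clone.us-west-2.elasticbeanstalk.com"

# Repoint api.raydiant.com at the cloned Elastic Beanstalk environment.
route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={
        "Comment": "Emergency cutover to cloned EB environment",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "api.raydiant.com.",
                "Type": "CNAME",
                "TTL": 60,
                "ResourceRecords": [{"Value": CLONE_CNAME}],
            },
        }],
    },
)
```

If api.raydiant.com already points at the EB environment URL, Elastic Beanstalk's swap_environment_cnames operation is an alternative that avoids touching DNS records directly.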

Root Cause

  • All 3 API instances were unreachable by the ELB
  • Max instances is set to 3, so no more instances could be created
  • Best guess is that one or more instances ran out of disk space, causing API errors

    • API errors caused PubSub errors on devices
    • Devices trying to reconnect to PubSub caused more errors and thus more logs, causing all instances to run out of disk space
  • Difficult to determine, as the problematic instances were torn down during the instance upgrade

    • Next time, detach a problematic instance from the ELB for debugging before rebuilding the environment; EB will also spin up a new instance when one is removed (see the sketch below).
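
A minimal sketch of that rotation using boto3 is below, assuming the environment uses the default EB-managed Auto Scaling group; the instance ID and group name are hypothetical placeholders.

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Hypothetical identifiers -- look these up from the EB environment's resources.
FAILING_INSTANCE_ID = "i-0123456789abcdef0"
EB_ASG_NAME = "awseb-e-example-stack-AWSEBAutoScalingGroup-EXAMPLE"

# Detaching from the Auto Scaling group also deregisters the instance from the
# group's load balancer, so it stops taking traffic but keeps running (and keeps
# its logs and disk state) for debugging over SSH. Because desired capacity is
# not decremented, the group launches a healthy replacement automatically.
autoscaling.detach_instances(
    InstanceIds=[FAILING_INSTANCE_ID],
    AutoScalingGroupName=EB_ASG_NAME,
    ShouldDecrementDesiredCapacity=False,
)
```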

Graphs

https://ibb.co/B30R79v

7:20 EST: All API instances unreachable (Elastic Beanstalk → mira-api-production → Health).

https://ibb.co/jwRkFQs

9:30 EST: Devices were unable to reach PubSub due to API errors starting at 6:00 EST. The error-rate drop-off at 9am corresponds to the production API coming back up after the rebuild. Some devices were still getting the WS error.

https://ibb.co/tmGmX7N

10:30 EST: Observed no more WS errors starting just before 10am EST.

https://ibb.co/vXK5R7v

Filesystem usage (%) of an EB instance. Disk usage started to increase exponentially around 5am EST (2am PST).

https://ibb.co/NrMHJhp

Pingdom health check for the API alerts as dead, escalates as severe on the 1st pass, and sends an alert to Slack.

https://ibb.co/n80GzPD

Pingdom "synthetic transaction" craters, unable to log in to dashboard, escalates as severe 2nd pass, triggers PagerDuty escalation

Five Whys

  1. Why didn't we know things were getting bad before the outage?
    • There were no alerts on disk usage, or the alerts that existed didn't notify the right channels (Slack, PagerDuty)
  2. Why didn't Elastic Beanstalk auto-recreate the instance?
    • A CloudWatch alarm (with proper health checks) was triggered and notified an SNS group that should email engineering@getmira.com, but nothing was sent because the topic had no subscribers (see the sketch after this list)
    • Max instances is set to 3
  3. Why did we never test disk space failures?
    • The work wasn't prioritized
  4. Why is EB scaling to the max number of instances?
    • The "Network out" scaling trigger is set way too low (2mb → 10mb)

Action Items

  • Adjust EB scaling parameters by 10x (existing scaling limits are too low); see the sketch after this list
  • Document and train on rotating failing instances out of the ELB so we can retain logs
  • Adjust/increase instance type for more storage (maybe use EBS)
  • Connect existing CloudWatch alarms to Pingdom
  • Create a CloudWatch dashboard (add to Notion links page)
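
As a sketch of the first action item, scaling can be adjusted through Elastic Beanstalk option settings. The environment name comes from the health console screenshot above; the threshold and max size values are illustrative, not the final tuning.

```python
import boto3

eb = boto3.client("elasticbeanstalk")

# Raise the NetworkOut scale-up trigger and the instance cap. Values are
# illustrative placeholders, not the final production tuning.
eb.update_environment(
    EnvironmentName="mira-api-production",
    OptionSettings=[
        {"Namespace": "aws:autoscaling:trigger", "OptionName": "MeasureName", "Value": "NetworkOut"},
        {"Namespace": "aws:autoscaling:trigger", "OptionName": "Unit", "Value": "Bytes"},
        {"Namespace": "aws:autoscaling:trigger", "OptionName": "UpperThreshold", "Value": "10000000"},
        {"Namespace": "aws:autoscaling:asg", "OptionName": "MaxSize", "Value": "6"},
    ],
)
```
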
Posted Nov 21, 2019 - 21:06 UTC

Resolved
This incident has been resolved.
Posted Nov 21, 2019 - 15:10 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Nov 21, 2019 - 14:16 UTC
Identified
We have identified the cause of this issue and are testing a fix.
Posted Nov 21, 2019 - 13:30 UTC
Investigating
The Raydiant production API is down, preventing users from logging into or taking any action in the Raydiant Dashboard. We are investigating the cause of this outage.
Posted Nov 21, 2019 - 12:07 UTC
This incident affected: CX Services (Production Core API) and CX Websites (Production Dashboard).