Total Downtime: 4 hrs 10 min (3:00am - 7:10am PST / 6:00am - 10:10am EST), including a core Dashboard outage of 2 hrs 10 min
All times in EST.
6:00: Pingdom reports API and Dashboard as down in #alerts
6:15: PagerDuty alerts oncall engineer
6:25: PagerDuty escalates to management
6:45: Management acknowledges the incident in PagerDuty to prevent further escalation
7:00: CS escalates to lead engineers via text and Slack
7:15: Lead engineer confirms all API instances are unreachable by the load balancer and is unable to fetch Elastic Beanstalk logs
7:20: Lead engineer confirms instances are accessible via SSH
7:25: Lead engineer is unable to restart app servers for production API: "Unable to execute command on instances"
7:30: Lead engineer creates a clone of the production environment while continuing to investigate the outage
7:45: Lead engineer observes the Dashboard receiving the error "Server Unavailable: Back-end server is at capacity"
8:00: Cloned environment is set up; DNS entry for api.raydiant.com is switched to point to the clone (a sketch of this step follows the timeline)
8:05: Engineering and CS confirm the Dashboard is working, users can log in, and the API is reachable by devices
8:30: Lead engineer observes file uploads are not working and devices are unable to connect to PubSub with "403: Unexpected Server Response"
8:45: Lead engineer rebuilds the production environment and upgrades the instances from m1.medium to m1.large (see the upgrade sketch after the timeline)
9:05: Production environment rebuilt; lead engineer observes devices are able to connect to PubSub, file uploads are working, and customers can log into the Dashboard again
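For reference, minimal sketches of the two remediation steps above, assuming both were (or could be) driven through the AWS APIs. The hosted zone ID and clone CNAME below are hypothetical placeholders; the environment name is taken from the Health screenshot further down.

```python
# Hypothetical sketch of the 8:00 DNS switch: repoint api.raydiant.com
# at the cloned environment. Zone ID and clone CNAME are placeholders.
import boto3

route53 = boto3.client("route53")
route53.change_resource_record_sets(
    HostedZoneId="Z0000000EXAMPLE",  # placeholder zone for raydiant.com
    ChangeBatch={
        "Comment": "Incident: point API at cloned environment",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "api.raydiant.com.",
                "Type": "CNAME",
                "TTL": 60,  # short TTL so the switch takes effect quickly
                "ResourceRecords": [
                    {"Value": "mira-api-clone.us-east-1.elasticbeanstalk.com"}
                ],
            },
        }],
    },
)
```

And the 8:45 instance upgrade; changing the launch configuration's instance type replaces the running instances, which also discards any instance whose disk had filled:

```python
# Hypothetical sketch of the 8:45 upgrade from m1.medium to m1.large.
import boto3

eb = boto3.client("elasticbeanstalk")
eb.update_environment(
    EnvironmentName="mira-api-production",
    OptionSettings=[{
        "Namespace": "aws:autoscaling:launchconfiguration",
        "OptionName": "InstanceType",
        "Value": "m1.large",
    }],
)
```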
Best guess is that one or more instances ran out of disk space, causing the API errors
Difficult to confirm, as the problematic instances were torn down during the instance upgrade
7:20 EST: All API instances unreachable (Elastic Beanstalk → mira-api-production → Health).
9:30 EST: Devices unable to reach PubSub due to API errors starting at 6:00 EST. The error-rate drop-off at 9am corresponds to the production API coming back up after the rebuild. Some devices are still getting the WS error.
10:30 EST: Observed no more WS errors starting just before 10am EST.
Filesystem usage (%) of an EB instance. Usage began to grow exponentially around 5am EST (2am PST).
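EB instances don't publish disk metrics to CloudWatch by default, which is part of why the fill-up above went unnoticed. A minimal sketch of a cron-able check that publishes filesystem usage as a custom metric (the namespace and dimension names are hypothetical, not existing Raydiant conventions):

```python
# Hypothetical sketch: publish root-filesystem usage as a custom CloudWatch
# metric so an alarm can fire before the disk fills.
import shutil
import boto3

def report_disk_usage(path: str = "/") -> float:
    usage = shutil.disk_usage(path)
    percent_used = 100.0 * usage.used / usage.total
    boto3.client("cloudwatch").put_metric_data(
        Namespace="Custom/ElasticBeanstalk",  # placeholder namespace
        MetricData=[{
            "MetricName": "DiskUsedPercent",
            "Dimensions": [{"Name": "Environment", "Value": "mira-api-production"}],
            "Value": percent_used,
            "Unit": "Percent",
        }],
    )
    return percent_used

if __name__ == "__main__":
    print(f"disk used: {report_disk_usage():.1f}%")
```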
Pingdom healthcheck for the API reports it as down, escalates as severe on the 1st pass, and sends an alert to Slack
Pingdom "synthetic transaction" fails (unable to log in to the Dashboard), escalates as severe on the 2nd pass, and triggers PagerDuty escalation
* No alerts on disk usage, or the alerts didn't notify the right channels (Slack, PagerDuty); see the disk-usage metric sketch above
* Work wasn't prioritized
* "Network out" scaling triggered is set way too low \(2mb → 10mb\)