Post-mortem: Database Cluster Crashes
Downtime on the afternoon of 3rd of August
On 3. August, a few hours after a large migration performed within the maintenance window earlier that day, we experienced multiple downtimes while recovering from database inconsistencies.
Impact: Multiple downtimes throughout the day.
Root Causes: Our database cluster ran out of available space during a large schema/data migration (#14597 - Migrate the remaining database tables and columns to utf8mb4)
Trigger: Morning deployment and migration from utf8mb3 to utf8mb4.
Resolution: The tables were dumped and restored from scratch.
Detection: Our database admins got notified via their monitoring.
What went well?
- We learned about the database crash soon after it happened.
What went wrong?
- The current way we deploy with migration does not log the progress of migrations or inform us about things happening in real time (improvement card).
- We did not communicate this migration with our database admins in advance to make them aware of potential fallout.
Where we got lucky?
- Only four tables ended up being affected.
- Our database admins where around to help us in getting the database back to usable state.
- 09:03 Started the deployment with the migration
- 09:26 Ended the deployment
- 09:31 First recorded error in the index of
- 13:37 Database cluster crashes
- 14:04 Build Service goes into downtime to export
- 14:14 Started
- 14:59 Build Service comes back from downtime
- 17:40 Database cluster crashes again
- 17:56 We learn about
binary_releasestable index being broken
- 18:14 Started
binary_releasestable export without downtime
- 18:18 Started
- 18:46 Finished import
- 18:50 We start performing
CHECK TABLEon the rest of the tables in the database
- 19:08 We find out about
bs_request_actionstable being broken and take Build Service down for maintenance
- 19:16 Build Service comes back up after all the tables went through