Post-mortem: Database Cluster Crashes

by the OBS Team posted on 3rd Aug 2023

Downtime on the afternoon of 3rd of August

On 3 August, a few hours after a large migration was performed within the maintenance window earlier that day, we experienced multiple downtimes while recovering from database inconsistencies.

Date: 03.08.2023

Impact: Multiple downtimes throughout the day.

Root Causes: Our database cluster ran out of available space during a large schema/data migration (#14597 - Migrate the remaining database tables and columns to utf8mb4).
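A pre-flight check along the following lines might have flagged the space problem before the migration started. This is a minimal sketch, not what we ran: it assumes that converting a table's character set rebuilds the table as a copy, so the largest table needs roughly its own size in free space while it is converted. The `TableSize` type, the `has_headroom` function, and the `growth_factor` safety margin are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class TableSize:
    """Size of one table in bytes (hypothetical helper type)."""
    name: str
    data_bytes: int
    index_bytes: int

    @property
    def total_bytes(self) -> int:
        return self.data_bytes + self.index_bytes


def has_headroom(tables, free_bytes, growth_factor=1.5):
    """Return (ok, required_bytes) for rebuilding tables one at a time.

    Assumption: each rebuild temporarily needs a full copy of the table,
    and growth_factor pads that for wider utf8mb4 columns and indexes.
    """
    if not tables:
        return True, 0
    largest = max(table.total_bytes for table in tables)
    required = int(largest * growth_factor)
    return free_bytes >= required, required


if __name__ == "__main__":
    GiB = 1024 ** 3
    # Illustrative numbers only, not measurements from our cluster.
    sizes = [TableSize("example_table", 80 * GiB, 20 * GiB)]
    ok, required = has_headroom(sizes, free_bytes=120 * GiB)
    print("enough space" if ok else f"need at least {required // GiB} GiB free")
```

The table sizes would have to come from the database itself, and the margin should be agreed with the database admins rather than taken from this sketch.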

Trigger: Morning deployment and migration from utf8mb3 to utf8mb4.

Resolution: The tables were dumped and restored from scratch.

Detection: Our database admins got notified via their monitoring.

Lessons Learned

What went well?

We learned about the database crash soon after it happened.

What went wrong?

Our current deployment process does not log the progress of migrations or tell us in real time what is happening (improvement card).
We did not communicate this migration to our database admins in advance, so they were not aware of the potential fallout.
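To illustrate the kind of progress logging we were missing, here is a minimal sketch of a step runner that records when each migration step starts, finishes, or fails. It is not our deployment tooling: `run_steps`, the `migration` logger name, and the step names are hypothetical, and each step is assumed to be a plain callable.

```python
import logging
import time

logger = logging.getLogger("migration")  # hypothetical logger name


def run_steps(steps, log=logger):
    """Run (name, callable) steps in order, logging start, end, and duration.

    Returns the names of the steps that completed. A failing step is
    logged with its traceback and the exception is re-raised, so the
    remaining steps do not run.
    """
    completed = []
    total = len(steps)
    for position, (name, step) in enumerate(steps, start=1):
        log.info("[%d/%d] starting %s", position, total, name)
        started = time.monotonic()
        try:
            step()
        except Exception:
            log.exception("[%d/%d] failed %s", position, total, name)
            raise
        elapsed = time.monotonic() - started
        log.info("[%d/%d] finished %s in %.1fs", position, total, name, elapsed)
        completed.append(name)
    return completed


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
    # Placeholder steps; real ones would run the actual migrations.
    run_steps([("convert example_table", lambda: None)])
```

With timestamps per step, an error in the middle of a migration would be visible in the log as a step that started but never finished.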

Where did we get lucky?

Only four tables ended up being affected.
Our database admins were around to help us get the database back to a usable state.

Timeline (CEST)

09:03 Started the deployment with the migration
09:26 Ended the deployment
09:31 First recorded error in the index of project_log_entries table
13:37 Database cluster crashes
14:04 Build Service goes into downtime to export project_log_entries table
14:14 Started project_log_entries table import
14:59 Build Service comes back from downtime
17:40 Database cluster crashes again
17:56 We learn that the index of the binary_releases table is broken
18:14 Started binary_releases table export without downtime
18:18 Started binary_releases table import
18:46 Finished import
18:50 We start performing CHECK TABLE on the rest of the tables in the database
19:08 We find out that the bs_request_actions table is broken and take Build Service down for maintenance
19:16 Build Service comes back up after all the tables have gone through CHECK TABLE
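The final CHECK TABLE pass over every table can be scripted. The following is a minimal sketch under stated assumptions, not the procedure our database admins used: it only builds the statements and interprets result rows, which are assumed to arrive as dictionaries with the `Table`, `Msg_type`, and `Msg_text` columns that CHECK TABLE returns. The function names and the sample rows are hypothetical.

```python
def check_table_statements(table_names):
    """Build one CHECK TABLE statement per table, quoting names with backticks."""
    return [
        "CHECK TABLE `{}`".format(name.replace("`", "``"))
        for name in table_names
    ]


def broken_tables(result_rows):
    """Return tables that reported an error or a status other than OK.

    Assumption: a healthy table yields a row with Msg_type 'status' and
    Msg_text 'OK'; anything else in a status or error row needs a look.
    """
    broken = []
    for row in result_rows:
        msg_type = row["Msg_type"].lower()
        is_ok = msg_type == "status" and row["Msg_text"] == "OK"
        is_problem = msg_type == "error" or (msg_type == "status" and not is_ok)
        if is_problem and row["Table"] not in broken:
            broken.append(row["Table"])
    return broken


if __name__ == "__main__":
    for statement in check_table_statements(["project_log_entries", "binary_releases"]):
        print(statement)
    # Made-up result rows for illustration; "db" is a placeholder schema name.
    rows = [
        {"Table": "db.project_log_entries", "Op": "check", "Msg_type": "status", "Msg_text": "OK"},
        {"Table": "db.binary_releases", "Op": "check", "Msg_type": "error", "Msg_text": "placeholder error text"},
    ]
    print(broken_tables(rows))
```

Running such a check right after a large migration, rather than after the second crash, would have surfaced the broken indexes earlier.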