Post-Mortem: Rack Gem Version Mismatch on May 31, 2022

Our reference server suffered a severe service degradation. On 2022-05-31 a deployment of OBS failed and caused downtime. We want to give you some insight into what happened.

Impact

Our reference server was offline for 27 minutes. No one was able to work with the API or user interface during that time. Other services depending on OBS (like https://software.opensuse.org) were taken down by this as well.

Root Causes

Our deployment is based on the passenger app server. We deploy this gem to the system via an RPM package. The passenger gem requires another Ruby gem: rack. We also deploy this dependency to the system via an RPM package.

Now to make things complicated, the OBS application also requires the ruby gem rack to function. We deploy the ruby gems the OBS application requires to an application-specific directory (/usr/lib64/obs-api).

If you have been paying attention to the post-mortem from the beginning of this month, you probably know where this is heading.

On the day before the deployment, we updated the Ruby gem rack in our bundle and in our OBS repository. Then we deployed the OBS application bundle. But we did not update the system Ruby gem rack. This led to a mismatch between the rack version passenger loaded from the system (2.2.3) and the rack version the application needed for its bundle (2.2.3.1). Both versions cannot be loaded at the same time, so passenger was unable to boot the OBS application.
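The mismatch can be illustrated with the version classes that ship with RubyGems. This is only a minimal sketch of the comparison, not what passenger or Bundler run internally, and the helper name satisfies_pin? is made up for illustration. The version numbers are the ones from this incident.

```ruby
require "rubygems"

# A Gemfile.lock entry pins one exact version, so it behaves like an
# "= x.y.z" requirement. satisfies_pin? is a hypothetical helper name.
def satisfies_pin?(requirement, version)
  requirement.satisfied_by?(version)
end

# The rack version the application bundle was locked to.
bundle_pin = Gem::Requirement.new("= 2.2.3.1")

# The system rack that passenger had already loaded before the fix.
puts satisfies_pin?(bundle_pin, Gem::Version.new("2.2.3"))

# The system rack after updating the RPM package.
puts satisfies_pin?(bundle_pin, Gem::Version.new("2.2.3.1"))
```

The first call prints false and the second prints true: even a version that differs only in the fourth segment does not satisfy an exact pin.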

Trigger

Deploying changes to production.

Detection

The deployment showed failures, we received alerts from our monitoring system, and users informed us via different channels.

Resolution

Updating the rack package (zypper up ruby3.1-rubygem-rack)

Action Items

Lessons Learned

What Went Well

Collaboration among the team to resolve this.

What Went Wrong

We haven’t resolved all the action items from the strscan incident yet. Bringing the CI Ruby gem setup closer to production might have saved us.

There was a handover between operators on Monday (the day the changes got implemented) and Tuesday (the day the changes got deployed). This didn’t go too well.

  • On Monday evening we did not block the deployment or inform people, even though we were aware of possible problems
  • On Tuesday we deployed without checking in with the previous operator

Where We Got Lucky

No permanent damage or data loss.

Timeline (times in UTC)

  • 08:33 Start the deployment, which turns out to be unsuccessful.
  • 08:35 Deployment failed.
  • 08:36 Detect the server is down.
  • 08:40 Receive alerts from our monitoring system.
  • 08:44 Inform people on different channels.
  • 08:51 Try to restart Apache.
  • 08:55 See the message in the Passenger log: “You have already activated rack 2.2.3, but your Gemfile requires rack 2.2.3.1. Prepending bundle exec to your command may solve this. (Gem::LoadError)”
  • 08:58 Change the rack gem version in Gemfile.lock and restart Apache. We had the impression that the problem wasn’t solved, but it only needed time to take effect.
  • 09:01 Update the package in the system: zypper up ruby3.1-rubygem-rack.
  • 09:02 Change the rack gem version in Gemfile.lock back and restart Apache. OBS is back.