Severe Service Degradation: OBS Unavailable
There was a service degradation of our reference server.
On December 7, 2023, for 35 minutes, OBS responded slowly for anyone trying to use the server, and in many cases connections were dropped completely with the error message: “This website is under heavy load (queue full)”.
We want to give you some insight into what happened and what we are doing to avoid similar problems in the future.
Detection
We were notified through automatic alerts from our monitoring. Additionally, affected people used IRC to tell us about the problem.
Root Cause
An unusually high number of HTTP requests was made to the interconnect API, exhausting our request queue. The majority of these requests came from a single OBS instance.
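The error message users saw is the standard response of an application server whose request queue has overflowed. As a rough sketch, assuming the frontend runs behind Phusion Passenger (whose default error page matches the message quoted above; the values here are illustrative, not our production configuration), the relevant knobs in the Apache vhost would be:

```apache
# Illustrative Passenger tuning, not our actual production values.
# PassengerMaxRequestQueueSize caps how many requests may wait for a
# free application process. Once the queue is full, new clients are
# rejected with "This website is under heavy load (queue full)"
# instead of queueing indefinitely.
PassengerMaxPoolSize 10
PassengerMaxRequestQueueSize 100
```

A bounded queue rejects excess requests early rather than letting latency grow without limit, which matches the mix of slow responses and dropped connections described above.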
Trigger
Whenever a major event (such as a large drop of maintenance updates) happens for the openSUSE Leap and/or SUSE Linux Enterprise distributions, many OBS instances that connect to our reference server via the interconnect feature start to schedule rebuilds against those changes.
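For readers unfamiliar with the interconnect feature: a local OBS instance links to our reference server by creating a remote project whose meta points at the reference server's public API. A minimal sketch of such a project meta (the name and URL follow the usual convention and are shown here for illustration):

```xml
<!-- Remote project on a local OBS instance. Build dependencies
     resolved against this project are fetched from the reference
     server through its public API (the interconnect). -->
<project name="openSUSE.org">
  <title>Standard OBS instance at build.opensuse.org</title>
  <description>Interconnect to the reference server</description>
  <remoteurl>https://api.opensuse.org/public</remoteurl>
</project>
```

Every rebuild a linked instance schedules against such a project turns into requests to the reference server's interconnect API, which is how a wave of simultaneous rebuilds can exhaust the request queue.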
Resolution
We “just” weathered the storm for now and created follow-up action items:
| Action Item | Owner |
|---|---|
| Review concurrency for interconnect requests | Backend Developer Team |
| Review project setups in SUSE's OBS to avoid unnecessary requests from this instance | Backend Developer Team |
| Long term: Scaling Interconnect Feature #15348 | Developer Team |
Lessons Learned
First and foremost: You don’t have scaling problems until you have scaling problems!
What went well?
- Automatic alerts from our monitoring informed us about problems quickly
What went wrong?
- It took us too long to declare the incident
- We did not declare the incident resolved
Where did we get lucky?
- Production logs included information about where the bulk requests were coming from, and that OBS instance is also under our control.
Timeline (CET)
- 12:52 We received alerts about application performance
- 12:58 We realized from our monitoring that we were dropping connections
- 13:01 People started to complain on our IRC support channel about dropped connections
- 13:05 The application performance went back to acceptable levels
- 13:15 We declared the incident
- 13:17 Our monitoring declared the alerts as resolved