Severe Service Degradation: OBS Unavailable
There was a service degradation of our reference server.
On December 7, 2023, for 35 minutes, OBS responded slowly for anyone trying to use the server, and in many cases connections were dropped completely with the error message: “This website is under heavy load (queue full)”.
We want to give you some insight into what happened and what we are doing to avoid similar problems in the future.
Detection
We were notified through automatic alerts from our monitoring. Additionally, affected people used IRC to tell us about the problem.
Root Cause
An unusually high number of HTTP requests was made to the interconnect API, exhausting our request queue. The majority of these requests came from a single OBS instance.
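The error message users saw is the standard response of an application server whose request queue has overflowed. As a rough sketch, assuming the frontend runs behind Phusion Passenger (whose default error page matches the message quoted above; the values here are illustrative, not our production configuration), the relevant knobs in the Apache vhost would be:

```apache
# Illustrative Passenger tuning, not our actual production values.
# PassengerMaxRequestQueueSize caps how many requests may wait for a
# free application process. Once the queue is full, new clients are
# rejected with "This website is under heavy load (queue full)"
# instead of queueing indefinitely.
PassengerMaxPoolSize 10
PassengerMaxRequestQueueSize 100
```

A bounded queue rejects excess requests early rather than letting latency grow without limit, which matches the mix of slow responses and dropped connections described above.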
Trigger
Whenever a major event (such as a large drop of maintenance updates) happens for the openSUSE Leap and/or SUSE Linux Enterprise distributions, many OBS instances that connect to our reference server via the interconnect feature start to schedule rebuilds against those changes.
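For readers unfamiliar with the interconnect feature: a local OBS instance links to our reference server by creating a remote project whose meta points at the reference server's public API. A minimal sketch of such a project meta (the name and URL follow the usual convention and are shown here for illustration):

```xml
<!-- Remote project on a local OBS instance. Build dependencies
     resolved against this project are fetched from the reference
     server through its public API (the interconnect). -->
<project name="openSUSE.org">
  <title>Standard OBS instance at build.opensuse.org</title>
  <description>Interconnect to the reference server</description>
  <remoteurl>https://api.opensuse.org/public</remoteurl>
</project>
```

Every rebuild a linked instance schedules against such a project turns into requests to the reference server's interconnect API, which is how a wave of simultaneous rebuilds can exhaust the request queue.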
Resolution
We “just” weathered the storm for now and created follow-up action items:
| Action Item | Owner |
|---|---|
| Review concurrency for interconnect requests | Backend Developer Team |
| Review project setups in SUSE's OBS to avoid unnecessary requests from this instance | Backend Developer Team |
| Long term: Scaling Interconnect Feature #15348 | Developer Team |
Lessons Learned
First and foremost: You don’t have scaling problems until you have scaling problems!
What went well?
- Automatic alerts from our monitoring informed us about problems quickly
What went wrong?
- It took us too long to declare the incident
- We did not declare the incident resolved
Where did we get lucky?
- Production logs included information about where the bulk requests were coming from, and that OBS instance is also under our control.
Timeline (CET)
- 12:52 We received alerts about application performance
- 12:58 We realized from our monitoring that we were dropping connections
- 13:01 People started to complain on our IRC support channel about dropped connections
- 13:05 The application performance went back to acceptable levels
- 13:15 We declared the incident
- 13:17 Our monitoring declared the alerts as resolved