Unexpected outage on Friday, September 2nd
Author | Message |
---|---|
Joined: 20 Apr 11 Posts: 388 Credit: 822,356,221 RAC: 0 |
Hi, here are the technical details of our downtime on Friday, September 2nd 2011, for reference. Please feel free to ask questions if you think any details are missing. Server problems seem to have started around noon local time on Friday, September 2nd (EEST, which is UTC+3, so that's 9:00 UTC and 2:00 PDT), with the first round of OOM kills of the MySQL DB and error messages. There was another round 4 hours later at 16:00, until finally at 21:00 that evening the server seems to have died. I restarted the server about 6 hours later at 3:15, and full services should have been restored at 3:30 (0:30 UTC and 17:30 PDT). I can see successful requests until 21:00, which means the full outage should have been limited to about 6 hours starting from that time. However, it's possible that server response times might have been affected and that BOINC clients were returning errors due to these events. This seems to have been similar to the outage we had about a month ago: there was a high load on the server with out-of-memory errors, and the OOM killer took out the MySQL DB (since it uses most of the memory). I have also found evidence of AVC and audit errors that suggest SELinux might be a factor. These errors could just be symptoms of the memory exhaustion, but I'll still investigate that side as well. I've already made some configuration changes that will hopefully help to minimize such events in the future. For now we'll keep results in the DB for only 14 days, as the DB size is a big factor in memory usage. I would be interested to hear if anyone has any objections to storing results for an even shorter time. This is probably not necessary since, as I mentioned last time, there are some other changes I can make to help with server load. -w |
Joined: 5 May 11 Posts: 233 Credit: 351,414,150 RAC: 0 |
Personally I have no issues with bringing it to 14 days; you could even bring it down to 7 days. Any less would start to get iffy for anyone chasing faults et al. Certainly coming down to seconds like a certain infamous Project is nuts :) Another thought for the medium term. If over time, it transpires that memory is a real issue - and the mainboard can take more - I suspect an appeal for the cash to buy the additional memory would be filled in short order, memory is not that expensive these days that we cannot just fill the mainboard and have done with it after an appeal for funding. |
Joined: 20 Apr 11 Posts: 388 Credit: 822,356,221 RAC: 0 |
"Another thought for the medium term. If over time, it transpires that memory is a real issue - and the mainboard can take more - I suspect an appeal for the cash to buy the additional memory would be filled in short order, memory is not that expensive these days that we cannot just fill the mainboard and have done with it after an appeal for funding." Database servers certainly benefit from having a lot of memory to play with. However, it's not a matter of buying more memory, since we rent our server, but rather of changing our server plan to one with more memory. Unfortunately, our current service provider doesn't offer plans with more memory (at least not at the moment), but I have one in sight that does have better options in that area. At the moment it does seem that 14 days is working with the current load, as the DB size has come down. -w |
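
For anyone who wants to cross-check the times quoted in the first post against their own client logs, here is a minimal sketch of the time-zone arithmetic. It assumes fixed offsets (EEST as UTC+3, PDT as UTC-7) and takes the event times straight from the post; the variable and function names are only illustrative and are not part of any project tooling.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets assumed for early September 2011: EEST = UTC+3, PDT = UTC-7.
EEST = timezone(timedelta(hours=3), "EEST")
PDT = timezone(timedelta(hours=-7), "PDT")


def convert(local_dt):
    """Return the (UTC, PDT) equivalents of a timezone-aware datetime."""
    return local_dt.astimezone(timezone.utc), local_dt.astimezone(PDT)


# Event times as quoted in the post, in server local time (EEST).
first_oom = datetime(2011, 9, 2, 12, 0, tzinfo=EEST)    # first OOM kills
server_died = datetime(2011, 9, 2, 21, 0, tzinfo=EEST)  # last successful requests
restarted = datetime(2011, 9, 3, 3, 15, tzinfo=EEST)    # server restarted
restored = datetime(2011, 9, 3, 3, 30, tzinfo=EEST)     # full services back

# Length of the full outage, from the last successful request to full service.
outage = restored - server_died

if __name__ == "__main__":
    events = [
        ("first OOM kills", first_oom),
        ("server died", server_died),
        ("restarted", restarted),
        ("services restored", restored),
    ]
    for label, local_dt in events:
        utc, pdt = convert(local_dt)
        print(f"{label}: {local_dt:%d %b %H:%M} EEST = "
              f"{utc:%d %b %H:%M} UTC = {pdt:%d %b %H:%M} PDT")
    print("full outage:", outage)
```

Measured from 21:00 to 3:30 local time the difference comes to six and a half hours, which is in line with the roughly 6 hours mentioned in the post. Note that the restore time falls on September 3rd in EEST and UTC but still on September 2nd in PDT.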