September 13th, 2013 Service Disruption – post mortem
The SurveyToGo system did not accept new data on Friday, September 13th, 2013, from approximately 9:30pm to 11:30pm GMT+2. Offline data collection was not affected during the disruption; however, it was not possible to upload collected data or to make changes to data already collected. Once the service was restored, all collected data was uploaded and the SurveyToGo system operated as usual.
The outage originated with a disk used for database transaction log backups, which became full and in turn prevented the transaction log backups from running. As a result, the transaction log kept growing and quickly consumed all of the space on its own disk, which caused every insert or update to the system to fail. Retrieving information and logging into the system still worked.
While SurveyToGo runs in a fully mirrored environment to overcome server issues, the mirroring system relies on the transaction log of the primary host and therefore could not provide a solution in this case. Our data center team was alerted within minutes of the occurrence and started handling the issue immediately. Because of the large size of the transaction log, it took around 1.5 hours to complete the emergency procedure that restored the transaction log to a working state. At that point most system capabilities were restored, and the full system was back in full production mode half an hour later.
Every service disruption is a chance to learn and improve. In this case, our main monitoring system, which alerts on any service disruption, is based (among other things) on a mechanism that simulates a login to the system. Because the login facility still worked during this incident, it is clear that we need to improve the monitoring system so that it checks uploading data as well as logging in, giving a more “real life” test of the system. In addition, our disk monitors were set to alert at a threshold that was too small, so that by the time they went off there was little we could do to avoid a service disruption. We will raise the thresholds on the disks so that alerts fire earlier.
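The two monitoring changes described above can be illustrated with a minimal Python sketch. It is an illustration only and not our actual monitoring code: the function names, the 25% free-space threshold, and the idea of passing the login and upload steps in as callables are all assumptions made for the example.

```python
import shutil
from typing import Callable, Dict


def disk_needs_alert(free_bytes: int, total_bytes: int,
                     min_free_fraction: float = 0.25) -> bool:
    """Return True when free space falls below the alert threshold.

    The 25% default is an assumed value; a larger fraction alerts
    earlier and leaves more time to react before the disk fills.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return (free_bytes / total_bytes) < min_free_fraction


def path_needs_alert(path: str, min_free_fraction: float = 0.25) -> bool:
    """Apply the threshold check to the disk that holds `path`."""
    usage = shutil.disk_usage(path)
    return disk_needs_alert(usage.free, usage.total, min_free_fraction)


def run_synthetic_check(steps: Dict[str, Callable[[], None]]) -> Dict[str, str]:
    """Run every probe step and record "ok" or the failure message.

    Each step is a hypothetical callable (for example a login probe
    and an upload probe) that raises an exception when it fails.
    """
    results: Dict[str, str] = {}
    for name, step in steps.items():
        try:
            step()
            results[name] = "ok"
        except Exception as exc:  # any failure marks the step as failed
            results[name] = f"failed: {exc}"
    return results


def is_healthy(results: Dict[str, str]) -> bool:
    """The service counts as healthy only if every step succeeded."""
    return all(status == "ok" for status in results.values())
```

With a check of this shape, a probe whose login step succeeds but whose upload step fails is reported as unhealthy, which is the situation a login-only check could not detect during this incident.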
We are constantly trying to improve our service offering, and availability is a major aspect of the service. We will use this disruption to learn and to improve our availability in order to provide you with the best service we can. Finally, we wish to extend a very warm thank you to all of our customers, who were very understanding of the issues and waited patiently while we investigated and worked on bringing the service back up safely.
Visit us at: http://www.dooblo.net