September 13th, 2013 Service Disruption – post mortem

Overview

The SurveyToGo system did not accept new data on Friday, September 13th, 2013, from approximately 9:30pm to 11:30pm GMT+2. Offline data collection was not affected during the disruption; however, collected data could not be uploaded, and existing collected data could not be changed. Once the service was restored, all collected data was uploaded and the SurveyToGo system operated as usual.

Root Cause

The outage originated with a disk used for database transaction log backups. That disk became full, which in turn prevented the transaction log backups from running. As a result, the transaction log kept growing and quickly consumed all of the space on its own disk, which caused every insert or update to the system to fail. Retrieving information and logging into the system still worked.

While SurveyToGo runs in a fully mirrored environment to overcome server issues, the mirroring system relies on the transaction log of the primary host and thus could not provide a solution in this case. Our data center team was alerted within minutes of the failure and started handling the issue immediately. Because the transaction log had grown so large, the emergency procedure to restore it to a working state took around 1.5 hours. At that point most system capabilities were restored, and the full system was back in full production mode half an hour later.

Lessons learned

Every service disruption is a chance to learn and improve. Our main monitoring system, which alerts on any service disruption, relies (among other checks) on a mechanism that simulates a login to the system. Because login still worked during this incident, it is clear that we need to extend the monitoring to test uploading data as well as logging in, so that it exercises the system in a more "real life" way. In addition, our disk monitors were set to alert at a threshold that was too small, so by the time they went off there was little we could do to avoid a service disruption. We will raise the thresholds on our disks so that alerts fire earlier, while there is still room to act.
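As a rough illustration of the two improvements described above, here is a minimal Python sketch. It is not our production monitoring code: the function names, the 30% and 15% free-space thresholds, and the idea of passing the login and upload tests in as simple callables are all assumptions made for the example.

```python
import shutil
from typing import Callable, Dict


def disk_alert_level(free_bytes: int, total_bytes: int,
                     warn_free: float = 0.30,
                     critical_free: float = 0.15) -> str:
    """Classify free disk space as 'ok', 'warning', or 'critical'.

    The default thresholds are illustrative assumptions. The point is
    that the warning fires while plenty of space is still left.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    free_fraction = free_bytes / total_bytes
    if free_fraction <= critical_free:
        return "critical"
    if free_fraction <= warn_free:
        return "warning"
    return "ok"


def check_path(path: str) -> str:
    """Apply disk_alert_level to the disk that holds `path`."""
    usage = shutil.disk_usage(path)
    return disk_alert_level(usage.free, usage.total)


def synthetic_check(probes: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Run every probe and record whether it succeeded.

    A probe that raises an exception is counted as a failure, so a
    broken upload path cannot hide behind a working login.
    """
    results = {}
    for name, probe in probes.items():
        try:
            results[name] = bool(probe())
        except Exception:
            results[name] = False
    return results


def is_healthy(results: Dict[str, bool]) -> bool:
    """The service is healthy only if every probe passed."""
    return all(results.values())
```

In the situation described in this post, a login probe would have returned True while an upload probe would have failed, so is_healthy would have reported a problem even though login alone looked fine.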

Looking forward

We are constantly trying to improve our service offering, and availability is a major aspect of the service. We will use this disruption to learn and to improve our availability, in order to provide you with the best service we can. Finally, we wish to extend a very warm thank you to all of our customers, who were very understanding and waited patiently while we investigated and worked on bringing the service back up safely.

Visit us at: http://www.dooblo.net
