Archive for the ‘SurveyToGo Service’ Category

September 13th, 2013 Service Disruption – post mortem

September 15, 2013

Overview

The SurveyToGo system was not accepting new data on Friday, September 13th, 2013, from ~9:30pm to 11:30pm GMT+2. During the service disruption, offline data collection was not affected; however, it was not possible to upload collected data or to make changes to any collected data. Once the service was restored, all collected data was uploaded and the SurveyToGo system operated as usual.

Root Cause

The outage originated with a disk used for database transaction log backups that filled up, which in turn prevented the transaction log backup from running. As a result, the transaction log grew and quickly consumed all of the space on its own disk, which caused any inserts or updates to the system to fail. Retrieving information and logging into the system still worked.

While SurveyToGo runs in a fully mirrored environment to overcome server issues, the mirroring system relies on the transaction log of the primary host and thus could not help in this case. Our data center team was alerted within minutes of the failure and started handling the issue immediately. Because of the large size of the transaction log, the emergency procedure to restore it to a working state took around 1.5 hours. At that point most system capabilities were restored, and the full system was back in full production mode half an hour later.

Lessons learned

Every service disruption is a chance to learn and make ourselves better. In this case, our main monitoring system, which alerts on any service disruption, is based (among other things) on a mechanism that simulates a login to the system. Because the login facility still worked during this incident, it is clear we need to improve the monitoring system to check data uploads as well as logins, so that it runs a more “real life” test of the system. In addition, our disk monitors were set to alert with a threshold that was too small, so that by the time they went off there was little we could do without a service disruption. We will raise the thresholds on disks so that alerts fire earlier.
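To make the two monitoring changes above concrete, here is a minimal Python sketch. It is an illustration only and not the monitoring code SurveyToGo runs: the function names, the probe names and the 30% / 15% free-space thresholds are all assumptions chosen for the example. The first function runs several synthetic probes (for example a login and an upload) and reports each one separately, so a working login cannot hide a failing write path. The second maps free disk space to an alert level with a generous warning threshold.

```python
import shutil
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class CheckResult:
    name: str
    ok: bool
    detail: str = ""


def run_synthetic_checks(probes: Dict[str, Callable[[], bool]]) -> List[CheckResult]:
    """Run every probe and report each one separately.

    A probe that raises is recorded as a failure instead of aborting the run,
    so one broken path (e.g. upload) does not stop the other checks.
    """
    results = []
    for name, probe in probes.items():
        try:
            ok = bool(probe())
            detail = "" if ok else "probe returned a falsy value"
        except Exception as exc:  # any error means the simulated action failed
            ok, detail = False, f"{type(exc).__name__}: {exc}"
        results.append(CheckResult(name, ok, detail))
    return results


def disk_alert_level(total_bytes: int, free_bytes: int,
                     warn_free: float = 0.30, critical_free: float = 0.15) -> str:
    """Map free space to 'ok', 'warning' or 'critical'.

    The default thresholds are illustrative assumptions, not recommended values.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    free_fraction = free_bytes / total_bytes
    if free_fraction <= critical_free:
        return "critical"
    if free_fraction <= warn_free:
        return "warning"
    return "ok"


def check_disk(path: str) -> str:
    """Check the volume that holds `path` using the standard library."""
    usage = shutil.disk_usage(path)
    return disk_alert_level(usage.total, usage.free)
```

In use, the hypothetical `login` and `upload` probes would each perform the real action against a test account, and an alert would be raised whenever any `CheckResult.ok` is false or `check_disk` returns anything other than `"ok"` for the transaction log or backup volume.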

Looking forward

We are constantly trying to improve our service offering and availability is a major aspect of the service. We will be using this disruption to learn and improve our availability offering in order to provide you with the best service we can. Finally, we wish to extend a very warm thank you to all of our customers who were very understanding of the issues and patiently waited while we investigated and worked on bringing the service back up safely.

Visit us at: http://www.dooblo.net

Categories: SurveyToGo Service

June 29th outage – post mortem

July 1, 2012

Overview

The SurveyToGo servers were down on Friday, June 29th, 2012, from ~5:30pm to 7:00pm GMT+2. During the outage, offline data collection was not affected; however, access to the SurveyToGo Studio was down, along with the ability to upload collected data. Once the service was restored, all collected data was uploaded and access to the SurveyToGo Studio was restored.

Root Cause

The SurveyToGo offering is hosted on the Amazon AWS cloud, along with businesses such as Instagram, Pinterest and Netflix. Amazon provides a very solid cloud offering.

This outage originated with major storms that swept through the eastern United States on Friday evening. The storms knocked out Amazon Web Services’ Northern Virginia hub, causing SurveyToGo, along with a number of popular sites such as Pinterest, Netflix, Instagram and SocialFlow, to go down for a few hours. More about this can be found here: http://thenextweb.com/insider/2012/06/30/amazon-web-services-outage-causes-netflix-instagram-heroku-and-more-to-grind-to-a-halt/

While SurveyToGo runs in a fully mirrored environment to overcome local Amazon availability zone issues, the power loss that hit the Amazon East region was so extensive that it affected multiple availability zones along with the Amazon management console, leaving no chance for our local measures to be of any use.

Lessons learned

Every service disruption is a chance to learn and make ourselves better. In this case, the power loss to AWS was so extensive that, from a technical architecture perspective, there isn’t much we could reasonably have done to prepare for it. However, for Dooblo, the lessons learned center on the following:

  • Customer notification of service issues and customer updates.
  • Better monitoring facilities, including more publicly visible ones, so that customers have a go-to web page for status updates and availability queries.

Looking forward

We are constantly trying to improve our service offering and availability is a major aspect of the service. We will be using this outage to learn and improve our availability offering in order to provide you with the best service we can. Finally, we wish to extend a very warm thank you to all of our customers who were very understanding of the issues and patiently waited while we investigated and worked on bringing the service back up safely.
