June 29th outage – post mortem
The SurveyToGo servers were down on Friday, June 29th 2012, from approximately 5:30pm to 7:00pm GMT+2. Offline data collection was not affected during the outage; however, access to the SurveyToGo Studio was down, along with the ability to upload collected data. Once the service was restored, all collected data was uploaded and access to the SurveyToGo Studio was restored.
The SurveyToGo offering is hosted on the Amazon AWS cloud, along with businesses such as Instagram, Pinterest and Netflix. Amazon provides a very solid cloud offering.
The outage was caused by major storms that swept through the eastern United States on Friday evening. The storms knocked out Amazon Web Services’ Northern Virginia hub, causing SurveyToGo, along with popular services such as Pinterest, Netflix, Instagram and SocialFlow, to go down for a few hours. More about this can be found here: http://thenextweb.com/insider/2012/06/30/amazon-web-services-outage-causes-netflix-instagram-heroku-and-more-to-grind-to-a-halt/
While SurveyToGo runs in a fully mirrored environment to overcome issues local to a single Amazon availability zone, the power loss that hit the Amazon US East region was so extensive that it affected multiple availability zones along with the Amazon management console, leaving our local measures no chance of being effective.
Every service disruption is a chance to learn and improve. In this case, the power loss to AWS was so extensive that, from a technical architecture perspective, there isn’t much we could reasonably have done to prepare for it. For Dooblo, however, the lessons learned center on the following:
- Customer notification of service issues and customer updates.
- Better monitoring facilities, including more visible ones, so that customers have a go-to web page for status updates and availability queries.
We are constantly trying to improve our service offering, and availability is a major aspect of the service. We will use this outage to learn and to improve our availability in order to provide you with the best service we can. Finally, we wish to extend a very warm thank you to all of our customers, who were very understanding of the issues and waited patiently while we investigated and worked on bringing the service back up safely.
Visit us at: http://www.dooblo.net