
So Your AWS-based Application is Down? Don’t Blame Amazon


After a busy day in London I returned home to read the news of issues in one of Amazon’s US data centre locations causing problems with EC2 and database (RDS) instances.  It seems the services of many Internet companies were affected, including Reddit, Quora, Hootsuite and FourSquare.  Is it fair that Amazon should shoulder the blame for the loss of service to the customer, or is there an underlying issue of design here?

First of all, from an availability and resiliency standpoint, it’s worth having a look at Amazon’s definition of regions and availability zones.  AWS is currently available in five regions: US East (Northern Virginia), US West (Northern California), EU (Ireland), Asia Pacific (Singapore), and Asia Pacific (Tokyo).  Within these regions there are multiple “availability zones” – separate locations which Amazon claim are “engineered” to be insulated from failures in other availability zones, presumably as physically separate data centres with independent networks, power delivery and so on.  On the face of it, it seems reasonable to assume that if Amazon claim resiliency within a region through the use of availability zones, then designing an infrastructure that sits in a single zone should be acceptable; I disagree.
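For reference, where an instance lands is something you control explicitly at launch time.  Here is a minimal sketch using Amazon’s boto3 Python SDK (my choice of tooling for illustration; the AMI ID and instance type are placeholder values) of spreading a simple deployment across two zones in a region rather than putting everything in one:

    # Sketch: launch one instance in each of two availability zones within a
    # region, rather than placing everything in a single zone.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    for zone in ["us-east-1a", "us-east-1b"]:        # two zones, same region
        ec2.run_instances(
            ImageId="ami-12345678",                  # placeholder AMI
            InstanceType="t3.micro",                 # placeholder instance type
            MinCount=1,
            MaxCount=1,
            Placement={"AvailabilityZone": zone},
        )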

As far as I am aware, Amazon publishes no specific details on how their infrastructure is plugged together and inter-operates across geographic boundaries.  Therefore it’s impossible to understand how availability zones actually work and how they have been engineered to isolate against failure.  As we saw yesterday, the whole of region US East was affected (and at the time of writing still is) regardless of location, making it obvious that the availability zone protection isn’t guaranteed in all circumstances.

When organisations design their own data centres, they understand their business requirements and the infrastructure is based on that information, including how and where data centres should be sited.  Financial organisations, for instance, are required to site their data centres a certain distance apart for resiliency.  Features such as synchronous replication at the array level, high availability and application data replication can all be used to ensure service is not disrupted, because the infrastructure team have (hopefully) engaged with the application owners to understand their specific requirements.  If that requirement were (for example) 100% data integrity, then data would need to be synchronously replicated to another location to ensure it could be accessed in a recovery scenario.
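To illustrate that last point, here is a small, entirely hypothetical Python sketch of the difference between synchronous and asynchronous replication; in the synchronous case a write is only acknowledged once the remote copy has been committed, which is what a 100% data integrity requirement implies (the dictionaries simply stand in for storage at two sites):

    # Hypothetical sketch: synchronous vs asynchronous replication.
    # "primary" and "remote" stand in for storage at two data centres.

    primary, remote = {}, {}
    replication_queue = []

    def write_synchronous(key, value):
        """Acknowledge only after both copies are committed (no data loss)."""
        primary[key] = value
        remote[key] = value          # must succeed before we acknowledge
        return "ack"                 # caller knows both sites hold the data

    def write_asynchronous(key, value):
        """Acknowledge immediately; the remote copy catches up later."""
        primary[key] = value
        replication_queue.append((key, value))   # shipped to the remote site later
        return "ack"                 # data is at risk if the primary site fails now

    def drain_replication_queue():
        """Periodic background task that brings the remote site up to date."""
        while replication_queue:
            key, value = replication_queue.pop(0)
            remote[key] = value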

Amazon have, with AWS, provided generic infrastructure without publishing specifics on how that infrastructure is delivered.  This is fine, as AWS is delivered as a service; however, availability zones are not guaranteed against all failures (merely engineered against them), and it would be foolish to assume any organisation could guarantee against all possible disaster scenarios.

If you are delivering a service using cloud infrastructure, it is your responsibility to determine the level of failure you are prepared to accept.  That could mean running services across multiple providers, a subject I discussed 2.5 years ago in this post: http://www.thestoragearchitect.com/2008/12/16/redundant-array-of-inexpensive-clouds-pt-ii/.  Although that post was more storage focused, the concepts still apply to application design.  If you’re starting a business from scratch, then there’s no excuse these days not to engineer across multiple regions or even multiple providers (in fact, the effort of going multi-region will be comparable to that of going multi-provider).  Obviously some applications will be more difficult to implement in a diverse manner than others; however, looking at the four web-based applications I quoted at the top of this article, I expect that all of them have a large degree of read-only traffic and a lot of “write-new” data, with only a small percentage being updates.  That being the case, it would be relatively easy to distribute read I/O geographically and to stage writes in the same manner, synchronising data on a periodic basis.
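As a rough sketch of that pattern (the region names and functions here are purely illustrative, not taken from any of the services mentioned above), reads are served from the copy closest to the client, new writes are staged locally, and a periodic job synchronises them out to the other regions:

    # Illustrative sketch of a multi-region read/write-staging pattern.
    from collections import defaultdict

    REGIONS = ["us-east", "us-west", "eu-west"]      # hypothetical deployment regions

    datastore = {region: {} for region in REGIONS}   # per-region copy of the data
    staged_writes = defaultdict(list)                # new writes awaiting synchronisation

    def read(key, client_region):
        """Serve read traffic from the copy closest to the client."""
        return datastore[client_region].get(key)

    def write(key, value, client_region):
        """Stage 'write-new' data locally and acknowledge immediately."""
        datastore[client_region][key] = value
        staged_writes[client_region].append((key, value))

    def synchronise():
        """Periodic job: push staged writes out to every other region."""
        for source_region, writes in staged_writes.items():
            for key, value in writes:
                for target_region in REGIONS:
                    if target_region != source_region:
                        datastore[target_region][key] = value
            writes.clear()

    # Example: a write staged in us-east becomes readable in eu-west after a sync.
    write("post-123", "hello", "us-east")
    synchronise()
    assert read("post-123", "eu-west") == "hello"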

The Storage Architect Principles

  • Basing your infrastructure in “the cloud” is not a bad thing to do
  • You must understand your business service requirements and design to them
  • You must understand the service offering of the cloud provider
  • Design around availability, resiliency, and therefore mobility, at the application layer
  • Using multiple providers is a good thing
  • Don’t let cost savings blind you into reducing service quality

There’s one other thing to bear in mind (as the final bullet point above alludes to): US East is also the cheapest location for Amazon services (and I presume the largest).  The cynic in me wonders if some of these service implementations have been based on cost rather than service level availability, especially where the services are free to the end user.

About Chris M Evans

Chris M Evans has worked in the technology industry since 1987, starting as a systems programmer on the IBM mainframe platform, while retaining an interest in storage. After working abroad, he co-founded an Internet-based music distribution company during the .com era, returning to consultancy in the new millennium. In 2009 Chris co-founded Langton Blue Ltd (www.langtonblue.com), a boutique consultancy firm focused on delivering business benefit through efficient technology deployments. Chris writes a popular blog at http://blog.architecting.it, attends many conferences and invitation-only events and can be found providing regular industry contributions through Twitter (@chrismevans) and other social media outlets.
  • http://mebbi.net Richard

    I totally agree, and it’s refreshing to see someone else thinking the same rather than just jumping on the ‘Let’s trash AWS’ wagon because they didn’t research or design their own cloud implementation intelligently using the *tools* provided (by Amazon).

    As you quite rightly say, it’s common sense to design your infrastructure for resilience no matter where it may be situated (ground or cloud) and the strength of the Amazon cloud is that it lets you do exactly that, quickly and easily.

    Agreed, it’s not ideal that US-East is having problems, and they can’t escape a degree of blame; it will most certainly have brought to light some technical shortfalls and will have an impact on their business.  However, let’s hope they’re not the only ones to learn from this.

  • Xav

    I appreciate both of your comments (Richard and Chris). We are one of the companies impacted by this outage for the past 24 hours. On one level, that of being (part of the team) responsible for architecting our solution on AWS, I am clear now that there are some new things to learn, consider and implement here that weren’t on our radar until now. On another level, we’ve been running with no issue on AWS for 2.5 years and perhaps we got a little overly confident where we should have been more cautious. We felt like we had not cut corners, with an abundance of backups running throughout the day, and we felt confident with how confident AWS seemed about their platform. Fundamentally though, per both your articles, I’m clear that at the end of the day, I’m the one who is going to thrive or dive in this semi-new paradigm of Cloud Computing. We’re hurting now after being awake 24 hours trying to mitigate the damage, but I know after some sleep and food that we’ll be stronger out of this. Hopefully, the clients we serve will have as optimistic a view as I do. Thanks for the wake-up call in any case. I like the taking-personal-responsibility approach rather than trying to blame AWS. Overall, they have been an amazing partner and very inventive with their view and implementation of the Cloud!

    • http://www.brookend.com Chris Evans

      Xav

      It’s a shame you’ve had to experience the learning curve in such a dramatic way. Hope you resolve your problems.

      Chris

  • http://Www.virtualpro.co.uk Craig Stewart

    Great article Chris!

    I can’t help but wonder what damage has been done to the general cloud concept as a result of this outage.  The sensationalist headlines in the press these last few days will have a lasting effect on public perception and that’s a shame.

    By design I believe that Amazon AWS offers a 99.95% uptime SLA. This in itself means that businesses building only on AWS are willing to accept around four and a half hours of downtime per year. If that is unacceptable to the business then building across multiple availability zones or service providers is a key message.

    When looking at moving to a cloud platform your due diligence of the service provider, their offering and their infrastructure is key. Know the weak points and design your solution to mitigate them.

    Although public confidence has been dented by this, the positive here is that cloud service design will improve. We must learn from the mistakes, a point Xav made very well above.

    • http://www.brookend.com Chris Evans

      Cheers Craig

      I think one incident will not stop the rollercoaster…

      Chris

  • http://www.langtonblue.com Rob Lyle

    Craig,

    99.95% SLA … what is “uptime” and what happens *when* it is not delivered? If the consequences do not sufficiently motivate the service provider, it’s not worth having.

    I often compare weak SLAs to the ROI of an insurance policy that never gets claimed. Only when you need it, do you discover it isn’t there!

    –Rob.

  • http://softwareinsane.com Philip

    Really like the post! Technology really only does what you tell it to do… building on cloud providers requires a certain architecture and mindset. I doubt that new startups and companies really just ‘get it’ yet…

    If you are in SF next week (for Google I/O?) I would love to invite you to the ‘T1000 Gathering’ meetup on May 9th to discuss what strategies we can promote to ‘startup land’ to build new things with technical failure in mind, and how to prevent issues like the ones that the EBS outage generated.

    It is an informal meetup with no set speakers but a big topic to discuss – would love your input http://bit.ly/mpU5tT

    • http://www.brookend.com Chris Evans

      Philip, thanks for the positive comments. Unfortunately I’m based in the UK and not planning to be in SF next week. I should add the event to my diary tho’ for a future date.

      Regards
      Chris
