Ever since someone first tried to draw a packet-switched network that was resilient to failure, they have probably used a picture of a cloud. Cisco’s official iconography for such a network is a cloud. It has been taken for granted for quite some time that if you throw a TCP packet at the Internet it’ll somehow get there, and you don’t have to worry about how.
Back before some Cloud Evangelists I’ve met were even old enough to buy a drink, we had SETI, which could leverage distributed computing that was available on demand and was just ‘out there somewhere’. It’s not new.
So now Cloud is synonymous with computing platforms that offer high availability and scalability and are resilient, but the truth of the matter is that they aren’t.
Cloud is great for people who want easy scalability, easy resilience and easy peace of mind. Services like AWS have great value-adds like ELB and EBS, but if you’re like me and keep yourself on 24/7 on-call and escalation for your platform, it’s just not a good idea, because you don’t KNOW what is actually where or which services share which common risk factors.
With a public cloud a lot of this is abstracted away from you, which for most people is great, unless you want to guarantee that your services are resilient.
On someone else’s platform you have no idea what the power layout is, and you probably don’t care, but the diagram below shows how you should employ resilience and resource diversity at a per-rack level if you want to know what’ll go down when a given component fails:
All applications should exist on at least two servers for redundancy, and probably on more for horizontal scalability. If an application runs on two servers, they should be in separate racks; if it runs on four servers, they should be in two racks fed from separate UPSs, PDUs and fuses. Double up the racks and you double your resilience. It’s quite simple. One point to note is that the batteries in the UPS aren’t wired in series; they are independent, highly available ‘packs’ with N+1 redundancy.
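To make that placement rule concrete, here is a minimal sketch in Python of how you might check it against your own inventory. Everything in it is an assumption for illustration: the `Server` record, its `rack` / `pdu` / `ups` fields and the example names are hypothetical, and a real inventory would also track fuses, switches and so on as extra fields.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    # Hypothetical inventory record: which rack, PDU and UPS a server depends on.
    name: str
    rack: str
    pdu: str
    ups: str


def single_points_of_failure(servers):
    """Return the components that every server of an application depends on.

    If this list is non-empty, losing any one of those components takes the
    whole application down, no matter how many servers it runs on.
    """
    usage = defaultdict(set)
    for s in servers:
        for kind in ("rack", "pdu", "ups"):
            usage[(kind, getattr(s, kind))].add(s.name)
    everyone = {s.name for s in servers}
    return sorted(comp for comp, names in usage.items() if names == everyone)


# Two servers in separate racks, but both fed from the same UPS.
web = [
    Server("web1", rack="rack-a", pdu="pdu-a1", ups="ups-1"),
    Server("web2", rack="rack-b", pdu="pdu-b1", ups="ups-1"),
]

# The same application after moving web2 onto a second UPS.
fixed = [
    Server("web1", rack="rack-a", pdu="pdu-a1", ups="ups-1"),
    Server("web2", rack="rack-b", pdu="pdu-b1", ups="ups-2"),
]

print(single_points_of_failure(web))    # the shared UPS shows up here
print(single_points_of_failure(fixed))  # nothing shared by every server
```

The point of the sketch is only that the check is trivial once you have the data. The hard part is having the data, which is exactly what you lack on someone else’s platform.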
At this point it’s quite obvious that you can double up and diversify within a cabinet, then diversify again by spreading applications across racks to guard against top-of-cabinet access-layer switch failures and distribution-board bus-bar failures. The Cloud response to this is to use load balancers and servers in different regions in a manner similar to this:
The problem here is that you’ve set up your servers in different availability zones and load balance between them, but do you know how resilient they are?
- Do you know what the power / network / geo relationship is between the availability zones and the load balancer?
- Do you know whether, when you lose an availability zone, your two remaining servers reside on the same physical server / PDU / UPS / access layer switch?
- Do you know how resilient the load balancer is?
- Do you know if your EBS volumes inexplicably share something in common?
The answer to these questions (and more) is basically no. You may have some meta information from your Cloud console that helps you guess at the answers, but do you actually know? More likely than not the information is abstracted away, company confidential or just not documented because “it’s the cloud”.
I know the answers to those questions because my colleagues and I ran the network cables, installed the switches, configured the power distribution, checked / configured / wired the load balancers, racked the servers, deployed the software and once all that was done we tested it.
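Once you know the physical layout, answering “what goes down when this component fails?” becomes a lookup rather than a guess. The sketch below is one hedged way to express that in Python. The inventory shape, the component names and the `impact_of_failure` helper are all hypothetical, and it makes the simplifying assumption that a server dies if any component it depends on fails (so dual-fed servers are not modelled).

```python
def impact_of_failure(apps, servers, failed):
    """Report the state of each application after one component fails.

    apps:    {application name: [server names]}
    servers: {server name: set of component ids the server depends on}
    failed:  the component id that has failed
    """
    # Assumption: any failed dependency takes the server down.
    dead = {name for name, deps in servers.items() if failed in deps}
    report = {}
    for app, hosts in apps.items():
        surviving = [h for h in hosts if h not in dead]
        if not surviving:
            report[app] = "down"
        elif len(surviving) < len(hosts):
            report[app] = "degraded"
        else:
            report[app] = "ok"
    return report


# Hypothetical inventory: the web tier is spread out, the database tier is not.
servers = {
    "web1": {"rack-a", "pdu-a1", "ups-1", "switch-a"},
    "web2": {"rack-b", "pdu-b1", "ups-2", "switch-b"},
    "db1": {"rack-a", "pdu-a2", "ups-1", "switch-a"},
    "db2": {"rack-a", "pdu-a1", "ups-1", "switch-a"},
}
apps = {"web": ["web1", "web2"], "db": ["db1", "db2"]}

for component in ("ups-1", "pdu-a1", "switch-b"):
    print(component, impact_of_failure(apps, servers, component))
```

In this made-up layout, losing `ups-1` only degrades the web tier but takes the database tier down completely, which is the kind of shared risk you can only see if you know what is plugged into what.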
Of course everyone tests their resilience. The awesome guys at Netflix deploy ChaosMonkey to test that their resilience works. It’s an amazing tool and an amazing testing methodology, but at the end of the day they were let down by the cloud.
Instagram did everything right with load balancing, horizontal scaling and lots of monitoring but they still went down.
Acquia’s Dev Cloud allows you to use ELB etc. for your Drupal hosting infrastructure on AWS, but they still went down.
I’m not criticising any of these companies or the awesome DevOps people and SysAdmins that work there, nor am I criticising AWS; after all, it’s an amazing platform. I’m simply saying that unless you do resilience yourself things are going to go wrong, because you can’t know the whole picture on someone else’s Cloud, and I’m never going to risk my platform by putting it on one.
A Cloud provider is an excellent way to have resilience, save money and speed things up when you first start building your platform. But when it gets to the point where you start measuring downtime in dollars rather than “time I would’ve been doing something else” it is time to move your critical infrastructure to something you know.