How Reliable is the Cloud?


Recently, I have heard a few stories from people who have had less-than-great experiences in the cloud. Outages and slow responses have led them to reconsider their 'bold' move. Part of this is to do with the capability they were using, but before we get to that, let's take a step back and consider "how do I make something reliable?" By reliability I mean availability, i.e. how many minutes or seconds of downtime are acceptable.

When I first started work I was a microelectronics design engineer. One of the first things I worked on was a device that translated the physical movement of a plane's flap into a digital number (representing the angle) that the flight computer could then use to determine how much to move the flaps based on the pilot's direction. Obviously, this is not a device you want to get wrong - it needs to be super reliable. We did our best, but the folks building the system on the aircraft figured that despite our best intentions things would go wrong. So, they didn't just double up, they doubled up again and then again. They also took our accuracy figures and chopped off the bottom four digits to ensure that any variation was eliminated in the design. Did it work? Thankfully yes - to my knowledge none of the planes ever experienced any "reliability" or availability issues with the feedback sensors on the flaps.

How is this relevant to the cloud? Well, it's really the same. If you buy one instance of the cloud and it's got an availability of 99.95%, then you will have potentially 21.6 minutes of downtime in any 30-day month. What are the odds that it will happen? Well, if your availability is 99.95%, you are two and a half times as likely to have an outage as to hit a hole in one in golf. If, however, you can get 99.99% availability, then you are into the realms of only 4.32 minutes of downtime a month, where you are dealing with the same odds as the chance of matching two samples of DNA. So, the difference in availability ranges from a lucky stroll on the golf course to the evidential basis by which someone's liberty can be removed.
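If you want to check the downtime figures yourself, here is a minimal sketch of the arithmetic. It assumes a 30-day month, and the function name `downtime_minutes` is just mine for illustration, not anything a cloud provider gives you.

```python
def downtime_minutes(availability_pct: float, days: int = 30) -> float:
    """Potential minutes of downtime over `days` days at a given availability."""
    total_minutes = days * 24 * 60  # 43,200 minutes in a 30-day month (assumed)
    return total_minutes * (1 - availability_pct / 100)


for pct in (99.95, 99.99):
    print(f"{pct}% availability: {downtime_minutes(pct):.2f} minutes per month")
```

Change `days` to 365 and you get the yearly figure instead, which is often how availability is quoted in contracts.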

Many of you will be a little circumspect and look at the theory, wondering why there have been headlines regarding services in the cloud failing. So, to improve the availability and reduce the probability of an outage, you can take two instances of your cloud service. If the original availability is 99.95%, then the odds of both instances disappearing at the same time are 1 in 4,000,000 (that is, 1 in 2,000 multiplied by 1 in 2,000). That's less likely than being struck by lightning, so pretty good, and twice as unlikely as the 5-sigma discovery requirement of particle physicists. However, if the original spec is 99.99%, then we are in the realms of a statistically improbable 1 in 100,000,000.
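The same sums can be sketched for any number of instances. This is only an illustration, and it assumes every instance fails completely independently of the others, which is the simplification the maths relies on; `combined_unavailability` is a name I have made up for the example.

```python
def combined_unavailability(availability_pct: float, instances: int) -> float:
    """Probability that all instances are down at once.

    Assumes failures are fully independent, which real networks and
    data centres only approximate.
    """
    unavailability = 1 - availability_pct / 100
    return unavailability ** instances


for pct in (99.95, 99.99):
    p = combined_unavailability(pct, 2)
    print(f"Two instances at {pct}%: about 1 in {round(1 / p):,}")
```

Passing a larger `instances` value shows how quickly the odds shrink as you add more nodes, again only for as long as the independence assumption holds.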

While the maths is reassuring, it assumes that every event is utterly independent, which in the inter-connected world of networks and data centres isn't strictly true, but you get the idea. You can of course keep going to the utterly rock solid surety of 5 nodes - which may seem fanciful, but it's the basis of the way the Internet stays so reliable to meet the demands of today's applications. Not surprisingly, the more cloud you have the better it is. But for this to be a redundant system, the instances also have to have similar characteristics. With cloud computing this really means they need to be similar in how they are experienced. For example, if you have a web site and it goes down, you want the replacement to give the same experience, not a degraded one. Going back to our plane example: you wouldn't want it to fly less well if the primary sensor failed and the secondary kicked in - you wouldn't want any change at all.

In the past, the economics of physical data centres, equipment and networks made the "backup" site a second class citizen, as it was a straight cost/risk assessment. The great thing about "cloud based computing" is that - like networks - you can get redundancy without penalising performance or cost. However, if your backup is half way around the world, latency means you will feel the absence of the primary, and the applications you can use will be more limited. So you need to have two on the same continent, at least, and preferably some place where most traffic aggregates. For example, from a European perspective, you don't want your primary in the 45th most connected place (a westerly island favoured by AWS and others), where your cloud is really only there because of the cheap corporate tax rates enjoyed by the cloud provider, and your other site in North America. Better to be in and around the centre of what is a market of 400 million people.

With Interoute's Virtual Data Centre you can pick up to 5 sites in 5 of the biggest locations in Europe (with more being added as I write). You can run as many as you want simultaneously, and you don't pay for the network in between zones.

In short, you can build the most resilient, scalable, public-private computing platform in the world, now, today, this minute.
