management: Is Five Nines Availability Possible, or Even Needed?
For the past six years, Information Technology Intelligence Consulting (ITIC) has performed a survey of server and operating system reliability. In 2014, this survey found z/OS to be the system with the highest availability. This was no different from previous years, confirming what most people take as fact: z/OS on IBM mainframes is king when it comes to stability and reliability.
Another interesting fact from this survey is that 79% of respondents required five-nines availability for their mission-critical systems. In fact, five-nines, or 99.999% availability, is quickly becoming the standard target for key systems. But can you really get five-nines availability, and do you really need it?
Scheduled vs Unscheduled
'Five-nines' availability comes from the telecommunications industry, where this 'carrier grade' availability is expected for hardware and systems. To achieve five-nines, a system can be down for at most 5.26 minutes per year (or 5.27 for a leap-year). There is no standard, committee or ruling body formalising five-nines and what it means. So it is an often-used term that can mean different things, depending on where you're coming from.
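As a rough sketch of where the 5.26-minute figure comes from, the downtime budget for any number of nines can be computed directly. The function name `downtime_minutes_per_year` is purely illustrative, and a 365-day year is assumed unless another length is passed in.

```python
def downtime_minutes_per_year(nines: int, days_in_year: int = 365) -> float:
    """Maximum downtime (minutes per year) allowed by an availability of N nines."""
    unavailability = 10 ** -nines            # five nines -> 0.00001
    minutes_in_year = days_in_year * 24 * 60
    return minutes_in_year * unavailability


if __name__ == "__main__":
    for n in range(2, 6):
        print(f"{n} nines: {downtime_minutes_per_year(n):.2f} minutes per year")
    # Leap year variant of the five-nines budget
    print(f"5 nines (leap year): {downtime_minutes_per_year(5, 366):.2f} minutes")
```

Each extra nine divides the budget by ten, which is why the step from four to five nines is so much harder than it sounds.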
Let's start with IPLs. Regular z/OS IPLs are essential for upgrades and system maintenance. Assuming each IPL takes an hour from start to finish, a budget of 5.26 minutes per year allows at most one IPL roughly every eleven years for five-nines. This isn't going to happen. What we can do is put together a Parallel Sysplex with workload sharing or something similar, so that while one system is IPLed, others handle the workload. This kind of duplication will eat up a lot of money and resources to set up and maintain. Even with this configuration, you may still need to take down the entire Sysplex from time to time.
Many providers claiming five-nines availability will add small print to get around this problem. By excluding scheduled downtime, five-nines becomes a lot easier.
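The IPL arithmetic can be checked with a short sketch. It assumes, as the text does, a one-hour IPL and no other downtime at all; the function name `years_between_outages` is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def years_between_outages(outage_minutes: float, availability: float) -> float:
    """Minimum average gap (in years) between outages of the given length
    that still keeps the system within the availability target."""
    budget_per_year = MINUTES_PER_YEAR * (1 - availability)
    return outage_minutes / budget_per_year


if __name__ == "__main__":
    for label, target in [("five-nines", 0.99999), ("four-nines", 0.9999)]:
        gap = years_between_outages(60, target)
        print(f"{label}: one 60-minute IPL every {gap:.1f} years")
```

At four-nines the same one-hour IPL fits a little more often than once a year, which is still far short of a typical maintenance schedule.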
A complete computing application or system is made up of many pieces. Think hardware, software, networks, and applications. A failure in any affects availability.
The foundation of it all is the data centre. The Uptime Institute has created a standard classifying data centre performance and availability into four Tiers. In a previous whitepaper, they state that the tier with the highest availability (Tier IV) will aim for a maximum availability of 99.995%. Even more interesting, they state that most Tier IV data centres won't get this 99.995%. So if 99.995% is the best we can do, anything running in only one data centre cannot make five-nines. We need to divide our workload between two data centres in different cities or countries: a steep increase in costs.
Let's take another component: the network. Australia's largest telecommunications provider, Telstra, offers availability options for leased lines (at a cost) of up to 99.99% (four-nines). So again, we need to do something extra to bring this up to five-nines. And remember that this covers only the components provided by the vendor. Your own network infrastructure adds more components that could fail.
Similar arguments can be found for all components needed by any application. For example, an online z/OS CICS/DB2 application receiving traffic from WebSphere MQ relies on six major components: data centre, z/OS, CICS, DB2, WebSphere MQ and the network. And this is just the mainframe side. To achieve five-nines availability here, average downtime for each component must be less than one minute per year (about 53 seconds): close to six-nines availability. The required average reliability increases with the number of components.
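To illustrate why a second data centre changes the picture, here is a minimal sketch of the textbook formula for redundant sites. It assumes the sites fail independently and that failover is instantaneous and flawless, neither of which is guaranteed in practice; `redundant_availability` is an illustrative name.

```python
def redundant_availability(site_availability: float, sites: int) -> float:
    """Availability of a service that is up as long as at least one site is up.

    Assumes independent site failures and perfect, instant failover.
    """
    return 1 - (1 - site_availability) ** sites


if __name__ == "__main__":
    tier_iv = 0.99995  # the Tier IV target quoted in the text
    for n in (1, 2):
        print(f"{n} site(s): {redundant_availability(tier_iv, n):.7%}")
```

Under these idealised assumptions two sites comfortably exceed five-nines on paper; the real-world result depends on how well the failover itself works.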
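The effect of chaining components can be sketched as follows, assuming every component must be up for the service to be up and that failures are independent. The function names are illustrative, and the six components are the ones listed above.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def chain_availability(availabilities):
    """Availability of a service that needs every component to be up."""
    total = 1.0
    for a in availabilities:
        total *= a
    return total


def required_per_component(target: float, components: int) -> float:
    """Equal availability each component needs so the chain meets the target."""
    return target ** (1 / components)


if __name__ == "__main__":
    # Six components that each only just achieve five-nines
    whole = chain_availability([0.99999] * 6)
    print(f"Six five-nines components: {whole:.5%} overall")

    # What each of the six must deliver for five-nines overall
    each = required_per_component(0.99999, 6)
    seconds = (1 - each) * MINUTES_PER_YEAR * 60
    print(f"Needed per component: {each:.6%} ({seconds:.0f} seconds per year)")
```

Six components that each only just meet five-nines give an overall figure noticeably below five-nines, which is the point of the paragraph above.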
More fine print can help if you want five-nines bragging rights: limiting the components covered. So a hardware manufacturer may talk about availability for their hardware device alone, or a z/OS team their z/OS systems excluding hardware and data centre outages.
What Does Unavailable Mean?
Anyone working with availability or service level agreements will be familiar with the problem of defining availability (or unavailability). For example, Automated Teller Machine (ATM) transaction authorisations must be approved within 20 seconds. So if response times exceed this, the system is effectively down, even though everything may be working fine, but slowly.
Even if systems are performing, end-users may consider the system unavailable. Consider a computing system where one function is unavailable, but the rest are working fine. Or an end-user that can't use a system until a security change is made.
Training and application complexity are other areas to muddy the waters. If an end-user cannot do something because of lack of training or complicated user interfaces, that function is effectively unavailable.
Humans and Black Swans
The 2014 ITIC survey included another interesting point: 44% of network errors were from human error. Other surveys differ in this figure. But regardless of the details, human error is a large cause of errors and downtime. It can be minimised by ongoing training, change control, monitoring and problem post-mortems. However it cannot be predicted or eliminated. Nor can any other Black Swan event such as viruses, hacker attacks, terrorist activity, or rogue weather.
Aiming for a maximum of five minutes downtime per year effectively limits downtime to very small events that are automatically resolved by monitoring systems. If a human is required to resolve the problem, they are not going to detect, identify and fix it within five minutes. The moment a human is needed, your five-nines are gone for the year.
Five-nines, or any percentage availability figure, looks good in Service Level Agreements (SLAs). It is easy to understand, and 99.999% (or even 99.99%) are attractive figures. However, SLAs can be very creative when dancing around five-nines. For example, Amazon's EC2 Cloud service offers 99.95% availability, with credits if this is not met. However, it includes much of the expected fine print:
- Events outside of Amazon's control are excluded.
- Scheduled outages are excluded.
- Downtime is measured from when the Cloud service is in a Region Unavailable state: i.e. when Amazon systems detect that the service is down.
- Downtime is measured monthly: last month's failure does not affect this month's statistics.
Another SLA from a different company claimed 99.999% availability with credits and rebates when this was not met. However, the fine print was again interesting. Credits and rebates were only paid after total outages exceeded one hour per month. Or in other words, the company really offered about 99.86%, or two-nines availability.
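The SLA figures discussed here can be translated between percentages and minutes with a small sketch. It assumes a 30-day month for simplicity (real SLAs define their own measurement period), and both function names are illustrative.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # assumption: a 30-day month


def allowed_downtime(availability: float, period_minutes: float = MINUTES_PER_MONTH) -> float:
    """Minutes of downtime a given availability percentage permits per period."""
    return period_minutes * (1 - availability)


def effective_availability(outage_minutes: float, period_minutes: float = MINUTES_PER_MONTH) -> float:
    """Availability actually guaranteed if penalties start only after
    `outage_minutes` of downtime in the period."""
    return 1 - outage_minutes / period_minutes


if __name__ == "__main__":
    print(f"99.95% allows {allowed_downtime(0.9995):.1f} minutes per month")
    print(f"A one-hour threshold guarantees {effective_availability(60):.2%}")
```

A threshold of one hour per month works out to roughly 99.86%, a long way from the advertised 99.999%.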
Are Five-Nines Even Necessary?
A study by Compuware and Forrester Research in 2011 found an average business cost of US$14,000 per minute for mainframe outages. Using this figure, outages for systems with five-nines cost about US$73,500 per year. For systems with four-nines (99.99%, or 52.56 minutes downtime) this increases to about US$735,000, and for three-nines (99.9%, or 525.6 minutes) about US$7.4 million. So a business case to improve availability from four to five-nines needs the extra hardware, software and resources to cost less than about US$660,000 per year. When we're talking about mainframe hardware and software, US$660,000 doesn't buy much. A jump from three-nines to four-nines may save millions, and is a far easier business case.
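The business-case arithmetic above can be reproduced with a short sketch. It takes the US$14,000-per-minute figure from the study as given and assumes the full downtime budget is used every year; `annual_outage_cost` is an illustrative name.

```python
MINUTES_PER_YEAR = 365 * 24 * 60
COST_PER_MINUTE = 14_000  # US$, the average figure quoted from the 2011 study


def annual_outage_cost(availability: float, cost_per_minute: float = COST_PER_MINUTE) -> float:
    """Yearly outage cost if the whole downtime budget is consumed."""
    return MINUTES_PER_YEAR * (1 - availability) * cost_per_minute


if __name__ == "__main__":
    levels = {"three-nines": 0.999, "four-nines": 0.9999, "five-nines": 0.99999}
    for name, availability in levels.items():
        print(f"{name}: US${annual_outage_cost(availability):,.0f} per year")

    saving = annual_outage_cost(0.9999) - annual_outage_cost(0.99999)
    print(f"Moving from four to five nines saves about US${saving:,.0f} per year")
```

The saving from each additional nine is one tenth of the previous one, so the budget available to justify the upgrade shrinks just as the engineering gets harder.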
Another question is: does the business really need such high availability? Here mainframes may be victims of their own reliability. If a system has no outages for two years, the business will expect and demand that this continue. But the fact remains that outages still happen. Just ask Air New Zealand in 2009, HSBC in 2011, or the Royal Bank of Scotland in 2013. Most businesses with mature mainframe-based applications have developed procedures in the case of an outage. For example, ATMs may provide limited service while the back-end host is unavailable. Similarly retail stores continue to have a manual way of accepting credit card payments.
The fact is that five-nines availability for an entire computing service is impossible to guarantee. There is too little room for error and Black Swan or unexpected events are impossible to eliminate. Many service providers advertising such five-nines availability use fine print to make it an easier target.
For many of those original 79% of respondents requiring five-nines availability for their critical systems, this may not be possible to cost-justify. Four-nines, or even three-nines, may give them what they need for a far smaller price tag.