LongEx Mainframe Quarterly - February 2021
Over the past year, we've written a lot of articles about resilience. We've talked about resilience and Sysplex/CICSPlex, and how we implemented this at a client site. We gave some ideas about improving resilience: from eliminating space abends, to what we look for when evaluating resilience from a consultant's point of view. This issue, we've talked about syncpoints and recovery: an important resilience-related topic. And we've weighed in with our own opinions: about how resilience is in the details, and how 'patrolling' can improve resilience.

And resilience is so important: particularly for us mainframe people, where processing is usually the critical 'back end' of the business. In 2014, Gartner estimated the cost of downtime to be $5,600 per minute. Seven years on, this value has to be a lot higher.

We can certainly talk more about resilience, and ideas on improving it. But the bottom line is that for a computer system to be truly resilient, we need three things to go right: attitude, ability and assets.

The Right Attitude

Programmers who don't think about resilience won't code for it. They will spend less time checking data and input, less time on error handling or recovery procedures, and less time on testing. Systems programmers who don't think about resilience won't double and triple check any changes. They won't fully research issues and potential problems, and they will be more likely to rush fixes and changes.

To have a resilient system, you need staff who are engaged in maximizing resilience: staff who have the responsibility for keeping systems resilient. And yes, that means that their management will need to be engaged, and regularly encourage and help their staff to stay on the ball.

Ability

I can't tell you the number of times I've seen a problem caused by a mistake by someone whose knowledge or skills weren't quite up to the task. Now, mistakes happen (less often with the right attitude). However, a lack of knowledge, skills or experience causes so many issues that could otherwise have been avoided.

You'll notice that I've added experience to that last sentence. In any team, you need senior technical staff with a knowledge of the application (and its history) and the technology (programming language, database system, middleware, APIs, environment and frameworks). To get there, they will need training, and time on the job.

Assets

By assets, I really mean enough people with enough time (I was looking for a word beginning with 'A' so it sounded good alongside attitude and ability). If a programmer is rushed or overworked, then the chances of failure are higher.

Assets also means people who are thinking about resilience. Not just the 'nuts and bolts', but taking a step back and thinking from a higher point of view. Looking at development and testing practices, and thinking of ways they can be improved. Looking at change management practices, and whether they're the best they can be. Looking at current and future projects, and seeing if there are things that can be done better. Looking at failure statistics over time, and determining whether resilience is currently acceptable.

Conclusion

High resilience is hard. The entire organisation, from management to programmers and operations, must have resilience as one of its top three issues. They need to be given responsibility for resilience, and a regular report card showing how well they are (or are not) doing. Or in other words: attitude, ability and assets.