LongEx Mainframe Quarterly - November 2021

Opinion: Does Too Much Change Management Actually Reduce Resilience?

At one site where I was working, there was little change management. I could basically make any change I wanted, no questions asked. However, with a change in management, change management was introduced. I now had to 'package' up my changes, create batch jobs to implement them, justify the reason for the change, and have someone review it. And something interesting happened: my changes were better. Fewer issues, better tested and reviewed, safer. But can change management go too far?

Change management is essential for managing any modifications to a computing system or application. This management ensures that the change is required, has been sufficiently tested and reviewed, and does not impact other changes or business requirements.

However, I often see change management processes and procedures that are, well, too difficult. For example, at one site change requests must be submitted at least 14 days before the change is implemented. They must be reviewed and approved by several managers, some of whom do not understand the technical nature of the change, or what it does. Changes are often rejected because an approver did not understand the change, and thought it impacted something it didn't. Staff proposing a change often must attend meetings where many changes are reviewed, in case there are questions about theirs. So, a staff member may wait for an hour on a call in case they need to answer a question for one minute.

What's more, changes are not approved until one or two days before the change. This makes it difficult for teams to schedule staff to implement and monitor the change, and obtain resources from business units to validate function after the change is implemented.

Such 'heavy' change management procedures aren't unusual. Sometimes they are 'knee-jerk' reactions to outages: a quick option that is seen to address an outage that occurred, and prevent it from happening again. Other times, they evolve over time through changes from many different groups.

The problem with heavy change management procedures is that they could do the exact opposite of what is intended: they could reduce resilience. But how?

Any computer system must be maintained. Changes are always required: from security rules to fixes that stop crashes and abends. In many cases, application groups will know about problems, and have fixes ready for them. New hardware will often require software changes.

A heavy change management system will slow these changes down. This could have a few effects:

  • Changes are bottlenecked. Required changes are rejected, or delayed by the change management process. The number of pending changes increases. This long queue may make essential changes difficult to schedule. It also will add pressure to batch changes up together: making more changes in a change window in an effort to reduce the bottleneck. The more changes implemented at the same time, the greater the risk.
  • Changes are discarded. At one site, I found an error, and discussed it with the relevant team. The staff knew about the problem, and had coded a fix 12 months previously. However, staff didn't implement the change as the change management process was too hard, and the staff member had other things that had to be done. In this case, the problem caused issues that the business had to 'live with' for over 12 months.
  • Problems are not fixed. At one site, we made a configuration change. We then found out that this change fixed a problem that the business had been experiencing for some months, but had not reported. If technical staff have to spend more time working on change management procedures, they have less time to resolve problems.
  • Change procedures are bypassed. Many technical staff are passionate about their area, and want it to work efficiently. If change management procedures are seen to be 'too hard,' some staff will look for ways to avoid them. This may include hiding changes inside another submitted change, or looking for ways to implement changes that do not require change management approval.
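
The batching risk in the first point above can be illustrated with simple arithmetic. As a minimal sketch, assume every change in a window carries the same small chance of causing a problem, and that changes fail independently of each other. Both are simplifying assumptions, the 2% figure is purely illustrative rather than a measurement from any site, and the function name is hypothetical.

```python
def batch_failure_probability(per_change_risk: float, batch_size: int) -> float:
    """Chance that at least one change in a batch causes a problem.

    Assumes (for illustration only) that every change has the same risk
    and that changes fail independently of one another.
    """
    if not 0.0 <= per_change_risk <= 1.0:
        raise ValueError("per_change_risk must be between 0 and 1")
    if batch_size < 0:
        raise ValueError("batch_size must not be negative")
    # Probability that every change succeeds, subtracted from 1.
    return 1.0 - (1.0 - per_change_risk) ** batch_size


if __name__ == "__main__":
    # Illustrative 2% risk per change; not a figure from any real site.
    for size in (1, 5, 10, 20):
        risk = batch_failure_probability(0.02, size)
        print(f"{size:2d} changes in one window: {risk:.1%} chance of a problem")
```

Under these assumptions, the chance of a problem in the window rises from 2% for a single change to roughly one in three for twenty changes. Real changes are rarely independent, and a problem in a large batch is also harder to trace to one change, so the sketch only shows the direction of the effect.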

A heavy change management system can also affect the culture of an organisation. If changes are very difficult to get approved, it indicates that the organisation doesn't accept risk. So, changes that have an element of risk may not be permitted. For example, at one site a change was requested, but rejected as it had not been tested in a test environment. However, there was no test environment available. The proposal was to 'try' this low risk change in production, and during the outage window validate that it was successful. If not, there was sufficient time in the change window to back it out. The change was not approved, and was never implemented.

So, what am I saying? Change management is essential to managing resilience and reducing risk from changes. However, I believe that the resilience benefits vary with the 'weight' of change management procedures, and that the relationship is something like a bell curve.

Resilience benefits will increase at first, but as the weight of the change management procedures continues to increase, resilience will decrease. Ideally, we want to find the 'sweet spot' where the resilience benefits are maximised.
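
To make the shape of that curve concrete, here is a toy sketch. The formula is an assumption chosen only because it rises from zero, peaks once, and then declines; it is not derived from any data. The 'weight' scale is arbitrary, and both function names are hypothetical.

```python
import math


def resilience_benefit(weight: float) -> float:
    """Toy curve: zero benefit at zero weight, a single peak at weight 1.0,
    then a steady decline as the process gets heavier.

    The formula is an illustrative assumption, not a measured relationship.
    """
    if weight < 0:
        raise ValueError("weight must not be negative")
    return weight * math.exp(1.0 - weight)


def sweet_spot(candidate_weights):
    """Return the candidate weight with the highest toy benefit."""
    return max(candidate_weights, key=resilience_benefit)


if __name__ == "__main__":
    weights = [i / 10 for i in range(0, 51)]  # 0.0 to 5.0 on an arbitrary scale
    best = sweet_spot(weights)
    print(f"Sweet spot on this toy curve: weight {best}")
    for w in (0.0, 0.5, 1.0, 2.0, 4.0):
        print(f"weight {w:.1f} -> benefit {resilience_benefit(w):.2f}")
```

In practice nobody can measure the 'weight' of a process this neatly. The point of the sketch is only that adding more process past the peak costs resilience rather than adding it.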

The reality is that maintaining computer systems involves some risk. Hardware can glitch, software products have bugs, staff make mistakes, and programmers are not perfect. There is never a way to guarantee 100% success, 0% problems. The aim of change management should be to minimise risk, while allowing change and development to continue as smoothly as possible with the minimum amount of red tape.

David Stephens