LongEx Mainframe Quarterly - May 2022
At a recent site, I took a quick look at their z/OS workload manager (WLM) setup. And it wasn't great. In particular, many workloads had performance indexes (PIs) consistently far higher than 1: these workloads were not meeting their performance goals. When I talked to the z/OS administrators, they said that WLM was last looked at "many years ago" by someone who had since left. Since then, no one had needed to look at it: everything seemed fine.

This is quite common. I regularly see sites with WLM configurations that aren't great. For example, their performance goals may be out, or workloads may be in the wrong service classes. Check out our partner article for a list of WLM issues we regularly see at sites.

But this site highlights an interesting point. Sure, their WLM configuration wasn't perfect. But their z/OS systems seemed to work fine: z/OS administrators weren't aware of any major performance issues. So, why fix something that isn't broken?

What WLM Gives Us

Many sites don't really use WLM. They don't use I/O prioritisation, and have enough storage to eliminate paging. So WLM can't help with I/O or virtual storage. They also have more than enough CPU, so WLM can't help here either. At these sites, it's easy to forget WLM.

However, this assumes that everything continues to work as planned. To give an example, at one site the CPU spiked for a couple of hours, and the performance of their critical workloads tanked. Lots of phone calls from users, lots of attention from management.

Let's think about this situation for a moment. At a site that didn't use WLM, they suddenly needed WLM to triage their workloads: giving their critical workloads priority access to CPU. However, their WLM configuration wasn't up to the task, and they suffered.

To resolve this, the site decided to maintain a CPU buffer: unused CPU kept in reserve in case the same situation happened again. Unfortunately, this CPU buffer costs them buckets of money every year.

WLM is designed to handle this situation.
With a correctly configured WLM, critical workloads can be given priority access to CPU, allowing less-important workloads to wait. z/OS is designed to operate at 100% CPU usage. I don't subscribe to the position that you need unused CPU: there are better options.

Reporting

So, WLM is great for emergencies, or situations when there's not enough CPU to go around. But this isn't everything.

Many sites I see are reactive with their performance monitoring. z/OS performance staff wait for someone to complain before investigating performance issues. And you can understand why. z/OS systems run many different workloads, each with different performance needs and goals. And some workloads (like online transactions) will be more important than others (like COBOL compiles). It's a lot to ask someone to understand all the workloads and their requirements, and continually monitor them.

There is an easier way, as we explain in our partner article. z/OS administrators have already separated z/OS workloads into groups in the WLM configuration, and assigned importance values and performance targets to each group. WLM also provides the Performance Index (PI) to show whether each group is meeting its objectives or not. So, WLM has a list of workload groups, and regularly calculates a single number for each that can be used for performance reporting. Or at least, it can if WLM is configured correctly.

Conclusion

I'm a big fan of a well-managed z/OS system. Such a system will have a well-configured WLM. This ensures that high importance workloads get priority when resources are tight, and provides a simple metric that z/OS performance analysts can use to quickly view how a z/OS system's performance is going.
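
To make the reporting idea concrete, here is a minimal Python sketch of how a PI-based exception report might work. It assumes the usual definitions: for an average response time goal, PI is the actual response time divided by the goal; for an execution velocity goal, PI is the goal velocity divided by the actual velocity. Either way, a PI above 1 means the goal is being missed. The class, function and service class names, and the sample numbers, are all hypothetical. In practice the actual values would come from a performance monitor such as RMF rather than being typed in. Percentile response time goals and discretionary work are not covered.

```python
from dataclasses import dataclass


@dataclass
class ServiceClassPeriod:
    """One WLM service class period (hypothetical structure for illustration)."""
    name: str
    importance: int   # 1 = most important, 5 = least important
    goal_type: str    # "response_time" (seconds) or "velocity" (percent)
    goal: float
    actual: float


def performance_index(period: ServiceClassPeriod) -> float:
    """Return the PI for a period: above 1 means the goal is being missed."""
    if period.goal_type == "response_time":
        # Slower than the goal gives a PI above 1.
        return period.actual / period.goal
    if period.goal_type == "velocity":
        # Lower velocity than the goal gives a PI above 1.
        if period.actual <= 0:
            return float("inf")
        return period.goal / period.actual
    raise ValueError(f"unsupported goal type: {period.goal_type}")


def missing_goals(periods, threshold=1.0):
    """List (period, PI) pairs above the threshold, most important first."""
    missed = []
    for period in periods:
        pi = performance_index(period)
        if pi > threshold:
            missed.append((period, pi))
    # Sort by importance, then by worst PI within the same importance.
    return sorted(missed, key=lambda item: (item[0].importance, -item[1]))


# Made-up sample data: names and numbers are illustrative only.
periods = [
    ServiceClassPeriod("ONLINE", 1, "response_time", goal=0.5, actual=0.75),
    ServiceClassPeriod("STCMED", 2, "velocity", goal=50.0, actual=25.0),
    ServiceClassPeriod("BATCHLOW", 4, "velocity", goal=30.0, actual=60.0),
]

for period, pi in missing_goals(periods):
    print(f"{period.name}: importance {period.importance}, PI {pi:.2f}")
```

Sorting by importance first reflects the point made above: a missed goal on online transactions matters more than one on low-priority batch. A report like this is only as useful as the goals behind it, which is why the WLM configuration needs to be right.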