LongEx Mainframe Quarterly - February 2021

technical: Refresher: Syncpoints and Units of Work

Over the past couple of years, I've been on projects where transaction resilience and recovery have been big issues. Terms like "unit of work" and "syncpoint" have been creeping into reports, and I've needed to dig deeper into two-phase commit processing and how it affects transaction processing.

So, in this article, I'm going to give a quick refresher on syncpoints and units of work. I'll talk about mainframes and z/OS, but the concepts apply to any platform.

Unit of Work and CICS

Let's take CICS as an example. A unit of work is a group of operations that are Atomic: either all operations must complete, or none. Let's take an example: a banking application. A single transaction that transfers $1 from one account to another will involve two operations:

  1. Decrease Account A by $1
  2. Increase Account B by $1

OK, in any real application, there will be far more than two operations (like logs and audit trails), but I'll keep it simple. If our transaction crashes after $1 has been taken from Account A, but before it has been added to Account B, then we have a problem - our $1 will have disappeared. The owner of Account A will be unhappy ("where did my $1 go?"), as will the owner of Account B ("where's my money?"). To fix this problem, we need to 'undo' that first operation decreasing Account A by $1: we need to put the $1 back.

Suppose our transaction doesn't crash. In this case, we want to "commit" our changes: both operations have successfully completed. If the transaction later crashes, we don't want these updates undone. CICS manages units of work for us: it is a recovery (or transaction) manager. When a transaction starts, a unit of work starts. When it ends, the unit of work ends, and all of its recoverable updates are committed.

If a transaction abends, CICS automatically backs out any recoverable update since the last successful unit of work completed.
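Here is a minimal sketch of the banking example as a single unit of work. It uses Python's built-in sqlite3 module as a stand-in recovery manager, not CICS. The table name, account IDs, and the `fail_midway` switch are assumptions made up for illustration.

```python
import sqlite3


def transfer(conn, from_acct, to_acct, amount, fail_midway=False):
    """Move `amount` between two accounts as one unit of work."""
    try:
        # Operation 1: decrease the source account.
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE id = ?",
            (amount, from_acct),
        )
        if fail_midway:
            # Hypothetical switch that simulates a crash between the two operations.
            raise RuntimeError("simulated abend")
        # Operation 2: increase the target account.
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE id = ?",
            (amount, to_acct),
        )
        conn.commit()    # both operations worked: make them permanent
    except Exception:
        conn.rollback()  # undo anything done since the last commit
        raise


def balances(conn):
    """Return the current balances as a dict of account id -> balance."""
    return dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 100), ("B", 100)])
    conn.commit()

    try:
        transfer(conn, "A", "B", 1, fail_midway=True)
    except RuntimeError:
        pass
    after_failure = balances(conn)   # the $1 has been put back into Account A

    transfer(conn, "A", "B", 1)
    after_success = balances(conn)   # both updates are committed together
    return after_failure, after_success
```

After the simulated failure, both balances are unchanged. After the successful transfer, both updates are in place.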

Did you notice how I hid the word "recoverable" in that last paragraph? We only want to back out recoverable updates. We may have some updates we don't want to back out. For example, we may be logging each operation. In this case, we may want to 'see' the error: we don't want to back out the log entry. The log entry is not recoverable.

Transaction managers provide features that programmers can use to indicate if a change is recoverable or not. For example, CICS file definitions can specify if the file is recoverable. CICS also provides commands that we can use to tell CICS that a unit of work has been successfully completed: EXEC CICS SYNCPOINT. This tells CICS to commit all recoverable updates since the last syncpoint. The EXEC CICS SYNCPOINT ROLLBACK command can be used to tell CICS to undo any recoverable change since the last syncpoint: great for error handling routines.
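To make the recoverable/non-recoverable distinction concrete, here is a toy model in Python. It is not the CICS API: the class `ToyRecoveryManager`, its methods, and the resource names ACCOUNTS and AUDITLOG are all hypothetical. The idea is simply that the manager keeps a before-image for updates to recoverable resources only, so a rollback restores those and leaves everything else alone.

```python
class ToyRecoveryManager:
    """Toy model: keeps before-images for recoverable resources only."""

    def __init__(self):
        self.resources = {}     # resource name -> {key: value}
        self.recoverable = {}   # resource name -> True/False
        self._undo = []         # before-images logged since the last syncpoint

    def define(self, name, recoverable):
        """Define a resource and say whether it is recoverable."""
        self.resources[name] = {}
        self.recoverable[name] = recoverable

    def write(self, name, key, value):
        data = self.resources[name]
        if self.recoverable[name]:
            # Remember the old state so the update can be backed out later.
            self._undo.append((name, key, key in data, data.get(key)))
        data[key] = value

    def syncpoint(self):
        """Commit: updates so far can no longer be backed out."""
        self._undo.clear()

    def syncpoint_rollback(self):
        """Back out recoverable updates made since the last syncpoint."""
        while self._undo:
            name, key, had_key, old_value = self._undo.pop()
            if had_key:
                self.resources[name][key] = old_value
            else:
                del self.resources[name][key]


def demo():
    rm = ToyRecoveryManager()
    rm.define("ACCOUNTS", recoverable=True)
    rm.define("AUDITLOG", recoverable=False)

    rm.write("ACCOUNTS", "A", 100)
    rm.write("ACCOUNTS", "B", 100)
    rm.syncpoint()                          # opening balances are committed

    rm.write("AUDITLOG", 1, "debit A by 1")
    rm.write("ACCOUNTS", "A", 99)
    rm.syncpoint_rollback()                 # error handling: undo the debit
    return rm.resources
```

After the rollback, Account A is back at 100, but the audit log entry is still there.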

Other Recovery Managers

CICS isn't the only recovery manager available on z/OS. Db2 can be used as a recovery manager. The SQL statements COMMIT and ROLLBACK do similar things to EXEC CICS SYNCPOINT and EXEC CICS SYNCPOINT ROLLBACK. IMS is another, providing the DL/I commands CHKPT and ROLL/ROLB. Yet another is IBM MQ: the MQI commands MQCMIT and MQBACK commit or roll back changes.

A recovery manager that comes free with z/OS is Resource Recovery Services (RRS). I'll talk more about RRS in a bit.

Recovery managers aren't limited to the mainframe. The X/Open XA standard is the standard used by most non-mainframe platforms to determine how to manage units of work and syncpoints. Most z/OS resource managers and recovery managers can work with XA compliant non-mainframe systems.

Seeing Uncommitted Changes

Suppose we are in the middle of our banking transaction: $1 has been taken from Account A, but not yet put into Account B. What do we want an application that looks at the balance of Account A to see? Do we want it to see the balance less our $1 (even though the withdrawal is not committed), or not?

In the case of CICS and VSAM, the application will see the balance less $1: it will see uncommitted updates. This can be changed if using VSAM Record Level Sharing (RLS) and read integrity.

How about MQ? If an application PUTs a message onto a queue, should another application issuing an MQGET on that queue get that (uncommitted) message, or not? If the MQPUT was under syncpoint, no. If not under syncpoint, yes.
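The MQ behaviour can be pictured with a small toy queue in Python. This is not the MQI: `ToyQueue` and its methods are hypothetical, and the sketch assumes a single putting application with one unit of work in flight. Messages put under syncpoint are held back until the commit. Messages put outside syncpoint are available straight away.

```python
from collections import deque


class ToyQueue:
    """Toy queue: messages put under syncpoint stay hidden until commit."""

    def __init__(self):
        self._visible = deque()   # messages that a getter can receive
        self._pending = []        # messages put under syncpoint, not yet committed

    def put(self, message, under_syncpoint):
        if under_syncpoint:
            self._pending.append(message)
        else:
            self._visible.append(message)

    def get(self):
        """Return the next available message, or None if there is none."""
        return self._visible.popleft() if self._visible else None

    def commit(self):
        """End the unit of work: pending messages become available."""
        self._visible.extend(self._pending)
        self._pending.clear()

    def backout(self):
        """End the unit of work: pending messages are discarded."""
        self._pending.clear()


def demo():
    q = ToyQueue()
    q.put("no-syncpoint", under_syncpoint=False)
    q.put("in-syncpoint", under_syncpoint=True)

    first = q.get()    # the message put outside syncpoint
    second = q.get()   # nothing: the other message is not committed yet
    q.commit()
    third = q.get()    # now the committed message is available
    return first, second, third
```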

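Db2 SQL queries can optionally specify uncommitted read isolation (WITH UR on the statement, or ISOLATION(UR) at bind time) to perform a 'dirty read': read any uncommitted updates. Otherwise, these queries will generally wait for the updating unit of work to commit or back out.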

Two Phase Commit

CICS recovery is pretty simple. We have a transaction, and if it abends, CICS automatically backs out uncommitted updates to recoverable CICS resources. But what about a CICS program that also uses MQ and Db2? CICS, Db2 and MQ are all recovery managers. If a CICS transaction fails, we want to back out all updates to CICS, Db2 and MQ resources. This is where two-phase commit is used. For a successful unit of work, it goes like this:

  1. CICS transaction updates CICS, Db2 and MQ resources
  2. CICS unit of work ends
  3. CICS tells MQ to prepare to commit MQ changes
  4. MQ says 'Ready.'
  5. CICS tells Db2 to prepare to commit Db2 changes
  6. Db2 says 'Ready.'
  7. CICS gets ready to commit its own changes.

This is the first phase. If Db2 or MQ do not come back with a 'Ready' result, CICS will back out the entire unit of work.

Now the second phase. CICS tells MQ and Db2 to proceed with the commit, and commits its own updates. Db2 and MQ also commit their changes. Two phases.
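The two phases can be sketched in a few lines of Python. This is a simplified model, not how CICS, Db2 or MQ actually talk to each other: the `Participant` class, the `can_prepare` flag and the `two_phase_commit` function are hypothetical, there is no logging or in-doubt handling, and the coordinator's own resources are modelled as just another participant.

```python
class Participant:
    """Toy resource manager taking part in a two-phase commit."""

    def __init__(self, name, can_prepare=True):
        self.name = name
        self.can_prepare = can_prepare   # hypothetical flag: will it say 'Ready'?
        self.state = "in-flight"

    def prepare(self):
        """Phase 1: answer True ('Ready') or False."""
        if self.can_prepare:
            self.state = "prepared"
        return self.can_prepare

    def commit(self):
        """Phase 2: make the changes permanent."""
        self.state = "committed"

    def backout(self):
        """Undo the changes made in this unit of work."""
        self.state = "backed out"


def two_phase_commit(participants):
    """Coordinate one unit of work. Returns True if it was committed."""
    # Phase 1: ask every participant to prepare.
    for p in participants:
        if not p.prepare():
            # Someone is not ready: back out the entire unit of work.
            for q in participants:
                q.backout()
            return False
    # Phase 2: everyone is ready, so tell them all to commit.
    for p in participants:
        p.commit()
    return True


def demo():
    good = [Participant("MQ"), Participant("Db2"), Participant("CICS")]
    bad = [Participant("MQ"), Participant("Db2", can_prepare=False), Participant("CICS")]
    good_result = two_phase_commit(good)
    bad_result = two_phase_commit(bad)
    return (good_result, [p.state for p in good],
            bad_result, [p.state for p in bad])
```

In the first run every participant ends up committed. In the second, Db2 does not come back with 'Ready', so every participant is backed out, including MQ, which had already prepared.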

Although MQ and Db2 can be recovery managers, we're using CICS as our recovery manager here. Every transaction needs one process that manages the unit of work for all resources.

Two Phase Commit Support

IMS and CICS support two-phase commit processing. However, many other products do not support it, including Db2 and MQ. These products need to work with something that can manage the two-phase commit, such as CICS or IMS.

z/OS RRS offers two-phase commit support for batch and other environments outside IMS and CICS. It can be called from any program, and provides an API to begin, commit and back out a unit of work.

MQ and Db2 have RRS connectors, allowing batch jobs to update those resources in a two-phase commit environment. Batch jobs can also access CICS resources using the CICS Transactional EXCI interface, and IMS resources using IMS OTMA or APPC/IMS. VSAM RLS also works with RRS using DFSMStvs. WebSphere Application Server always uses RRS, and provides its own APIs for Java programmers to use. Many other z/OS products rely on RRS for two-phase commit support.

Cross System

OK, so two-phase commit works within a system. How about across systems in a Parallel Sysplex? No problem, RRS supports this. How about across different platforms? For example, a JDBC request to Db2 on z/OS from a UNIX application? RRS also supports this.

Conclusion

Units of work and syncpoints aren't the sexiest of computing topics, and a lot of IT people don't really understand how they work. Rather, they rely on recovery managers and software to do the hard work for them. However, it is difficult to create a resilient system without a good understanding of units of work and syncpoints.


David Stephens