Learning from Mistakes
My cat is very agile, readily making large jumps and picking her way along ledges and fences. She rarely makes an error. Occasionally however, she will attempt a jump that is too great, or where some intervening object requires a mid-air adjustment, and she will fall. She lands awkwardly, though on all-fours, and walks away. She does not look back. Her aloof and dignified demeanour speaks to the cat-lover. It is as if the jump had never been attempted. Who ... me?
I sometimes think that the computing profession is like this. Strangely, the more spectacular the mistake, the less we analyse it. Though we talk of failure and, as I have discussed elsewhere, have a professional discourse in which failure plays a central role, we can be remarkably unanalytical.
The recent spectacular system failure of RBS, NatWest and related banks, which has effectively prevented the operation of a major financial institution, with serious consequences for customers and business reputation, is a case in point. There is a real dearth of substantive technical consideration. Public comment has focussed on the senior management, their bonuses and their cancelled corporate entertainment at Wimbledon: in my judgment the least interesting aspect of the matter.
To the best of my knowledge only 'The Register' has published a plausible, albeit speculative, technical analysis. Of course "systems fail systemically", so there is unlikely to be any single simple cause. As far as this account goes, it appears that a failure occurred while a complex, distributed set of updates, batched and scripted (using a third-party batch scheduling tool, CA-7), was being applied across mainframe systems that support ongoing operations. The normal procedure would be to back out of the updates gracefully, something that was provided for. Somehow, during this process, the schedule and/or the queue of back-out operations was deleted (whether as a result of a bug in CA-7, operator action, or both, is unclear). This presumably left the system in an inconsistent state, with the operators unable to go either backwards or forwards. The natural reaction would be to fall back to some sort of checkpoint; I am unclear as to why this was not possible, perhaps there were other ongoing changes, but that is speculation built on speculation. The result was that all the updates had to be processed "manually", whatever that means.
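To make the shape of that failure mode a little more concrete, here is a minimal, purely hypothetical sketch of a batch run that records a back-out (compensation) step for each update it applies. It bears no relation to CA-7's actual design or to the banks' systems; the names (`BackoutQueue`, `run_batch`) and the structure are invented for illustration. The point is only that if the record of back-out steps is lost part-way through, the run can neither be completed nor cleanly reversed.

```python
# Hypothetical illustration only: a toy "batch run with back-out queue".
# Not CA-7's design or API; all names are invented.

class BackoutQueue:
    """Records a compensating (undo) action for every update applied."""

    def __init__(self):
        self._undo_steps = []

    def record(self, undo_fn):
        self._undo_steps.append(undo_fn)

    def back_out(self):
        # Undo the applied updates in reverse order to restore the prior state.
        while self._undo_steps:
            self._undo_steps.pop()()


def run_batch(updates, backout_queue):
    """Apply scheduled updates; on failure, gracefully back out of partial work."""
    try:
        for apply_fn, undo_fn in updates:
            apply_fn()
            backout_queue.record(undo_fn)
    except Exception:
        # The normal procedure: walk the back-out queue to reverse what was done.
        backout_queue.back_out()
        raise

# The failure mode speculated about above: if the schedule and/or the back-out
# queue is itself deleted part-way through (by a bug or by an operator), the
# system can go neither forwards (the remaining work is gone) nor backwards
# (the undo record is gone), and is left inconsistent, to be repaired by hand.
```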
It is known that there had been some substantial outsourcing of IT operations, along with a 'downsizing' of the associated IT teams. The suggestion is that key expertise had been lost during this process and that significant roles, entailing knowledge of the batch scheduling tool and of the setting in which it is applied, had been dispersed across different parts of the organisation. This is plausible, at least as an exacerbating factor, though I am reluctant to accept it without some independent verification: after all, those former employees with sufficient expertise to attempt a diagnosis might also have an axe to grind.
At any rate this is simply an elaborate illustration of my main argument. When such incidents occur, those concerned have a responsibility to provide, immediately, a technical account of the failure that is as accurate as can be delivered at the time. CTOs and CIOs need to step up to the plate. This is a responsibility to the public and to the promotion of a broader understanding of technology and its associated risks. There is a further responsibility, in due course, to disseminate openly a complete analysis that serves the professional community. This responsibility trumps managerial convenience, commercial interest and legal or other back-covering.