The crash that didn’t happen
Someone asked me in a quarterly planning meeting once, with genuine curiosity rather than challenge, what the monitoring stack had bought us over the last quarter. There had been no incidents. The dashboards had stayed green. Things had been, from the outside, uneventful — and the question was fair: what did we get for three weeks of engineering time?
I didn’t have a good answer. Which bothered me, because I’d been there for all three weeks. I’d watched the alerts fire, investigated the slow queries before they cascaded, restarted the service that was leaking memory before it took down anything that mattered. I knew the monitoring had done something. I just had no way to say what.
There’s a paper from 2001 that names this experience so exactly I almost felt accused when I first read it: “Nobody Ever Gets Credit for Fixing Problems that Never Happened”, by Repenning and Sterman. It is a management study, technically, about why quality programmes and process-improvement investments collapse. Their diagnosis: when the investment works, nothing happens, and nothing doesn’t look like evidence. The problems it was preventing are invisible, so management cuts the budget, the capability quietly erodes, a crisis eventually makes the need visible again, and the investment comes back at a higher price than sustaining it would have cost. They call it a capability trap. It felt very familiar.
I understood the diagnosis immediately. It took me a few years in data science to understand why the trap is so hard to escape.
The branch of history you never ran
The fundamental awkwardness in “what did the monitoring prevent?” is that it’s asking for a counterfactual. A counterfactual is the road not taken: what would have happened in the alternate history where things were different? What would have happened, this quarter, if the monitoring hadn’t been running?
The problem is that you only ever get one quarter. You had the monitoring, things were quiet, and that’s the observation. The other version — same team, same environment, same three weeks, no monitoring — doesn’t exist. You cannot observe it. Every causal claim is, underneath, a comparison between something that happened and something that didn’t, and the thing that didn’t happen is the part that makes the claim interesting and the part you have no access to. Statisticians find this awkward enough to have given it a name — the fundamental problem of causal inference — which should tell you how often it bites.
Clinical researchers have been living with this constraint for decades. You give the patient the drug or you don’t; you can’t do both and compare outcomes for the same person. The solution that makes trials work is randomisation across large groups. Randomise who gets the treatment and who doesn’t, and the two groups are, on average, interchangeable on everything you didn’t control, so any difference in outcomes larger than chance alone would plausibly produce can be attributed to the treatment. You still can’t observe both paths for the same patient. Notice what you get instead: the effect on average, across the groups, never for any one person. For most purposes, that’s close enough.
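A toy simulation makes the trade concrete. It is a sketch, not a description of any real trial: every number in it is invented, and the function name simulate_trial is mine. The simulation cheats by knowing both potential outcomes for every unit, which is exactly what reality never allows, and then shows that a coin-flip assignment recovers the average effect without ever seeing both outcomes for the same unit.

```python
import random
import statistics


def simulate_trial(n=10_000, seed=0):
    """Simulate a randomised trial in which both potential outcomes are known.

    All distributions and parameters are invented for illustration.
    """
    rng = random.Random(seed)

    # Each unit has two potential outcomes: y0 (untreated) and y1 (treated).
    # In real life only one of the two is ever observed.
    units = []
    for _ in range(n):
        y0 = rng.gauss(10.0, 3.0)
        effect = rng.gauss(2.0, 1.0)  # the individual effect varies per unit
        units.append((y0, y0 + effect))

    # The true average effect, computable only because this is a simulation.
    true_ate = statistics.fmean(y1 - y0 for y0, y1 in units)

    # Randomise: a coin flip decides which potential outcome gets observed.
    treated, control = [], []
    for y0, y1 in units:
        if rng.random() < 0.5:
            treated.append(y1)
        else:
            control.append(y0)

    # Difference in group means: an estimate of the average effect only.
    estimated_ate = statistics.fmean(treated) - statistics.fmean(control)
    return true_ate, estimated_ate


true_ate, estimated_ate = simulate_trial()
print(f"true average effect:      {true_ate:.2f}")
print(f"estimated from the trial: {estimated_ate:.2f}")
```

The two numbers should land close together, but nothing in the observed data says what the effect was for any single unit. That information was thrown away the moment the coin was flipped.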
Nobody is randomising which engineering teams get monitoring and which don’t across a controlled quarter and publishing the result. There are observational substitutes — staged rollouts, before-and-after comparisons across teams — and every one of them is a weaker version of the experiment nobody will run. The counterfactual remains unobservable, in practice, always. And that’s the real reason the monitoring is hard to defend — not that management is short-sighted, though some are, but that the value genuinely lives in the branch of history you never got to run. When someone asks what it bought, the honest answer is: something, and I cannot tell you how much, because the comparison doesn’t exist.
Building the comparison on purpose
I recognised this constraint properly when I started doing A/B testing. An experiment forces you to build the counterfactual deliberately before you need it: you split the population, run both conditions simultaneously, and the control group is the branch of history that didn’t get the treatment. The comparison exists by construction, because you made it exist.
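One common way to make that split exist by construction is to hash a stable identifier into an arm. The sketch below assumes that approach; the function name assign_arm, the experiment label and the user identifiers are all hypothetical, and a real experimentation platform would add much more (exposure logging, exclusions, guardrails).

```python
import hashlib


def assign_arm(unit_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically assign a unit to 'treatment' or 'control'.

    Hashing the experiment name together with the unit id keeps a unit in
    the same arm for the whole experiment, and makes the split independent
    across differently named experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode()).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "treatment" if bucket < treatment_share else "control"


# Hypothetical population: both arms exist at the same time, by design.
arms = [assign_arm(f"user-{i}", "new-checkout") for i in range(10_000)]
share = arms.count("treatment") / len(arms)
print(f"share in treatment: {share:.3f}")
```

The control arm here is the branch of history without the treatment, running alongside the one with it.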
Running an experiment on a live system while also trying to maintain it is generally a bad idea. But the discipline of designing for a counterfactual before the fact — thinking about what you’d need to observe to make a causal claim — changes how you set up the monitoring in the first place.
The teams in the Repenning and Sterman paper who avoid the capability trap aren’t the ones who argue better about what they prevented. They’re the ones who, during the quiet quarter, generate evidence that something is happening. Near-miss logs. The count of alerts that fired, what they were, and what was done with them. Chaos exercises that deliberately introduce the failure and measure the response, which are the nearest of the three to a real experiment because they intervene rather than wait. None of that is the full counterfactual, since you still can’t see the quarter without monitoring, but it shifts the conversation from “nothing happened” to “here is what the system absorbed, here is the rate at which these events occur, and here is what catching them cost.”
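A near-miss log does not need to be elaborate to be useful. The sketch below is one possible shape for it, under assumptions of my own: the NearMiss fields, the category names and the three example entries are all invented, and summarise is a hypothetical helper rather than part of any monitoring tool.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class NearMiss:
    day: int                 # day of the quarter on which the alert fired
    category: str            # free-text label, e.g. "slow_query"
    reached_users: bool      # did users see any impact?
    hours_to_resolve: float  # engineering time spent on it


def summarise(events, days_in_period):
    """Turn a near-miss log into the figures a planning meeting can use."""
    total = len(events)
    caught = sum(1 for e in events if not e.reached_users)
    return {
        "events": total,
        "caught_before_users": caught,
        "reached_users": total - caught,
        "events_per_week": total / days_in_period * 7,
        "engineer_hours": sum(e.hours_to_resolve for e in events),
        "by_category": dict(Counter(e.category for e in events)),
    }


# Invented entries, purely to show the shape of the record.
log = [
    NearMiss(day=3, category="slow_query", reached_users=False, hours_to_resolve=1.5),
    NearMiss(day=17, category="memory_leak", reached_users=False, hours_to_resolve=2.0),
    NearMiss(day=40, category="slow_query", reached_users=True, hours_to_resolve=0.5),
]
summary = summarise(log, days_in_period=90)
print(summary)
```

Everything in the summary is a count or a sum of things that were actually observed. None of it claims to know what the caught events would have cost.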
It’s the same move that separates a green test that proves something from one that doesn’t: stop asking “did it pass?” and start asking “what would have had to happen for it to fail?” The quiet quarter stays quiet. But you’re building a record of the things that were trying to make it loud, and that, unlike the counterfactual, you can point at.
What the paper is really about
The burning platform — the crisis that finally makes the case for the investment — is what happens when the counterfactual finally runs: expensively, uncontrolled, because the investment was cut. The crisis is the control condition. The comparison becomes observable once you’ve removed the treatment, and everyone can suddenly see what it was buying. Nobody chose that experiment. It runs by attrition.
What I notice now is that the cycle isn’t a failure of communication or political will, though both might be present. It’s a measurement problem: the value lives in an unobservable, and absent deliberate work to approximate it, the only way it ever becomes observable is when the prevention stops. By then, it’s expensive.
The answer I’d give now
If the planning meeting question came up today, I’d give a different answer from the one I gave then. Not a better one exactly — just one that’s honest about what can and can’t be known.
The monitoring fired eleven times this quarter. Eight were resolved before they reached users. Three reached users briefly and were addressed within minutes. I can tell you the rate of events the system is encountering, the cost of meeting them, and what each category of event looked like the last time it got through. I cannot tell you the value of the eight that didn’t reach users, because the counterfactual is unobservable by construction — and anyone who gives you a confident number there is inventing it. The nearest honest substitute is a wide range built from the three that did get through, presented with its width intact. What I can tell you is that the events exist and the system is absorbing them, which is more than “nothing happened” and less than “we prevented £X of damage.”
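For what the wide range might look like in practice, here is a deliberately crude sketch. It leans on a strong assumption, stated loudly: that each of the eight caught events would have cost somewhere between the cheapest and the most expensive of the three that got through. That may well be false, since the caught ones could have been milder or far worse. The cost figures are invented placeholders and avoided_cost_range is a hypothetical helper.

```python
def avoided_cost_range(observed_costs, caught_count):
    """Return a (low, high) range for the cost of events that were caught.

    ASSUMPTION: each caught event would have cost no less than the cheapest
    observed incident and no more than the most expensive one. This is a
    rough bracket, not an estimate.
    """
    if not observed_costs:
        raise ValueError("need at least one observed incident cost")
    return caught_count * min(observed_costs), caught_count * max(observed_costs)


# Invented costs for the three incidents that reached users.
observed = [200.0, 1_500.0, 12_000.0]
low, high = avoided_cost_range(observed, caught_count=8)
print(f"somewhere between {low:,.0f} and {high:,.0f}")
```

With these placeholder inputs the top of the range is sixty times the bottom. That width is the finding, and narrowing it for the slide would be the invention the paragraph above warns about.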
That’s a worse slide than a clean ROI figure. It’s also the only version that isn’t pretending to know something it doesn’t.
Prevention is hard to justify precisely because it works. That’s not a failure of communication. It’s the structure of the question — and the first useful thing you can do is stop expecting an answer the question doesn’t have, and start building an approximation you can actually collect.
Part of an occasional series reframing everyday engineering through a data scientist’s eyes. The ideas here are developed properly in Thinking in Uncertainty and Building with Certainty.