The test that couldn’t have failed

inference
testing
A green tick only means something if a broken version of the code would have turned it red. Pick the wrong input and you can write a test that passes no matter what — proof of nothing, in a reassuring shade of green.
Author

Matthew Gibbons

Published

22 May 2026

It’s the good kind of pull request: small, focused, and it arrives with a test. Someone has written a tiny key–value store — a set, a get, nothing exotic — and a test that stores a value, reads it back, and checks the two agree. The bar is green. You skim the diff, nod, and approve it. So would I.

Here’s the store, near enough:

from collections import defaultdict

class Store:
    def __init__(self):
        self._data = defaultdict(int)   # a missing key reads back as 0

    def set(self, key, value):
        self._data[key]                 # oops: touches the key, never stores `value`

    def get(self, key):
        return self._data[key]
def test_set():
    store = Store()
    store.set("a", 0)
    assert store.get("a") == 0          # green

The test passes. The set method is also completely broken — it never stores anything at all. Both of those things are true at the same time, and that is the whole problem.

It passes because the value the test reached for, 0, is exactly what a missing key already reads back as. A store that works and a store that does nothing whatsoever give the identical answer to this particular question, so the assertion is satisfied either way. The green tick isn’t lying, exactly. It just never had the chance to say anything else.

This is the kind of change any of us would wave through on a good day: a method, a test that exercises it, a reassuring tick, and a feature that has never once worked.

You could file it under write better tests and move on, and you wouldn’t be wrong. But there’s something sharper underneath, and it’s worth slowing down for — because the same mistake comes back later wearing far more expensive clothes, and by then it doesn’t look like a typo in a toy class.

A green tick only counts if red was on the table

We read a passing test as a verdict: the code is correct, ship it. But a test never actually sees correctness. It sees one of two lights come on, and trusts you to work backwards to the cause. That backwards step only holds up if the other light was genuinely possible — if a broken version of the code really would have turned the bar red. A green is worth precisely as much as the red it ruled out.

For this test, there was no red to rule out. The working store returns 0 for key "a". The broken store also returns 0, because 0 is the default and 0 is what we asked it to keep. “The code works” and “the code ignores everything you tell it” make the same prediction — the same green tick — so no number of runs can tell them apart. And a test that can’t tell them apart isn’t weak evidence. It’s not evidence.

It’s the mirror image of the flaky test, the one that goes red when nothing is wrong. A flake at least makes a noise; you know something is off, even if you’re wrong about what. A test that couldn’t have failed makes no noise at all. It sits quietly in the suite for years, counted as coverage, defending nothing.

One number, two jobs

Look at where it goes wrong and it’s almost tidy. The number 0 is playing two parts in the same scene. It’s the value under test — the thing we’re trying to store. And it’s also the store’s resting state — the thing you get back when nothing was stored. One number, moonlighting, so you can’t tell which of its two jobs you’re watching it do.

A statistician would call that a confound: a second thing moving in perfect lockstep with the thing you care about, so that any result could belong to either. Here the input we chose is welded to the exact failure it was meant to catch. Every time the right answer would be 0, the bug produces 0 too. The test was never measuring whether set keeps what you give it. It was measuring whether 0 equals 0 — which it does, cheerfully, in working code and broken code alike.

And you can’t rescue it by leaning harder. Run it twenty times, run it three hundred; twenty greens tell you exactly what one did, because the flaw isn’t that you sampled too little — it’s that there was nothing there to sample. The answer you wanted was never inside this experiment to begin with, and you can’t average your way to information that was never collected.

What that testing advice was always about

The original bug report ends on what looks like a handful of testing tips: use values that aren’t the default; use a different value for each argument; try more than one input. They read like hygiene. They’re really the instinct of someone designing an experiment, arrived at from the engineering side of the fence.

Use a non-default value means: pick an input only the correct code could produce. Store 7, read back 7, and the broken store is caught red-handed — its default is 0, it can’t fake a 7. You’ve forced the two stories to predict different endings, which is the entire game.

Use a different value for each argument means: don’t let your variables disguise one another. A test that calls insert(1, 1) can’t notice if the code swapped key and value, because they’re the same 1. Make them different and the test can suddenly see a mistake it was blind to a moment ago.

Try more than one input means: one example rarely pins down a function. It’s the same reason you wouldn’t trust a straight line drawn through a single point — you want enough different conditions that a clean pass is hard to fake by accident.

None of this is about a longer test suite for its own sake. It’s the difference between an experiment built so the answer has to fall out of it, and one whose result was settled before you pressed run.

I learned to fear this one twice

I learned to spot this as an engineer; I learned to fear it once I’d crossed into data science.

The first time you train a model and celebrate 98% accuracy, and then someone gently points out that 97% of your cases were the same class anyway — that a model which always guessed “no” would have scored almost as well — you feel something familiar curdle. You’ve seen this before. It’s set("a", 0) again, in a lab coat. A number that looks like success and couldn’t have looked like anything else.

The same shape keeps turning up. A feature that quietly leaked the answer into the training data, so the model was being handed the result it was supposed to predict. A metric that’s really just the base rate wearing a rosette. Each time, the encouraging result and the worthless one are identical on the page — for exactly the reason the green tick was: nothing in the setup ever forced them apart.

What crossing from one of these disciplines to the other really teaches you, in the end, is a small and slightly paranoid habit. Interrogate the good news, not the bad. The failures announce themselves; it’s the clean runs, the flattering numbers and the green bars that deserve the suspicion — and the further a number travels up the chain, as a coverage figure on a slide or an accuracy quoted to the board, the more it’s trusted and the less anyone can still see what it was actually measuring. Of each one, ask what a broken version of the world would have had to do to hand you the very same thing — and whether it comfortably could have.

Before you trust the green

So here’s the habit worth building. Before you trust a passing test, ask what a broken implementation would have to do to pass it too. If the answer is “not much” — if returning a constant, ignoring an argument, or doing nothing at all would also come up green — then the test has no teeth, and the green is decoration. Choose inputs a bug couldn’t satisfy by accident. Make your values distinct, and different from the defaults, so that a pass is something only the right answer could have produced.

Which leaves you with two questions where you thought you had one. Does the test pass? And — the quieter, more useful one — could it ever have failed?


Part of an occasional series reframing everyday engineering through a data scientist’s eyes. The ideas here are developed properly in Thinking in Uncertainty and Building with Certainty.