Software engineers treat error as a bug. In data science, error is information. The distance between those instincts is shorter than you’d think.
Author
Matthew Gibbons
Published
10 January 2026
Here’s something that probably runs through most of your working day without you noticing it: assert f(x) == expected. Same input, same output, every time. If the assertion fails, something is broken, and you go fix it. Tests enforce this contract, CI gates on it, and the whole deployment pipeline assumes your code is deterministic. You deal with nondeterminism at the infrastructure level all the time — retries, timeouts, circuit breakers — but the code itself is expected to behave consistently. That instinct has served you well.
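To put the contrast in code, here's a minimal sketch (not from any real codebase): a deterministic function honours exact-equality assertions, while a stochastic process, sampled twice from the same distribution, usually doesn't:

```python
import numpy as np

# Deterministic code honours the contract: same input, same output.
def double(x):
    return 2 * x

assert double(21) == 42  # passes on every run

# A stochastic process breaks that contract: two draws from the same
# Poisson model (mean 1,000) will almost never agree exactly.
rng = np.random.default_rng()
a = rng.poisson(lam=1000)
b = rng.poisson(lam=1000)
# `assert a == b` would fail on most runs, not because anything is
# broken, but because variability is part of what is being modelled.
print(a, b)
```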
So what happens when the nondeterminism moves into the thing you’re modelling, not just the system around it?
Same model, different outcomes
Try this question: how many customers will visit our website tomorrow? You can pull in historical traffic, adjust for day of the week, account for a marketing campaign, and the number you arrive at will still differ from what actually happens. Not because the analysis was bad, but because the thing you’re measuring has real variability in it. Tomorrow isn’t a rerun of today.
To make this concrete, the code below simulates ten independent runs of a simple model, a Poisson distribution with an average of 1,000 visitors per day, over 30 days. Real traffic data are messier than this (overdispersed, autocorrelated, seasonal), but the Poisson keeps things simple enough to see the core point:
```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

fig, ax = plt.subplots(figsize=(10, 5))
fig.patch.set_alpha(0)
ax.patch.set_alpha(0)

days = np.arange(1, 31)
for _ in range(10):
    daily_visitors = rng.poisson(lam=1000, size=30)
    ax.plot(days, daily_visitors, alpha=0.45, linewidth=1, color='#0072B2')

ax.axhline(y=1000, color='#E69F00', linestyle='--', linewidth=1.5,
           alpha=1.0, label='Expected value (λ=1,000)')
ax.set_xlabel('Day of simulation period')
ax.set_ylabel('Visitors')
ax.set_title('Same model, different outcomes: variability is inherent')
ax.set_xlim(1, 30)
ax.set_ylim(900, 1100)
ax.yaxis.grid(True, linestyle=':', alpha=0.4, color='grey')
ax.set_axisbelow(True)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.legend(loc='upper left', framealpha=0.0)
plt.tight_layout()
plt.show()
```
Figure 1: Ten simulations from the same Poisson(λ=1,000) model over 30 days. Every trace used identical parameters — the variation between them is the model working correctly.
Every trace uses the same parameters. The dashed amber line marks the expected value, the long-run average that individual days scatter around. Look at how the ten traces wander apart from each other despite identical settings. That spread is the model faithfully representing something true about the world: the process behind these numbers has genuine randomness in it.
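The size of that spread isn't arbitrary, either. For a Poisson distribution the variance equals the mean, so day-to-day scatter around λ=1,000 is about √1000 ≈ 31.6 visitors. A quick sanity check, reusing the same seed as the figure:

```python
import numpy as np

rng = np.random.default_rng(42)

# For a Poisson distribution the variance equals the mean, so the
# typical day-to-day scatter around lambda = 1,000 is sqrt(1000),
# roughly 31.6 visitors.
samples = rng.poisson(lam=1000, size=100_000)

print(samples.mean())  # close to 1000
print(samples.std())   # close to 31.6
# About 95% of days land within two standard deviations of the mean,
# roughly 937 to 1,063, which is why the plot's 900-1,100 window
# contains almost every point.
```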
The instinct that fires wrong
This is where an engineering background can trip you up. You see a model that doesn’t match the observed outcome, and something in your brain says the model is wrong. In a deterministic system, that reaction is spot on — residual error really does point to a bug. But when you’re working with data, a model that perfectly matches every observation is usually a sign of something worse. It’s overfitting: memorising the quirks of one particular dataset instead of learning the pattern underneath. Think of it as the statistical equivalent of hard-coding a return value. It reproduces the training data exactly, but throw anything new at it and the whole thing falls apart. This tension, between fitting the data you have and generalising to data you haven’t, is called the bias-variance trade-off, and it runs through nearly every modelling decision you’ll encounter.
A comparison might help land this. In software engineering, passing tests mean the system meets the specifications encoded in those tests, which is why tools like property-based testing exist: to probe whether your tests are actually telling you something meaningful. In data science, the equivalent suspicion fires when a model scores perfectly on its training data. A perfect training score usually means the model has mistaken noise for signal, and when it meets data it hasn’t seen before (the real test), performance drops off a cliff.
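To see that cliff in miniature, here's a sketch (a stand-in example, not from the book) that fits two polynomials to the same 15 noisy points drawn from a straight-line process: degree 1, and degree 14, which has enough capacity to memorise the sample:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data-generating process: a straight line plus noise.
def generate(n):
    x = rng.uniform(0, 1, n)
    return x, 2 * x + 1 + rng.normal(0, 0.2, n)

x_train, y_train = generate(15)
x_test, y_test = generate(200)

results = {}
for degree in (1, 14):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    results[degree] = (train_mse, test_mse)
    print(degree, train_mse, test_mse)

# The degree-14 fit all but memorises the 15 training points (training
# MSE near zero), yet typically generalises far worse than the straight
# line: the statistical analogue of hard-coding return values.
```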
Getting comfortable with irreducible error takes some adjusting. Statisticians distinguish between reducible error (uncertainty you can shrink with better features or more data) and irreducible error (randomness baked into the process itself). You can chip away at the first kind, but the second kind always remains. And that residual carries information. It tells you something about the limits of what the data can actually reveal. Once you start reading it that way, as evidence rather than failure, a lot of data science starts to click.
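One way to make that split concrete is to simulate a process where you know the true mechanism, predict with the true model, and look at what's left over. A sketch under assumed parameters (line y = 2x + 1, Gaussian noise with variance 0.25):

```python
import numpy as np

rng = np.random.default_rng(1)

# Known data-generating process: y = 2x + 1 plus noise with variance
# 0.25. The noise term is the irreducible error; no feature or extra
# data can explain it away.
x = rng.uniform(0, 10, 50_000)
y = 2 * x + 1 + rng.normal(0, 0.5, x.size)

# Even predictions from the *true* model leave residual error behind.
residuals = y - (2 * x + 1)
print(np.mean(residuals ** 2))  # close to 0.25, the noise variance
```

No model, however clever, can push the mean squared error below that floor; it can only stop adding reducible error on top of it.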
The data-generating process: an API you can’t inspect
If you’ve ever integrated with a third-party API that ships without documentation, you already know the drill. You call the endpoint (collect data), study the responses (measurements), and build a mental model of the internal logic. But you never get to read the source code. Statistical modelling works the same way, just aimed at natural processes rather than services. Behind every dataset sits a data-generating process: the real-world mechanism that produced the observations (or at least, our best model of it). We never see it directly. We only see its output and reason backwards.
The analogy captures the right activity, reverse-engineering behaviour from observations, but the thing you’re reverse-engineering is different in kind. A data-generating process has no versioning, no changelog, and no stability guarantee. Customer behaviour shifts, markets move, seasons turn, and there’s no status page to tell you when it happens. You spot the drift from the data itself.
And then there’s the deeper difference: determinism. In theory, a well-behaved API returns the same response for the same request — though in practice, hidden state like caches, rate limits, and server-side A/B tests can make even a deterministic service look unpredictable from the caller’s side. A data-generating process takes that further: it returns a distribution of responses, a spread of plausible outcomes, each with an associated probability. The model’s job isn’t to nail the single right answer. It’s to describe that spread.
You already think this way
You already do probabilistic thinking every day. You just don’t build the distributions yourself.
When you look at a monitoring dashboard, you don’t say “our API latency is 50ms.” You say “our p50 is 50ms and our p99 is 200ms.” Those percentiles are distributional summaries. They tell you what’s typical, and they tell you how bad the tail gets.
SLOs work the same way: they’re probability statements. “99.9% of requests complete in under 300ms.” Error budgets are explicit acknowledgements that some failure is acceptable and that perfection isn’t the target. When a burn rate alert fires, you’re being told that the rate of acceptable failure has shifted, which is very close to what data scientists call distributional drift. And if you’ve ever run a chaos engineering experiment, deliberately injecting failures to learn about system behaviour under stress, you’ve already accepted that understanding the distribution of outcomes matters more than testing the happy path. You already set thresholds on distributional properties and make decisions based on tail behaviour rather than averages.
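Both kinds of statement, percentile summaries and SLO attainment, are one-liners once you have the distribution in hand. The latency data below are simulated (a log-normal is a common rough shape for latencies; the parameters are illustrative, not from any real service):

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulated request latencies in milliseconds: most requests fast,
# with a long slow tail. Parameters are illustrative only.
latencies = rng.lognormal(mean=np.log(50), sigma=0.5, size=10_000)

# Percentiles are distributional summaries: typical case and tail.
p50, p99 = np.percentile(latencies, [50, 99])

# An SLO is a probability statement about the same distribution:
# what fraction of requests complete in under 300 ms?
slo_attainment = np.mean(latencies < 300)

print(round(p50), round(p99), slo_attainment)
```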
Data science takes this one step further. Instead of consuming pre-computed distributions from your monitoring tools, you build them from raw data, for systems where no SLO exists yet, so you can reason about scenarios you haven’t observed. The kind of statement you’d produce looks like: “there’s a 5% chance fewer than 950 customers arrive tomorrow, and a 5% chance more than 1,050 do; plan capacity accordingly.” You already make decisions under uncertainty every time you set an alert threshold or size a buffer. The shift is that now you’re constructing the model yourself rather than reading it off a Grafana panel.
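A statement like that can be read straight off simulated draws. A sketch, reusing the Poisson(λ=1,000) model from earlier:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate many plausible "tomorrows" under the Poisson(1000) model,
# then read the 5th and 95th percentiles off the empirical distribution.
tomorrows = rng.poisson(lam=1000, size=100_000)
low, high = np.percentile(tomorrows, [5, 95])
print(low, high)  # roughly 948 and 1052, close to the 950/1050 above
```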
What changes
Once this way of thinking settles in, a few things start to look different.
Error stops being something to eliminate. A model with residual error isn’t broken — it’s being honest about the limits of prediction. The goal is to explain as much of the signal as you can while leaving genuine noise alone. Knowing where that boundary sits is more valuable than pretending it doesn’t exist.
Point predictions start feeling incomplete. “The model predicts 1,000 visitors” carries less weight than “the model predicts between 938 and 1,062 visitors with 95% probability.” That’s a prediction interval. It tells you how much to trust the prediction, and whether the uncertainty is tight enough to act on.
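Where do 938 and 1,062 come from? For a Poisson model the variance equals the mean, so under the usual normal approximation a central 95% interval is λ ± 1.96√λ:

```python
import numpy as np

lam = 1000
# Poisson variance equals the mean, so the standard deviation is
# sqrt(lam); the normal approximation puts 95% of outcomes within
# 1.96 standard deviations of the mean.
half_width = 1.96 * np.sqrt(lam)
print(round(lam - half_width), round(lam + half_width))  # 938 1062
```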
And perhaps most usefully, you start to see how much of this you already know. Percentiles, SLOs, error budgets, capacity planning: these are probabilistic reasoning applied to systems. Data science applies the same thinking to data. The maths gets more formal, but the underlying question is one you’ve been answering throughout your career: given what I’ve observed, what should I expect next, and how confident should I be?
This article covers the opening argument of Thinking in Uncertainty, a book that teaches data science to experienced software engineers. The book continues from here into probability distributions, Bayesian reasoning, regression, inference, and model evaluation — all grounded in the engineering thinking you already have.