Jarryd Aubert.


I tried to prove my tax calculator was correct. I mostly proved it was consistent.

PayeTax has more than three thousand tests. Trying to break the engine on purpose taught me how few of them could actually catch me being wrong.

5 min read

PayeTax has more than three thousand tests. When I tell people that, it sounds like the correctness question is settled. It isn't, and the gap between those two things turned out to be the most useful thing I learned building it.

I built a UK tax calculator first, before anything more interesting, precisely because it has a property a tester loves: given the inputs, there is one right answer. The system around it is fiddly (tax codes, cumulative basis, Scottish rates, student loan plans, all of it), but the output isn't a matter of opinion. Either the number matches what the rules produce or it doesn't. I wanted a first project where I could actually tell whether I'd succeeded, rather than one where "it looks fine" was the best I could say. So I tested it hard, and for a while the test count made me feel safe.

Breaking it on purpose

Then I did the thing I'd do to anyone else's code. I tried to break it on purpose.

I edited the tax engine to be subtly wrong, moving a band threshold by a single pound, and re-ran the suite to see what caught it. Some guards fired, which was reassuring. But the exercise taught me something the green ticks had been hiding: most of my impressive-looking tests prove the engine is consistent, not that it's correct. Those are different claims, and I'd been quietly treating them as one.
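A rough sketch of that kind of one-pound mutation check, for illustration only. The engine here is a deliberately tiny stand-in, not PayeTax's real code: `annualTaxPence`, the `Bands` shape and the two checks are all hypothetical, and the band figures are illustrative.

```typescript
// Hypothetical, simplified engine: whole pounds in, pence out.
type Bands = { personalAllowance: number; basicLimit: number };

// Illustrative figures. The 20% and 40% rates appear below as pence per pound.
const REAL: Bands = { personalAllowance: 12_570, basicLimit: 50_270 };
// The mutant: the same bands with one threshold moved by a single pound.
const MUTANT: Bands = { ...REAL, basicLimit: REAL.basicLimit + 1 };

function annualTaxPence(gross: number, bands: Bands): number {
  const basic = Math.max(0, Math.min(gross, bands.basicLimit) - bands.personalAllowance);
  const higher = Math.max(0, gross - bands.basicLimit);
  return basic * 20 + higher * 40;
}

type Check = { name: string; passes: (bands: Bands) => boolean };

const checks: Check[] = [
  // Hand-worked: 17,430 taxable pounds at 20p each.
  { name: "30k, below the band", passes: (b) => annualTaxPence(30_000, b) === 348_600 },
  // Hand-worked: 37,700 pounds at 20p plus 9,730 pounds at 40p.
  { name: "60k, above the band", passes: (b) => annualTaxPence(60_000, b) === 1_143_200 },
];

// A check "kills" the mutant if it passes on the real bands and fails on the mutated ones.
const killers = checks.filter((c) => c.passes(REAL) && !c.passes(MUTANT)).map((c) => c.name);
// A check "survives" if it stays green either way, so it cannot see this mistake.
const survivors = checks.filter((c) => c.passes(REAL) && c.passes(MUTANT)).map((c) => c.name);

console.log({ killers, survivors });
```

Only the check whose input sits above the moved threshold notices the change. The other stays green, which is the pattern worth looking for in a real suite.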

The clearest example was my golden master, a suite of scenarios covering Scottish rates, every student loan plan, pension relief, the high-income child benefit taper, the awkward edge thresholds. It looks like the heart of the correctness story. It isn't, because it's auto-generated from the engine it's testing. If the engine is wrong but wrong in a self-consistent way, the golden master regenerates to match the mistake and passes. It's a brilliant drift detector and a useless correctness oracle, and the only reason I know that is that I went looking for the seam instead of admiring the coverage.
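The failure mode is easy to reproduce in miniature. The sketch below assumes a golden master is simply "scenario in, recorded engine output out". The names `makeEngine`, `generateGolden` and `matchesGolden` are hypothetical, and the engine is a toy with illustrative figures.

```typescript
type Engine = (gross: number) => number;

// Hypothetical engine factory; the higher-rate threshold is the only knob.
function makeEngine(basicLimit: number): Engine {
  const personalAllowance = 12_570; // illustrative
  return (gross) => {
    const basic = Math.max(0, Math.min(gross, basicLimit) - personalAllowance);
    const higher = Math.max(0, gross - basicLimit);
    return basic * 20 + higher * 40; // pence
  };
}

type GoldenRow = { gross: number; taxPence: number };
const scenarios = [20_000, 60_000, 125_000];

// The golden master is whatever the engine says at the time of generation.
function generateGolden(engine: Engine): GoldenRow[] {
  return scenarios.map((gross) => ({ gross, taxPence: engine(gross) }));
}

function matchesGolden(engine: Engine, golden: GoldenRow[]): boolean {
  return golden.every((row) => engine(row.gross) === row.taxPence);
}

const intended = makeEngine(50_270);
const offByOne = makeEngine(50_271); // wrong, but self-consistent

// Drift detection works: a changed engine no longer matches the committed file.
const committed = generateGolden(intended);
const driftDetected = !matchesGolden(offByOne, committed);

// Correctness checking does not: regenerate and the mistake is recorded as truth.
const regenerated = generateGolden(offByOne);
const mistakeBlessed = matchesGolden(offByOne, regenerated);

console.log({ driftDetected, mistakeBlessed });
```

Both flags come out true. The golden file tells you that something changed, and nothing about whether the new numbers are right.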

So what does check correctness? I had a second oracle that reimplements the tax maths independently, on the theory that two implementations are unlikely to be wrong in the same way. That helped, until I traced its imports and found it pulls the actual rate constants from the same place the engine does. The logic is independent; the numbers aren't. Which means a wrong rate, as opposed to wrong logic, sails through both. The independence I was relying on was partial, and I hadn't noticed.
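A minimal sketch of that partial independence, under the assumption that both implementations read one shared constants object. Everything here (`RATES`, `engineTax`, `oracleTax`, `buggyEngineTax`) is a hypothetical stand-in with illustrative values, not the real engine or oracle.

```typescript
type Rates = { personalAllowance: number; basicLimit: number; basicPct: number; higherPct: number };

// Single shared source of constants (illustrative values).
const RATES: Rates = { personalAllowance: 12_570, basicLimit: 50_270, basicPct: 20, higherPct: 40 };

// Engine: slices income band by band. Result in pence.
function engineTax(gross: number, r: Rates): number {
  const basic = Math.max(0, Math.min(gross, r.basicLimit) - r.personalAllowance);
  const higher = Math.max(0, gross - r.basicLimit);
  return basic * r.basicPct + higher * r.higherPct;
}

// Oracle: a different route to the same answer. Tax everything above the
// allowance at the basic rate, then add the extra owed on the higher slice.
function oracleTax(gross: number, r: Rates): number {
  const taxable = Math.max(0, gross - r.personalAllowance);
  const overLimit = Math.max(0, gross - r.basicLimit);
  return taxable * r.basicPct + overLimit * (r.higherPct - r.basicPct);
}

// A logic bug: the personal allowance is never applied.
function buggyEngineTax(gross: number, r: Rates): number {
  const basic = Math.min(gross, r.basicLimit);
  const higher = Math.max(0, gross - r.basicLimit);
  return basic * r.basicPct + higher * r.higherPct;
}

const samples = [10_000, 30_000, 60_000];
const agrees = (engine: (g: number, r: Rates) => number, r: Rates): boolean =>
  samples.every((g) => engine(g, r) === oracleTax(g, r));

// Wrong logic is caught, because the two formulations diverge.
const logicBugCaught = !agrees(buggyEngineTax, RATES);

// A wrong number is not: both sides read the same mistyped rate and still agree.
const mistyped: Rates = { ...RATES, basicPct: 21 };
const wrongRateCaught = !agrees(engineTax, mistyped);

console.log({ logicBugCaught, wrongRateCaught });
```

Here `logicBugCaught` is true and `wrongRateCaught` is false. Closing that gap would mean giving the oracle its own separately sourced constants, which only helps if they really do come from somewhere else.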

Follow that thread down and you reach an uncomfortable floor. Every layer of testing I have ultimately traces back to values I typed in by hand. If I fat-fingered a threshold into the engine and typed the same wrong number into the test's expected value, everything goes green and the calculator is confidently, comprehensively wrong. No quantity of tests fixes this, because they all share the same author and, in places, the same source of truth.

The one thing I didn't write

There is exactly one thing in the whole system I didn't write: my payslip. The rules, applied by my employer's payroll, produce a real number every month, and the calculator matches it to the penny. That single external check does something none of the three thousand tests can. For that one slice of reality, it confirms the actual figures, not just their internal agreement. It's a sample size of one, and for the case it covers it's worth more than every self-authored test combined, because it's the only oracle I didn't author.

The experiment also corrected me in the other direction, which I think matters more than the bugs it found. I'd been convinced I had a coverage gap, that I needed a test proving a higher-rate change produced a different tax figure. I went to add it and discovered it couldn't work: the engine derives each monthly threshold by dividing the annual figure by twelve and rounding up with Math.ceil, so a one-pound band change rounds to the same monthly threshold and produces an identical result. The test I was sure I needed was impossible by construction, and the sensitivity I wanted was already correctly guarded somewhere else, on the eligibility boundaries, where it actually bites. My critique of my own suite was wrong, and the only way I found that out was by trying to act on it.
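The rounding effect is easy to check by hand. The sketch below assumes the derivation is exactly "annual divided by twelve, rounded up". The function name `monthlyThreshold` is hypothetical and the annual figures are illustrative.

```typescript
// Assumed shape of the derivation described above.
function monthlyThreshold(annual: number): number {
  return Math.ceil(annual / 12);
}

// 50,270 / 12 is about 4189.17 and 50,271 / 12 is 4189.25: both round up to 4190,
// so the one-pound change is invisible at monthly granularity.
const before = monthlyThreshold(50_270);
const after = monthlyThreshold(50_271);

// The monthly figure only moves when the annual figure crosses a multiple of twelve:
// 50,268 / 12 is exactly 4189, while 50,269 / 12 rounds up to 4190.
const exact = monthlyThreshold(50_268);
const justOver = monthlyThreshold(50_269);

console.log({ before, after, exact, justOver });
```

So a test that nudges the annual band by a pound and expects a different monthly result can only pass when that nudge happens to cross a multiple of twelve.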

While I was in there, the checking turned up something more mundane and more telling: a handful of tests were already failing on that branch. An earlier cleanup had simplified the footer and left five assertions pointing at links that no longer existed. That mattered because it wasn't a tax failure. It was a reminder that a suite is only trustworthy when it's actually green on the branch you're talking about.

I came out of this with a calculator I trust more and a test suite I respect less, which is the right way round. I no longer think of it as three thousand proofs of correctness. I think of it as a large, fast proof of consistency, a smaller and partial check on the logic, and a single real payslip doing the actual work of confirming the numbers. I know precisely what each layer can and can't tell me, which is a better place to be than a high test count and a vague sense of safety.

That's the whole lesson, and it long predates AI making code cheap to produce. A test you didn't think hard about mostly tells you the code does what the code does. The question worth asking isn't how many tests you have. It's which of them could actually catch you being wrong, and which are just agreeing with you in a louder voice.