Writing

I tried to prove my tax calculator was right. I mostly proved it agreed with itself.

PayeTax has more than three thousand tests. Trying to break the engine on purpose taught me how few of them could actually catch me being wrong.

2026-06-035 min read

PayeTax has more than three thousand tests. When I tell people that, it sounds like the question is settled: the calculator must be right. It isn't that simple, and finding out why was the most useful thing I learned building it.

I built a UK tax calculator first, before anything more interesting, precisely because it has a property a tester loves: given the inputs, there is one right answer. The system around it is fiddly, tax codes, cumulative basis, Scottish rates, student loan plans, all of it, but the output isn't a matter of opinion. Either the number matches what the rules produce or it doesn't. I wanted a first project where I could actually tell whether I'd succeeded, rather than one where "it looks fine" was the best I could say. So I tested it hard, and for a while the test count made me feel safe.

Breaking it on purpose

Then I did the thing I'd do to anyone else's code. I tried to break it on purpose.

I edited the tax engine to be subtly wrong, moving a band threshold by a single pound, and re-ran the suite to see what caught it. Some guards fired, which was reassuring. But the exercise taught me something the green ticks had been hiding: most of my impressive-looking tests prove the engine is consistent, not that it's correct. Those are different claims, and I'd been quietly treating them as one.

The clearest example was my biggest saved-results test: a long list of scenarios covering Scottish rates, every student loan plan, pension relief, the high-income child benefit taper, and the awkward edge thresholds. It looks like the heart of the correctness story. It isn't, because the expected answers are generated from the same engine they are testing. If the engine is wrong but wrong in a consistent way, the saved results update to match the mistake and pass. It is a brilliant change detector, but a poor proof that the answers are right. The only reason I know that is that I went looking for the weak spot instead of admiring the coverage.

So what does check whether the numbers are right? I had a second checker that does the tax maths separately, on the theory that two versions are unlikely to be wrong in the same way. That helped, until I traced its imports and found it pulls the actual rate constants from the same place the engine does. The logic is independent; the numbers aren't. That means a wrong rate, as opposed to wrong logic, sails through both. The independence I was relying on was partial, and I hadn't noticed.

Follow that thread down and you reach an uncomfortable floor. Every layer of testing I have ultimately traces back to values I typed in by hand. If I fat-fingered a threshold into the engine and typed the same wrong number into the test's expected value, everything goes green and the calculator is confidently, comprehensively wrong. No quantity of tests fixes this, because they all share the same author and, in places, the same source of truth.

The one thing I didn't write

There is exactly one thing in the whole system I didn't write: my payslip. The rules, applied by my employer's payroll, produce a real number every month, and the calculator matches it to the penny. That single outside check does something none of the three thousand tests can. For that one slice of reality, it confirms the actual figures, not just their internal agreement. It is a sample size of one, and for the case it covers it is worth more than every test I wrote myself, because it is the only checker I did not create.

The experiment also corrected me in the other direction, which I think matters more than the bugs it found. I'd been convinced I was missing a test, and that I needed one proving a higher-rate change produced a different tax figure. I went to add it and discovered it couldn't work: the engine derives monthly thresholds by rounding up the annual figure divided by twelve, a Math.ceil, so a one-pound band change rounds to the same monthly threshold and produces an identical result. The test I was sure I needed was impossible because of how the calculator works, and the sensitivity I wanted was already checked somewhere else, where it actually matters. My critique of my own suite was wrong, and the only way I found that out was by trying to act on it.

While I was in there, the checking turned up something more mundane and more telling: a handful of tests were already failing on that branch. An earlier cleanup had simplified the footer and left five assertions pointing at links that no longer existed. That mattered because it wasn't a tax failure. It was a reminder that a suite is only trustworthy when it's actually green on the branch you're talking about.

I came out of this with a calculator I trust more and a test suite I respect less, which is the right way round. I no longer think of it as three thousand proofs of correctness. I think of it as a large, fast proof of consistency, a smaller and partial check on the logic, and a single real payslip doing the actual work of confirming the numbers. I know precisely what each layer can and can't tell me, which is a better place to be than a high test count and a vague sense of safety.

That's the whole lesson, and it long predates AI making code cheap to produce. A test you didn't think hard about mostly tells you the code does what the code does. The question worth asking isn't how many tests you have. It's which of them could actually catch you being wrong, and which are just agreeing with you in a louder voice.