
How do you measure good software tests?

I don't know


Automated testing is a modern software engineering practice with a lot of benefits, but as with everything wholesome and good, it attracts a lot of cargo cult practitioners as well.

Story time! Shortly after starting my first job, eight years ago, I joined a project as a contractor. One of my first assignments on that project was to take an untested class that a senior contractor from a different company had written and add unit tests that hit 100% branch coverage.

My approach was basically to assume that the code was correct, and write tautological test cases; rather than asserting the code was correct, I asserted the code returned what it did, while making sure each branch was hit.

I didn’t understand the code being tested whatsoever; most methods were probably four or five levels deep in nested if and for blocks. I merely had to construct inputs that hit all the branches and then assert that the output was in fact the output.
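To make that concrete, here is a minimal sketch of the kind of test I was writing. The apply_discount function is a hypothetical stand-in (I don’t remember the real class, and the project wasn’t in Python), but the shape of the test is faithful: construct an input that reaches a branch, run the code, and assert whatever comes back.

    # Hypothetical stand-in for the deeply nested, untested code I was covering.
    def apply_discount(price: float, customer_type: str) -> float:
        if customer_type == "premium":
            if price > 50:
                return price * 0.75
            return price * 0.9
        return price

    # A tautological test: the input exists only to reach the
    # "premium, price > 50" branch, and the expected value was copied from
    # whatever the function actually returned, not derived from any requirement.
    def test_apply_discount_premium_branch():
        result = apply_discount(price=100.0, customer_type="premium")
        assert result == 75.0  # i.e. the code returns what it returns

A test like this can’t tell correct behavior from a bug that was already there; all it can do is object when the implementation changes.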

I’ve told this story many times; when I first told it, the point was about how I was made to do pointless busy work, but with experience I realized that what really happened was that I got paid to make the project worse.

Bad tests are worse than no tests. They take time to write, make refactoring and adding features harder, and some developer in the future will need to spend time and energy reading them and probably deleting or replacing them. Good tests subtract from technical debt, but bad tests add to it.

The tests that I added didn’t help ensure the correctness of the system under test; they merely added mass to the existing code base. They didn’t change the behavior of the system or help describe the behavior of the system; they simply made changing the behavior of the system harder.

That project had another funny testing quirk as well: one portion of the codebase leveraged generated code, and one entrepreneurial junior developer recognized that this generated code was untested and that this brought down his coverage numbers. Rather than (or perhaps in addition to; it’s been a while) testing the code generation step, he ensured that the code generator also generated unit tests! Win!

It’s kind of a funny solution to a problem that shouldn’t actually exist.

The above examples may be a bit unusual (although I assure you they are true), but the point is that you have to be careful about the metrics you use and how you use them. Maybe coverage is an interesting metric, but once it was used as a benchmark, it lost its actual significance.

Any superficial metric can be gamed if it is set as a target: lines of code, hours spent, etc. I’m not saying they’re not interesting; just don’t put too much weight on them.

How do you define testing success?

There’s a common saying that “you can’t manage what you don’t measure,” but I am not the first person to observe that this is a bit of an anti-proverb. Ultimately, we engineers have to take accountability for our own coding standards and processes.

I suspect that if you’re looking for metrics for automated testing, you’re probably interested in two things: avoiding shipping bugs, and delivering code quickly.

The most essential measurements are how many bugs you ship, how much time you spend fixing bugs, and how many features you ship.

The number of tests and coverage are certainly useful feedback for reviewers. As a reviewer, seeing what test cases were added describes intent, but only if the tests are readable. Seeing that coverage went up (or stayed up) with a new test gives me confidence that edge cases were handled.

Without reading the code, if I see a lot of tests and high coverage, I can be assured that a lot of effort went into testing, but that’s it. I simply don’t know how confident I can be that a commit won’t break the next release without actually reading the tests and understanding the system.

Unfortunately, you still need humans in the loop to hold tests accountable: to ensure that they enforce their module’s public interface and are readable and maintainable.

