Using `make` and `git diff` for a simple and powerful test harness

People regularly go for complex test harnesses; and when you’re in an ecosystem, they can be a good fit. But outside ecosystems, it’s actually quite easy to produce a fairly good test harness.

Tagged: Make
Published: May 22, 2020

This article is mostly targeted at developers curious about Make; using a valuable but not widely‐employed scenario, it shows what Make makes possible, and how it works.

First, a demo. Here’s what is achieved with only half a dozen lines [(Depending on how you count it.)] of makefile:

Figure 1: a terminal recording demonstrating the test harness in action.

(It’s moderately fast‐paced, so pause as needed if you want to stop and consider rather than letting it wash over you.)

If you want to play the recording without JavaScript, any software that supports the ttyrec format will do. I use termrec, which plays it with this command:
$ termplay https://chrismorgan.info/blog/make-and-git-diff-test-harness/demo.ttyrec

The makefile

Let’s jump right in with a sample makefile, typically with the filename Makefile:

Figure 2: contents of Makefile.

rwildcard = $(foreach d,$(wildcard $1*),$(call rwildcard,$d/,$2) $(filter $2,$d))

.PHONY: test
test: $(patsubst %.test,%.stdout,$(call rwildcard,,%.test))

%.stdout: %.test
	./$< > $@ 2> $(patsubst %.stdout,%.stderr,$@) \
		|| (touch --date=@0 $@; false)
	git diff --exit-code --src-prefix=expected/ --dst-prefix=actual/ \
		$@ $(patsubst %.stdout,%.stderr,$@) \
		|| (touch --date=@0 $@; false)

That’s all the code needed for a fully functioning test harness.

If you’re familiar with makefiles this may be enough for you to understand this whole post and you can stop reading. The remainder of this post explains how to use this, how it works, and the limitations of the approach.

Note that this is all for GNU Make. Other Make implementations may not support all of the foreach, wildcard, call, filter and patsubst functions, so you could need to write out the test target’s prerequisites another way. The general principle is sound, however.

Getting started with it

Make sure you have GNU Make installed and are using Git.
Paste the code above into a file called Makefile.
Create an executable file [That is, on Unixy platforms it’ll need to have mode +x, and either be an executable binary or start with a shebang.] called foo.test that contains the script of what you want to test.
Run make test (the test run will seem to succeed, since the expected output hasn’t been added to Git yet [Most things in Git ignore untracked files; and git diff … foo.stdout foo.stderr outputs nothing and reports success when those two files aren’t tracked.]).
Review the contents of foo.stdout and foo.stderr, which have just been created, to check that they match what you expected.
Run git add foo.test foo.stdout foo.stderr.
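
As a rough illustration of the third step, here is a sketch in shell of creating such a test file. The function name make_sample_test and the contents of the generated script are invented for this example; a real foo.test would run whatever you actually want to test.

```shell
#!/bin/sh
# Hypothetical helper (not part of the harness): write a minimal test
# script called foo.test into the given directory and make it executable.
make_sample_test() {
	dir=${1:-.}
	cat > "$dir/foo.test" <<'EOF'
#!/bin/sh
# Whatever is printed here becomes the expected output once committed.
echo "this line ends up in foo.stdout"
echo "this line ends up in foo.stderr" >&2
EOF
	chmod +x "$dir/foo.test"
}
```

After running make_sample_test in the directory that holds the Makefile, the remaining steps are unchanged: make test, a look at the two output files, and git add foo.test foo.stdout foo.stderr.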

This is not intrinsically tied to Git; you could replace the git diff invocation with something semantically equivalent from another version control system.
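
To give an idea of what “semantically equivalent” could mean, here is a sketch that uses plain diff and no version control at all. It assumes the expected output is kept in a directory called expected/, which is a layout invented for this example and not something the makefile above uses.

```shell
#!/bin/sh
# Hypothetical stand-in for the git diff step: compare freshly produced
# output against copies kept under expected/ (an assumed layout).
# Returns 0 when both files match and non-zero otherwise, which is the
# behaviour the recipe relies on from git diff --exit-code.
check_output() {
	name=$1
	status=0
	diff -u "expected/$name.stdout" "$name.stdout" || status=1
	diff -u "expected/$name.stderr" "$name.stderr" || status=1
	return "$status"
}
```

One difference from the Git version: a missing expected file counts as a failure here rather than a silent success, so a new test would need its expected files copied into place before its first passing run.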

How to use it

As you can hopefully see, this is simple; only half a dozen lines. Yet it’s very powerful: by using Make, you get all sorts of nice magic for free.

If tests vary on other inputs (e.g. test data or build artefacts), you can just add new prerequisites to the targets:

Figure 3: adding build prerequisites to tests. Here, all test runs depend on user-data.csv, and the “bar” test depends on bin/bar.

%.stdout: %.test user-data.csv
	…

bar.stdout: bin/bar

And you can get much fancier about prerequisites if you desire. These things can allow you to only run the tests whose inputs (whether it be data or code) have changed, rather than all the tests.

If you always want to run all the tests, which you probably want to do if you haven’t set up precise dependency tracking, mark all of the %.stdout targets phony (which will be explained below):

Figure 4: ensuring that all of the tests will be run each time you run make test.

.PHONY: test $(patsubst %.test,%.stdout,$(call rwildcard,,%.test))

If you want to run all the tests just once, run make --always-make test [The short form of --always-make is -B, but I recommend long options in most places.].

If you want to run tests concurrently, use make --jobs=8 test [Short form -j8, and this is one that I do use short form for.] or similar. (Caution: combined with --keep-going, you may get nonsense output to the terminal, with lines from multiple diffs interleaved.)

By default it’ll quit on the first test failure, but --keep-going [Short form -k.] will continue to run all the tests until it’s done as much as it can. (This is close to essential for cases when you’ve made changes that you know will change the output of many tests, so that it’ll update all of their stdout and stderr files at once.)
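
The interleaved output that comes from combining --jobs with --keep-going can be avoided on GNU Make 4.0 and later with the --output-sync option, which holds back each target’s output until that target has finished. The wrapper below is a sketch only: the function name run_tests and the default of eight jobs are assumptions, and the option is not available in the GNU Make 3.81 that macOS ships.

```shell
#!/bin/sh
# Hypothetical wrapper: run the whole suite concurrently, keep going
# after failures, and group each test's output instead of interleaving it.
# Requires GNU Make 4.0 or newer for --output-sync.
run_tests() {
	jobs=${1:-8}
	make --jobs="$jobs" --keep-going --output-sync=target test
}
```

Calling run_tests 4 would run up to four tests at a time; with no argument it uses eight.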

How it works

Let’s pull apart those half dozen lines, bit by bit, to see how it works.

Figure 5.1: line one (formatted for possible increased clarity)

rwildcard = $(foreach d, $(wildcard $1*),
	$(call rwildcard, $d/, $2) $(filter $2, $d)
)

This rwildcard function is a recursive wildcard matcher. $(wildcard) does a single level of matching; this takes it to multiple levels.

I don’t intend to explain all the details of how this works, but here’s an approximation of the algorithm it’s achieving, in pseudocode:

Figure 5.1a: Figure 5.1 liberally translated into pseudocode.

function rwildcard(directory, pattern):
	for each item as path in directory:
		if path is a directory:
			emit all rwildcard(path, pattern)
		otherwise (path being a file):
			if path matches pattern:
				emit path

I use $(call rwildcard,,%.test) instead of $(shell find . -name \*.test) mainly for compatibility, so that the find binary (GNU findutils or another variant) is not required.

Figure 5.2: lines two and three.

.PHONY: test
test: $(patsubst %.test,%.stdout,$(call rwildcard,,%.test))

The .PHONY: test rule marks test as a phony target, which is mostly not necessary, but speeds things up a very little and ensures that tests don’t break if you create a file with the name “test”. If you go marking individual tests as phony, the effect is that it won’t check the file modification times, and will just always rerun the test.

Then the rule for the actual test target. It has no recipe, which means that nothing extra happens when you run make test, after it runs all necessary tests: it is just the sum of its parts, no more.

Its parts? It depends on $(patsubst %.test,%.stdout,$(call rwildcard,,%.test)), which means “find all *.test files recursively ($(call rwildcard,,%.test)), then change each one’s ‘.test’ extension to ‘.stdout’ ($(patsubst %.test,%.stdout,…))”.
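
If it helps to see that prerequisite list outside Make, the following shell sketch prints roughly the same set of names. It is an approximation: it relies on find, which the makefile deliberately avoids, and unlike $(wildcard) it also descends into hidden directories. The function name list_test_targets is made up for this example.

```shell
#!/bin/sh
# Print the .stdout target for every *.test file under the current
# directory, one per line, without the leading "./" that find adds.
list_test_targets() {
	find . -name '*.test' \
		| sed -e 's|^\./||' -e 's|\.test$|.stdout|' \
		| LC_ALL=C sort
}
```

In a directory containing one.test and a/two.test, this prints a/two.stdout and one.stdout, which are the names the test target would depend on.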

You may wonder why it needs to depend on foo.stdout rather than foo.test: this is because foo.stdout is the target that will run the test, while depending on foo.test would only ensure the existence of the test.

You may wonder why we search for the foo.test definition files and change their extensions to .stdout, rather than just searching for the .stdout files: this is so that we can create new tests without needing to also create a .stdout file manually.

The upshot of this is that we create a phony target called “test” which depends on the result of all of the tests. It would be incorrect to say that it runs all the tests; the test target doesn’t run the tests, but rather declares “I require the tests to have been run before you run me” [Consider how these things are properly named prerequisites rather than dependencies.] (and, as discussed, running the test target itself does nothing since it has no recipe). The way to make all the tests run each time is to mark them all as phony as well; otherwise, Make will look at their prerequisite trees and may observe them already satisfied, with the stdout file newer than the test file, and so say “that test has been run and doesn’t need to be run again”.
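
The “already satisfied” decision can be modelled in a few lines of shell. This is a deliberately rough sketch of the modification time comparison only; it ignores phony targets, chains of prerequisites and everything else Make takes into account. The function name is_up_to_date is invented, and the -nt test is a widely supported extension rather than strict POSIX.

```shell
#!/bin/sh
# Rough model of Make's check for a file target: the target is up to
# date if it exists and no prerequisite is newer than it.
# Usage: is_up_to_date TARGET PREREQUISITE...
is_up_to_date() {
	target=$1
	shift
	[ -e "$target" ] || return 1
	for prereq in "$@"; do
		if [ "$prereq" -nt "$target" ]; then
			return 1
		fi
	done
	return 0
}
```

With this model, is_up_to_date foo.stdout foo.test succeeds right after a passing run and fails again as soon as foo.test is edited.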

Figure 5.3: lines four, five and six.

%.stdout: %.test
	./$< > $@ 2> $(patsubst %.stdout,%.stderr,$@) \
		|| (touch --date=@0 $@; false)
	git diff --exit-code --src-prefix=expected/ --dst-prefix=actual/ \
		$@ $(patsubst %.stdout,%.stderr,$@) \
		|| (touch --date=@0 $@; false)

Now the meat of it: how an individual test is run.

We use an implicit rule so that we needn’t enumerate all the files (which we could do in a couple of ways, but it’d be much more painful).

This rule says “any file with extension ‘.stdout’ can be created based upon the file with the same base name but the extension ‘.test’, by running these two commands”.

At last we come to the recipe, how the .stdout file is created.

$@ and $< are automatic variables:

$@ expands to the name of the target, which we’ll call foo.stdout.
$< expands to the name of the first prerequisite, which in this case will be foo.test.

The other piece of Make magic is $(patsubst %.stdout,%.stderr,$@), which will end up foo.stderr.
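
For readers more at home in shell than in Make, the same substitution on a single name can be sketched with parameter expansion. The function name stderr_name is hypothetical; like patsubst, it leaves a name that doesn’t match the pattern unchanged.

```shell
#!/bin/sh
# Mirror $(patsubst %.stdout,%.stderr,NAME) for one name:
# swap a trailing ".stdout" for ".stderr", otherwise pass it through.
stderr_name() {
	case $1 in
		*.stdout) printf '%s\n' "${1%.stdout}.stderr" ;;
		*) printf '%s\n' "$1" ;;
	esac
}
```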

(That the backslash at the end of a line is a line continuation is important too. Each line of a recipe is invoked separately. [You can opt out of this behaviour with .ONESHELL:, but that was introduced in GNU Make 3.82, and macOS includes the ancient GNU Make 3.81 for licensing reasons, so consider compatibility before using it.])

So, what is run is this:

Figure 6: what is run, for make foo.stdout.

./foo.test > foo.stdout 2> foo.stderr \
	|| (touch --date=@0 foo.stdout; false)
git diff --exit-code --src-prefix=expected/ --dst-prefix=actual/ \
	foo.stdout foo.stderr \
	|| (touch --date=@0 foo.stdout; false)

What this does:

Run ./foo.test, redirecting stdout to foo.stdout and stderr to foo.stderr.
If that fails: zero foo.stdout’s mtime [That is, update the file’s “last modified” time to the Unix epoch, 1970-01-01T00:00:00Z.] so that it’s older than foo.test (so that a subsequent invocation of Make won’t consider the foo.stdout target to be satisfied; you could also delete the file, but that would be less useful), and then fail (which will cause Make to stop executing the recipe, and report failure).
Run git diff in such a way as to print any unstaged changes to those files (that is: any way in which the test output didn’t match what was expected).
If there were any unstaged changes to those files, then zero foo.stdout’s mtime (for the same reason as before) and report failure.
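
The “zero the mtime, then fail” idiom used in both halves of the recipe can be tried in isolation. The sketch below wraps it in a function; the name run_and_mark is an assumption made for this example, and touch --date is a GNU coreutils option, just as in the makefile.

```shell
#!/bin/sh
# Hypothetical helper mirroring "|| (touch --date=@0 $@; false)":
# run a command with its stdout captured in a file; if the command
# fails, reset the file's mtime to the Unix epoch and report failure.
# Usage: run_and_mark OUTPUT_FILE COMMAND [ARGS...]
run_and_mark() {
	out=$1
	shift
	"$@" > "$out" || { touch --date=@0 "$out"; return 1; }
}
```

After a failing call such as run_and_mark foo.stdout false, foo.stdout still exists for inspection but is older than any foo.test, so the next make test will run that test again.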

Making test output deterministic

As written, we’re just doing a naïve diff, assuming that the output of running a command is the same each time.

In practice, there are often slight areas of variation, such as timestamps or time duration figures.

For example, if you have something that produces URLs with ?t=timestamp or ?hash cache‐busting, you might wish to zero the timestamps or turn the hashes into a constant value. Cleaning the output might look like this, if your test file is a shell script:

Figure 7: using sed to sanitise stochastic output.

#!/bin/sh

run () {
	…
}

run | sed 's/?t=[0-9]\{10\}/?t=0/g'

This general approach allows you to discard sources of randomness, or even to quantise it (e.g. discard milliseconds but keep seconds), but makes it hard to do anything fancier, like numeric comparisons to check that a value is within a given error margin—if you’re not careful, you start writing a test framework rather than using git diff as the test framework. [If you really want to go off the deep end here, start thinking about how Git filter attributes might be applied. But I recommend not doing so, even though it’d be possible!]
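
Here is a slightly larger sketch of such a filter, covering both discarding and quantising. The function name sanitise and the three output formats it handles (a ten‐digit ?t= timestamp, a clock time with fractional seconds, and a “took …ms” duration) are assumptions chosen for illustration; your own output will need its own expressions.

```shell
#!/bin/sh
# Hypothetical output filter for a test script.
sanitise() {
	sed \
		-e 's/?t=[0-9]\{10\}/?t=0/g' \
		-e 's/\([0-9][0-9]:[0-9][0-9]:[0-9][0-9]\)\.[0-9][0-9]*/\1/g' \
		-e 's/took [0-9][0-9]*ms/took Nms/g'
}
# First expression: zero cache-busting timestamps.
# Second expression: quantise clock times, dropping fractional seconds.
# Third expression: discard durations entirely.
```

A test script would end with something like run | sanitise, in the same way as Figure 7.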

Limitations

Here is a non‐exhaustive list of some limitations of this approach.

Declaring dependencies properly is often infeasible in current languages and environments, so you can end up needing overly‐broad prerequisites, like “this test depends on all of the source files”, and so more tests may be run than are actually needed.
Spinning up processes all over the place can be expensive. Interpreters commonly take hundreds of milliseconds to start, to say nothing of how long it can take to import code, which you now need to do once per test rather than once overall.
The filtering of output to make it deterministic effectively limits you to equality checking, and not comparisons. To a substantial extent, this is a test framework with only assertions, and little or no logic.
Make does not play well with spaces in filenames.
Git doesn’t manage file modification times in any way. This can become a nuisance once you’re committing what are essentially build artefacts, since Make makes decisions based on mtimes. So long as your dependencies are all specified properly it shouldn’t cause any damage [In the general case, this is not actually quite true; I had a case at work a few months ago where I temporarily added a build artefact to the repository, and it normally worked fine, but just occasionally the mtimes would be back to front and the build server would try to rebuild it and fail, because its network access was locked down but building that particular artefact required internet access.] if you don’t modify the stdout file with Git, but it may lead to unnecessary running of a test. When you want to play it safe, make --always-make test will be your friend.

Conclusion

When you’re in an ecosystem that provides a test harness, you should probably use it; but outside such ecosystems, the power of shell scripting, makefiles and version control can really work nicely to produce a good result with minimal effort.

There are various limitations to the approach, but for many things it works really well, and can scale up a lot. I especially like the way that it tracks the expected output, making manual inspection straightforward and updating trivial.

I’ve used something similar to this approach before, and I’ve found makefiles in general to be very effective on various matters, small and large; I think this technique demonstrates some of the nifty power of Make.

The GNU Make documentation is pretty good. It’s sometimes hard to find what you’re looking for in the index, but the information is definitely all there.