Using make and git diff for a simple and powerful test harness

People regularly go for complex test harnesses; and when you’re in an ecosystem, they can be a good fit. But outside ecosystems, it’s actually quite easy to produce a fairly good test harness.

First, a demo. Here’s what is achieved with only half a dozen lines [(Depending on how you count it.)] of makefile:

Figure 1: a terminal recording demonstrating the test harness in action.

(It’s moderately fast‐paced, so pause as needed if you want to stop and consider rather than letting it wash over you.)

⚠️There’s supposed to be a terminal recording player here, but it looks like the JavaScript is still loading, or you’ve blocked first-party JavaScript.

If you want to play the recording without JavaScript, you can play it in any software that supports the ttyrec format. I use termrec, with which this will play it:

$ termplay

The makefile

Let’s jump right in with a sample makefile, typically with the filename Makefile:

Figure 2: contents of Makefile.

rwildcard = $(foreach d,$(wildcard $1*),$(call rwildcard,$d/,$2) $(filter $2,$d))

.PHONY: test
test: $(patsubst %.test,%.stdout,$(call rwildcard,,%.test))

%.stdout: %.test
	./$< > $@ 2> $(patsubst %.stdout,%.stderr,$@) \
		|| (touch --date=@0 $@; false)
	git diff --exit-code --src-prefix=expected/ --dst-prefix=actual/ \
		$@ $(patsubst %.stdout,%.stderr,$@) \
		|| (touch --date=@0 $@; false)

That’s all the code needed for a fully functioning test harness.

If you’re familiar with makefiles this may be enough for you to understand this whole post and you can stop reading. The remainder of this post explains how to use this, how it works, and the limitations of the approach.

Note that this is all for GNU Make. Other Make implementations may not support all of the foreach, wildcard, call, filter and patsubst functions, so you could need to write out the test target’s prerequisites another way. The general principle is sound, however.

Getting started with it

  1. Make sure you have GNU Make installed and are using Git.
  2. Paste the code above into a file called Makefile.
  3. Create an executable file [That is, on Unixy platforms it’ll need to have mode +x, and either be an executable binary or start with a shebang.] called foo.test that contains the script of what you want to test.
  4. Run make test (the test run will seem to succeed, since the expected output hasn’t been added to Git yet [Most things in Git ignore untracked files; and git diff foo.stdout foo.stderr outputs nothing and reports success when those two files aren’t tracked.]).
  5. Review the contents of foo.stdout and foo.stderr which have just been created, that they match what you expected.
  6. Run git add foo.test foo.stdout foo.stderr.

This is not intrinsically tied to Git; you could replace the git diff invocation with something semantically equivalent from another version control system.

How to use it

As you can hopefully see, this is simple; only half a dozen lines. Yet it’s very powerful: by using Make, you get all sorts of nice magic for free.

If tests vary on other inputs (e.g. test data or build artefacts), you can just add new prerequisites to the targets:

Figure 3: adding build prerequisites to tests. Here, all tests runs depend on user-data.csv, and the “bar” test depends on bin/bar.

%.stdout: %.test user-data.csv

bar.stdout: bin/bar

And you can get much fancier about prerequisites if you desire. These things can allow you to only run the tests whose inputs (whether it be data or code) have changed, rather than all the tests.

If you always want to run all the tests, which you probably want to do if you haven’t set up precise dependency tracking, mark all of the %.stdout targets phony (which will be explained below):

Figure 4: ensuring that all of the tests will be run each time you run make test.

.PHONY: test $(patsubst %.test,%.stdout,$(call rwildcard,,%.test))

If you just once want to run all the tests, run make --always-make test [The short form of --always-make is -B, but I recommend long options in most places.].

If you want to run tests concurrently, use make --jobs=8 test [Short form -j8, and this is one that I do use short form for.] or similar. (Caution: combined with --keep-going, you may get nonsense output to the terminal, with lines from multiple diffs interleaved.)

By default it’ll quit on the first test failure, but --keep-going [Short form -k.] will continue to run all the tests until it’s done as much as it can. (This is close to essential for cases when you’ve made changes that you know will change the output of many tests, so that it’ll update all of their stdout and stderr files at once.)

How it works

Let’s pull apart those half dozen lines, bit by bit, to see how it works.

Figure 5.1: line one (formatted for possible increased clarity)

rwildcard = $(foreach d, $(wildcard $1*),
	$(call rwildcard, $d/, $2) $(filter $2, $d)

This rwildcard function is a recursive wildcard matcher. $(wildcard) does a single level of matching; this takes it to multiple levels.

I don’t intend to explain all the details of how this works, but here’s an approximation of the algorithm it’s achieving, in pseudocode:

Figure 5.1a: Figure 5.1 liberally translated into pseudocode.

function rwildcard(directory, pattern):
	for each item as path in directory:
		if path is a directory:
			emit all rwildcard(path, pattern)
		otherwise (path being a file):
			if path matches pattern:
				emit path

I use $(call rwildcard,,%.test) instead of $(shell find . -name \*.test) mainly for compatibility, so that the find binary (GNU findutils or another variant) is not required.

Figure 5.2: lines two and three.

.PHONY: test
test: $(patsubst %.test,%.stdout,$(call rwildcard,,%.test))

The .PHONY: test rule marks test as a phony target, which is mostly not necessary, but speeds things up a very little and ensures that tests don’t break if you create a file with the name “test”. If you go marking individual tests as phony, the effect is that it won’t check the file modification times, and will just always rerun the test.

Then the rule for the actual test target. It has no recipe, which means that nothing extra happens when you run make test, after it runs all necessary tests: it is just the sum of its parts, no more.

Its parts? It depends on $(patsubst %.test,%.stdout,$(call rwildcard,,%.test)), which means “find all *.test files recursively ($(call rwildcard,,%.test)), then change each one’s ‘.test’ extension to ‘.stdout’ ($(patsubst %.test,%.stdout,))”.

You may wonder why it needs to depend on foo.stdout rather than foo.test: this is because foo.stdout is the target that will run the test, while depending on foo.test would only ensure the existence of the test.

You may wonder why we search for the foo.test definition files and change their extensions to .stdout, rather than just searching for the .stdout files: this is so that we can create new tests without needing to also create a .stdout file manually.

The upshot of this is that we create a phony target called “test” which depends on the result of all of the tests. It would be incorrect to say that it runs all the tests; the test target doesn’t run the tests, but rather declares “I require the tests to have been run before you run me” [Consider how these things are properly named prerequisites rather than dependencies.] (and, as discussed, running the test target itself does nothing since it has no recipe). The way to make all the tests run each time is to mark them all as phony as well; otherwise, Make will look at their prerequisite trees and may observe them already satisfied, with the stdout file newer than the test file, and so say “that test has been run and doesn’t need to be run again”.

Figure 5.3: lines four, five and six.

%.stdout: %.test
	./$< > $@ 2> $(patsubst %.stdout,%.stderr,$@) \
		|| (touch --date=@0 $@; false)
	git diff --exit-code --src-prefix=expected/ --dst-prefix=actual/ \
		$@ $(patsubst %.stdout,%.stderr,$@) \
		|| (touch --date=@0 $@; false)

Now the meat of it: how an individual test is run.

We use an implicit rule so that we needn’t enumerate all the files (which we could do in a couple of ways, but it’d be much more painful).

This rule says “any file with extension ‘.stdout’ can be created based upon the file with the same base name but the extension ‘.test’, by running these two commands”.

At last we come to the recipe, how the .stdout file is created.

$@ and $< are automatic variables:

The other piece of Make magic is $(patsubst %.stdout,%.stderr,$@), which will end up foo.stderr.

(That the backslash at the end of a line is a line continuation is important too. Each line of a recipe is invoked separately. [You can opt out of this behaviour with .ONESHELL:, but that was introduced in GNU Make 3.82, and macOS includes the ancient GNU Make 3.81 for licensing reasons, so consider compatibility before using it.])

So, what is run is this:

Figure 6: what is run, for make foo.stdout.

./foo.test > foo.stdout 2> foo.stderr \
	|| (touch --date=@0 foo.stdout; false)
git diff --exit-code --src-prefix=expected/ --dst-prefix=actual/ \
	foo.stdout foo.stderr \
	|| (touch --date=@0 foo.stdout; false)

What this does:

  1. Run ./foo.test, piping stdout to foo.stdout and stderr to foo.stderr.
  2. If that fails: zero foo.stdout’s mtime [That is, update the file’s “last modified” time to the Unix epoch, 1970-01-01T00:00:00Z.] so that it’s older than foo.test (so that a subsequent invocation of Make won’t consider the foo.stdout target to be satisfied; you could also delete the file, but that would be less useful), and then fail (which will cause Make to stop executing the recipe, and report failure).
  3. Run git diff in such a way as to print any unstaged changes to those files (that is: any way in which the test output didn’t match what was expected).
  4. If there were any unstaged changes to those files, then zero foo.stdout’s mtime (for the same reason as before) and report failure.

Making test output deterministic

As written, we’re just doing a naïve diff, assuming that the output of running a command is the same each time.

In practice, there are often slight areas of variation, such as timestamps or time duration figures.

For example, if you have something that produces URLs with ?t=timestamp or ?hash cache‐busting, you might wish to zero the timestamps or turn the hashes into a constant value. Cleaning the output might look like this, if your test file is a shell script:

Figure 7: using sed to sanitise stochastic output.


run () {

run | sed 's/?t=[0-9]\{10\}/?t=0/g'

This general approach allows you to discard sources of randomness, or even to quantise it (e.g. discard milliseconds but keep seconds), but makes it hard to do anything fancier, like numeric comparisons to check that a value is within a given error margin—​if you’re not careful, you start writing a test framework rather than using git diff as the test framework. [If you really want to go off the deep end here, start thinking about how Git filter attributes might be applied. But I recommend not doing so, even though it’d be possible!]


Here is a non‐exhaustive list of some limitations of this approach.


When you’re in an ecosystem that provides a test harness, you should probably use it; but outside such ecosystems, the power of shell scripting, makefiles and version control can really work nicely to produce a good result with minimal effort.

There are various limitations to the approach, but for many things it works really well, and can scale up a lot. I especially like the way that it tracks the expected output, making manual inspection straightforward and updating trivial.

I’ve used something similar to this approach before, and I’ve found makefiles in general to be very effective on various matters, small and large; I think this technique demonstrates some of the nifty power of Make.

The GNU Make documentation is pretty good. It’s sometimes hard to find what you’re looking for in the index, but the information is definitely all there.