First, a demo. Here’s what is achieved with only half a dozen lines [(Depending on how you count it.)] of makefile.
The makefile
Let’s jump right in with a sample makefile, typically with the filename `Makefile`:
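Something along these lines (a sketch: `git diff --exit-code` and `touch --date=@0` are plausible spellings of the details, GNU tools are assumed, and recipe lines must be indented with a tab):

```make
# Recursively find files under $1 matching the %-pattern $2.
rwildcard = $(foreach d,$(wildcard $1*),$(call rwildcard,$d/,$2) $(filter $2,$d))

.PHONY: test
test: $(patsubst %.test,%.stdout,$(call rwildcard,,%.test))

%.stdout: %.test
	./$< >$@ 2>$(patsubst %.stdout,%.stderr,$@) \
		|| (touch --date=@0 $@; false)
	git diff --exit-code $@ $(patsubst %.stdout,%.stderr,$@) \
		|| (touch --date=@0 $@; false)
```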
That’s all the code needed for a fully functioning test harness.
If you’re familiar with makefiles this may be enough for you to understand this whole post and you can stop reading. The remainder of this post explains how to use this, how it works, and the limitations of the approach.
Note that this is all for GNU Make. Other Make implementations may not support all of the `foreach`, `wildcard`, `call`, `filter` and `patsubst` functions, so you could need to write out the `test` target’s prerequisites another way. The general principle is sound, however.
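For instance, you could enumerate the tests by hand (the file names here are made up) and rely only on POSIX suffix substitution:

```make
TESTS = foo.test bar/baz.test

test: $(TESTS:.test=.stdout)
```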
Getting started with it
- Make sure you have GNU Make installed and are using Git.
- Paste the code above into a file called `Makefile`.
- Create an executable file [That is, on Unixy platforms it’ll need to have mode `+x`, and either be an executable binary or start with a shebang.] called `foo.test` that contains the script of what you want to test.
- Run `make test` (the test run will seem to succeed, since the expected output hasn’t been added to Git yet [Most things in Git ignore untracked files; and `git diff … foo.stdout foo.stderr` outputs nothing and reports success when those two files aren’t tracked.]).
- Review the contents of `foo.stdout` and `foo.stderr`, which have just been created, and check that they match what you expected.
- Run `git add foo.test foo.stdout foo.stderr`.
This is not intrinsically tied to Git; you could replace the `git diff` invocation with something semantically equivalent from another version control system.
How to use it
As you can hopefully see, this is simple; only half a dozen lines. Yet it’s very powerful: by using Make, you get all sorts of nice magic for free.
If tests vary on other inputs (e.g. test data or build artefacts), you can just add new prerequisites to the targets:
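For example, to say that foo’s test also reads a data file and a built binary (both names hypothetical), add a prerequisite-only rule; the recipe still comes from the pattern rule:

```make
foo.stdout: testdata/foo-input.json bin/myapp
```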
And you can get much fancier about prerequisites if you desire. These things can allow you to only run the tests whose inputs (whether it be data or code) have changed, rather than all the tests.
If you always want to run all the tests, which you probably want to do if you haven’t set up precise dependency tracking, mark all of the `%.stdout` targets phony (which will be explained below):
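One way to spell that, reusing the same file search (note: some versions of GNU Make skip implicit rule search for phony targets, so verify this against your Make before relying on it):

```make
.PHONY: test $(patsubst %.test,%.stdout,$(call rwildcard,,%.test))
```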
If you just once want to run all the tests, run `make --always-make test` [The short form of `--always-make` is `-B`, but I recommend long options in most places.].
If you want to run tests concurrently, use `make --jobs=8 test` [Short form `-j8`, and this is one that I do use short form for.] or similar. (Caution: combined with `--keep-going`, you may get nonsense output to the terminal, with lines from multiple diffs interleaved.)
By default it’ll quit on the first test failure, but `--keep-going` [Short form `-k`.] will continue to run all the tests until it’s done as much as it can. (This is close to essential for cases when you’ve made changes that you know will change the output of many tests, so that it’ll update all of their stdout and stderr files at once.)
How it works
Let’s pull apart those half dozen lines, bit by bit, to see how it works.
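First, the recursive wildcard helper, whose definition is a single line along these lines:

```make
rwildcard = $(foreach d,$(wildcard $1*),$(call rwildcard,$d/,$2) $(filter $2,$d))
```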
This `rwildcard` function is a recursive wildcard matcher. `$(wildcard)` does a single level of matching; this takes it to multiple levels.
I don’t intend to explain all the details of how this works, but here’s an approximation of the algorithm it’s achieving, in pseudocode:
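Rendered as a shell function (with the pattern as a shell glob like `*.test` instead of Make’s `%`-pattern), the algorithm is roughly:

```shell
#!/bin/sh
# rwildcard DIR PATTERN: print every file under DIR whose name matches
# PATTERN, visiting subdirectories recursively.
rwildcard() {
	for entry in "$1"/*; do
		[ -e "$entry" ] || continue        # the glob matched nothing: empty dir
		case "${entry##*/}" in
			$2) printf '%s\n' "$entry" ;;  # unquoted so $2 acts as a pattern
		esac
		if [ -d "$entry" ]; then
			rwildcard "$entry" "$2"
		fi
	done
}

# Demonstration in a scratch directory:
cd "$(mktemp -d)"
mkdir -p a/b
: > foo.test; : > a/bar.test; : > a/b/baz.test; : > a/not-a-test
rwildcard . '*.test'    # prints ./a/b/baz.test, ./a/bar.test and ./foo.test
```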
I use `$(call rwildcard,,%.test)` instead of `$(shell find . -name \*.test)` mainly for compatibility, so that the `find` binary (GNU findutils or another variant) is not required.
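Next come the phony declaration and the aggregate target:

```make
.PHONY: test
test: $(patsubst %.test,%.stdout,$(call rwildcard,,%.test))
```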
The `.PHONY: test` rule marks `test` as a phony target, which is mostly not necessary, but speeds things up a very little and ensures that tests don’t break if you create a file with the name “test”. If you go marking individual tests as phony, the effect is that it won’t check the file modification times, and will just always rerun the test.
Then the rule for the actual `test` target. It has no recipe, which means that nothing extra happens when you run `make test`, after it runs all necessary tests: it is just the sum of its parts, no more.
Its parts? It depends on `$(patsubst %.test,%.stdout,$(call rwildcard,,%.test))`, which means “find all *.test files recursively (`$(call rwildcard,,%.test)`), then change each one’s ‘.test’ extension to ‘.stdout’ (`$(patsubst %.test,%.stdout,…)`)”.
You may wonder why it needs to depend on `foo.stdout` rather than `foo.test`: this is because `foo.stdout` is the target that will run the test, while depending on `foo.test` would only ensure the existence of the test.
You may wonder why we search for the `foo.test` definition files and change their extensions to `.stdout`, rather than just searching for the `.stdout` files: this is so that we can create new tests without needing to also create a `.stdout` file manually.
The upshot of this is that we create a phony target called “test” which depends on the result of all of the tests. It would be incorrect to say that it runs all the tests; the test target doesn’t run the tests, but rather declares “I require the tests to have been run before you run me” [Consider how these things are properly named prerequisites rather than dependencies.] (and, as discussed, running the test target itself does nothing since it has no recipe). The way to make all the tests run each time is to mark them all as phony as well; otherwise, Make will look at their prerequisite trees and may observe them already satisfied, with the stdout file newer than the test file, and so say “that test has been run and doesn’t need to be run again”.
Now the meat of it: how an individual test is run.
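That’s the pattern rule (again, `--exit-code` and `--date=@0` are plausible spellings of the details):

```make
%.stdout: %.test
	./$< >$@ 2>$(patsubst %.stdout,%.stderr,$@) \
		|| (touch --date=@0 $@; false)
	git diff --exit-code $@ $(patsubst %.stdout,%.stderr,$@) \
		|| (touch --date=@0 $@; false)
```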
We use an implicit rule so that we needn’t enumerate all the files (which we could do in a couple of ways, but it’d be much more painful).
This rule says “any file with extension ‘.stdout’ can be created based upon the file with the same base name but the extension ‘.test’, by running these two commands”.
At last we come to the recipe, how the .stdout file is created.
`$@` and `$<` are automatic variables:

- `$@` expands to the name of the target, which we’ll call `foo.stdout`.
- `$<` expands to the name of the first prerequisite, which in this case will be `foo.test`.

The other piece of Make magic is `$(patsubst %.stdout,%.stderr,$@)`, which will end up `foo.stderr`.
(That the backslash at the end of a line is a line continuation is important too. Each line of a recipe is invoked separately. [You can opt out of this behaviour with `.ONESHELL:`, but that was introduced in GNU Make 3.82, and macOS includes the ancient GNU Make 3.81 for licensing reasons, so consider compatibility before using it.])
So, what is run is this:
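For a test named `foo.test`, with the automatic variables substituted, it comes out as something like this (the scratch-directory setup at the top is only scaffolding so the snippet can run standalone; `--exit-code` and `--date=@0` are assumed spellings):

```shell
#!/bin/sh
# Scaffolding so the two recipe commands below can run standalone:
cd "$(mktemp -d)" && git init --quiet
printf '#!/bin/sh\necho hello\n' > foo.test && chmod +x foo.test

# The recipe as Make would run it:
./foo.test >foo.stdout 2>foo.stderr \
	|| (touch --date=@0 foo.stdout; false)
git diff --exit-code foo.stdout foo.stderr \
	|| (touch --date=@0 foo.stdout; false)
```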
What this does:
- Run `./foo.test`, piping stdout to `foo.stdout` and stderr to `foo.stderr`.
- If that fails: zero `foo.stdout`’s mtime [That is, update the file’s “last modified” time to the Unix epoch, 1970-01-01T00:00:00Z.] so that it’s older than `foo.test` (so that a subsequent invocation of Make won’t consider the `foo.stdout` target to be satisfied; you could also delete the file, but that would be less useful), and then fail (which will cause Make to stop executing the recipe, and report failure).
- Run `git diff` in such a way as to print any unstaged changes to those files (that is: any way in which the test output didn’t match what was expected).
- If there were any unstaged changes to those files, then zero `foo.stdout`’s mtime (for the same reason as before) and report failure.
Making test output deterministic
As written, we’re just doing a naïve diff, assuming that the output of running a command is the same each time.
In practice, there are often slight areas of variation, such as timestamps or time duration figures.
For example, if you have something that produces URLs with `?t=timestamp` or `?hash` cache‐busting, you might wish to zero the timestamps or turn the hashes into a constant value. Cleaning the output might look like this, if your test file is a shell script:
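A sketch of such a test file (`printf` stands in here for whatever real command emits the cache‐busted URLs):

```shell
#!/bin/sh
# foo.test — normalise volatile parts of the output before it is captured.
# printf is a stand-in for the real command under test.
printf '%s\n' 'style.css?t=1712345678' 'app.js?0123456789abcdef0123456789abcdef' \
	| sed -e 's/?t=[0-9]\{1,\}/?t=0/' \
	      -e 's/?[0-9a-f]\{32\}/?constant/'
```

The first `sed` expression zeroes the timestamp; the second replaces a 32-hex-digit hash with a constant.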
This general approach allows you to discard sources of randomness, or even to quantise it (e.g. discard milliseconds but keep seconds), but makes it hard to do anything fancier, like numeric comparisons to check that a value is within a given error margin—if you’re not careful, you start writing a test framework rather than using `git diff` as the test framework. [If you really want to go off the deep end here, start thinking about how Git filter attributes might be applied. But I recommend not doing so, even though it’d be possible!]
Limitations
Here is a non‐exhaustive list of some limitations of this approach.
- Declaring dependencies properly is often infeasible in current languages and environments, so you can end up needing overly‐broad prerequisites, like “this test depends on all of the source files”, and so more tests may be run than are actually needed.
- Spinning up processes all over the place can be expensive. Interpreters commonly take hundreds of milliseconds to start, to say nothing of how long it can take to import code, which you now need to do once per test rather than once overall.
- The filtering of output to make it deterministic effectively limits you to equality checking, and not comparisons. To a substantial extent, this is a test framework with only assertions, and little or no logic.
- Make does not play well with spaces in filenames.
- Git doesn’t manage file modification times in any way. This can become a nuisance once you’re committing what are essentially build artefacts, since Make makes decisions based on mtimes. So long as your dependencies are all specified properly it shouldn’t cause any damage [In the general case, this is not actually quite true; I had a case at work a few months ago where I temporarily added a build artefact to the repository, and it normally worked fine, but just occasionally the mtimes would be back to front and the build server would try to rebuild it and fail, because its network access was locked down but building that particular artefact required internet access.] if you don’t modify the stdout file with Git, but it may lead to unnecessary running of a test. When you want to play it safe, `make --always-make test` will be your friend.
Conclusion
When you’re in an ecosystem that provides a test harness, you should probably use it; but outside such ecosystems, the power of shell scripting, makefiles and version control can really work nicely to produce a good result with minimal effort.
There are various limitations to the approach, but for many things it works really well, and can scale up a lot. I especially like the way that it tracks the expected output, making manual inspection straightforward and updating trivial.
I’ve used something similar to this approach before, and I’ve found makefiles in general to be very effective on various matters, small and large; I think this technique demonstrates some of the nifty power of Make.
The GNU Make documentation is pretty good. It’s sometimes hard to find what you’re looking for in the index, but the information is definitely all there.