Jamie's Blog

Up and Running with pytest, Hypothesis and tox

Tags: Programming

This is an article I wrote in tandem with a lunch-hour presentation for my work. If you’re in a hurry, you can get a quick overview from my slides. I’ll assume you’re familiar with unit-testing as a concept (see the slides for a bit of a refresher) and just want a quick overview of useful tools available to Python. With that said, let’s get stuck in!

[Note that this article has recently been recovered following a fire. Some formatting issues may be present].

There are two major testing frameworks in Python: unittest, which is provided in the standard library, and pytest, which isn’t [1]. unittest is based heavily on the Java JUnit test library, with a strong focus on object orientation, and despite its standard-library status, it can be criticised as “unpythonic.” In this guide we’ll look at pytest, which I’ll assert is “more pythonic,” without providing any satisfying justification for that claim.

pytest on its own is great and all, but we won’t stop there. I’ll go on to introduce Hypothesis, which brings property-based testing to Python. Property-based testing generates many test inputs automatically according to a search strategy, and you check that your function satisfies some property for all of them, making it really useful for numerical code.

With pytest and Hypothesis, we can write a range of powerful and flexible tests for our project, executing them with the Python interpreters and libraries installed in our development environment. But what if we want to support a range of Python versions, or different library configurations? This is where tox comes in, allowing you to describe your tests in a simple tox.ini file, specifying which test frameworks to run and which Python versions to use. Writing a tox file only takes a couple of minutes once you’ve got everything else set up – even better, tox eases test integration with CI/CD tools.

Setup

To follow this tutorial, you’ll need to get set up with pytest, Hypothesis and tox. They are likely to be available from your system package repository (apt-get, pacman), but you can install them with pip directly (I recommend using a virtual environment):

pip install pytest hypothesis tox

pytest

In the language of examples:

# test/test_all.py
def mul1(x):
    return x*1

def test_mul1_identity():
    assert mul1(10) == 10
$ ls
test
$ pytest
================== test session starts ==================
platform linux -- Python 3.8.1, pytest-5.3.4, py-1.8.1, pluggy-0.13.1
rootdir: /home/jamie/pytesttest
plugins: hypothesis-4.54.2
collected 1 item

test/test_all.py .                               [100%]

================== 1 passed in 0.01s ====================

It’s as simple as that! There isn’t any pytest-specific code here – just the built-in Python assert. pytest is a “No API” test framework – you can achieve most things without explicitly calling any pytest functions. First, pytest searches for .py files whose names start with test_; then, it calls every function prefixed with test_. If an assertion in one of those functions fails, pytest reports it to you as a failure.
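To see what a failure looks like, you could add a deliberately broken test (a sketch – the test name is made up). pytest’s assertion rewriting will show the values on both sides of the comparison in its report:

# test/test_all.py
def test_mul1_wrong_on_purpose():
    # This fails: the report shows both sides of the == (10 and 11)
    assert mul1(10) == 11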

“That’s all well and good for a simple example,” I hear you doubt. “But how about a more realistic project structure?” Well, we’ve started out in the right direction by making a folder called test. In fact, pytest doesn’t care what the folder is called, but for our benefit it separates our tests from our package code. Let’s make a package my_pkg, and for simplicity we’ll work in its __init__.py:

# my_pkg/__init__.py
def mul1(x):
    return x*1
# test/test_all.py
import my_pkg

def test_mul1_identity():
    assert my_pkg.mul1(10) == 10

Great! All we have to do is run pytest again, and we get… a big fat error. It can’t import my_pkg, because running pytest directly doesn’t add the directory we run it from to the Python path. The solution is simple:

$ python -m pytest
================== test session starts ==================
platform linux -- Python 3.8.1, pytest-5.3.4, py-1.8.1, pluggy-0.13.1
rootdir: /home/jamie/pytesttest
plugins: hypothesis-4.54.2
collected 1 item

test/test_all.py .                                [100%]

=================== 1 passed in 0.02s ===================

Running pytest with python -m [2] adds your current directory to the Python path, so the tests can find your module. If you have trouble at this stage, you should check all the usual things – Python and pytest versions, virtual environments, PYTHONPATH, &c.

Now we can add as many test files, modules, or even packages as we want (there are several ways to lay out a project with pytest; this is just one).
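For reference, the layout we’ve built up so far looks like this (tox.ini is the automation config we’ll add in the tox section below):

my_pkg/
    __init__.py    # package code – mul1() lives here
test/
    test_all.py    # test code – imports my_pkg
tox.ini            # added later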

Meeting test dependencies with fixtures

“That’s all well and good…” I hear you say again.

The example above works just fine for stateless [3] functions. It’s less useful if your functions have some stateful dependency, such as a singleton, database connection, or file handle; it’s also a pain if you have some test data your tests should operate on. You don’t want test cases to contain long repeated code for initialisation and teardown. Such dependencies are known as fixtures of your tests, and in pytest they work like so:

from pytest import fixture

@fixture
def my_context():
    return {"a": 7, "b": "hello", "foo": [1,2,3]}

def test_frobnicator(my_context):
    assert frobnicate(my_context) is not None
    assert my_context["a"] == 7

@fixture registers the my_context function (a name we chose) as a test fixture. Then, whenever pytest encounters a test_ function with my_context as an argument, it’s smart enough to know you want the value returned by calling my_context(). By default, the fixture function is called afresh for every test function that uses it.

Sometimes you want to avoid the cost of calling a fixture many times, in which case you can use the scope argument with one of "function", "class", "module", "package" [4] or "session":

@fixture(scope="session")
def image():
    # Only called once per pytest execution
    return imread("data/apples.png")

def test_count_apples(image):
    assert count_apples(image) == 4

To supply teardown code, you simply yield the result, and clean up after:

@fixture
def fixture_with_teardown():
    f = open("test_file")
    yield f
    f.close()

Even better than that, you can use with and yield for objects that support it:

@fixture
def csv_file():
    with open("data/test_data.csv") as f:
        yield f

Here, f is only valid within the with block. Because the fixture yields from inside that block, Python’s generator semantics keep the with block open while our test function does something with f. Then, once our test has finished, pytest resumes the csv_file generator, cleanly exiting the with and closing the file.

In fact, you can choose to yield nothing at all:

@fixture
def read_from_singleton():
    evil_singleton.instance.read_flag = True
    yield
    evil_singleton.instance.read_flag = False

This is an intentionally gross example – hopefully you won’t have to do something like this in real code – but it shows that fixtures can be used for generic teardown code, not just things that actually return a value.

You can also parameterise a fixture that should work for several different settings, using the params argument. Each parameter is tried in turn, and every test-case using the fixture runs once per parameter:

@fixture(params=["s3://test-bucket.amazonaws.com", "gs://test-project/"])
def bucket_handle(request):
    # Imaginary "BucketHandle" class; request.param is the current parameter
    return BucketHandle(request.param)

def test_load(bucket_handle):
    f = bucket_handle.load("test_file")
    assert f.read() == "test string"

You’ll often find that a particular fixture is used in several test files. In that case, you can move it to test/conftest.py, which pytest loads automatically and makes available to every test in the directory.
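For example, the csv_file fixture from earlier could move into a shared conftest.py and be used from any test file (a sketch – test_parsing.py and its header check are made up for illustration):

# test/conftest.py
from pytest import fixture

@fixture
def csv_file():
    with open("data/test_data.csv") as f:
        yield f
# test/test_parsing.py
# No import of conftest needed – pytest finds and injects the fixture
def test_has_header(csv_file):
    assert csv_file.readline().startswith("id,")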

Hypothesis

What if we wanted to run the same test many times, with different input each time? For example, what if we have a function which takes any integer, with the requirement that its result is never 0 (test_func_never_equals_zero)? We could define a fixture which returns a random integer, but pytest would still supply at most one unique input per test-case in one run. We could use the params argument and just provide a huge list of integers or a range, but we would either search inefficiently ([1,2,3...]) or have to hand-craft a search strategy for each fixture.

This is a great use-case for property-based testing. Rather than specifying the test examples manually, with property-based testing you specify the kinds of inputs you want (integers), and then the property you expect to hold for those inputs (not equal to zero). Property-based testing will feel familiar if you’ve ever used the Haskell library QuickCheck, or formal specification tools like Z or B. You can also think of it as a mix of unit-testing and fuzz testing, if you squint a bit.

In Python, property-based testing is provided by the Hypothesis library. Don’t worry, though, you don’t have to learn a whole new testing framework – Hypothesis runs as a pytest plugin, letting you write and run property tests without any extra configuration. In fact, the only difference is in how a test gets its input.
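As a minimal sketch of the never-equals-zero example from above (func is a stand-in for whatever function we want to test – it isn’t defined here):

from hypothesis import given
from hypothesis.strategies import integers

@given(integers())  # Hypothesis generates many different integers for us
def test_func_never_equals_zero(x):
    assert func(x) != 0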

Berked Example

What if we have a function, extract_float, whose job is to find a float in a string and return it:

# my_pkg.__init__
import re
def extract_float(string: str):
    if match := re.search(r"\d+(?:\.\d+)?",string):
        return float(match[0])

This code looks like an OK first attempt – we take our string, pass it to some inscrutable regex we copied off StackOverflow, and convert the result to a float [5].

Let’s write a simple pytest test case for this:

def test_extract_float():
    f = 1.2345
    s = f"My float is {f}"
    assert my_pkg.extract_float(s) == f

If you write this out and run it, it passes! Job’s a good’un. But before we crack open the champagne, let’s see how we might write the same test with Hypothesis:

from hypothesis import given
from hypothesis.strategies import floats

@given(floats())
def test_extract_float_better(test_float):
    s = f"My float is {test_float}"
    assert my_pkg.extract_float(s) == test_float

Running pytest as before, we get a load of output – extracting the relevant parts, we see:

------------------------- Hypothesis --------------------------
Falsifying example: test_extract_float_better(
    test_float=-1.0,
)

Merde. Well, that’s fine, we just need to add the possibility of negative floats to our regex: r"-?\d+(?:\.\d+)?". Notice that Hypothesis has given us a very nice number to work with, -1; it could have chosen any negative number, but when it finds a falsifying example, it then tries to shrink it to a simpler one, narrowing down the cause of the failure and making our lives a little easier [6].
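With that change, extract_float becomes:

# my_pkg.__init__
import re
def extract_float(string: str):
    # "-?" allows an optional leading minus sign
    if match := re.search(r"-?\d+(?:\.\d+)?", string):
        return float(match[0])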

With the regex fixed, we run the test again and find:

------------------------------- Hypothesis -------------------------------
Falsifying example: test_extract_float_better(
    test_float=1e+16,
)

Oh, yeah – that’s a valid float too, but its default string form is scientific notation, which our regex can’t parse. Now we can decide whether our function should handle that case – let’s instead format the float in fixed-point notation ({:f}), so the test only generates strings our function is meant to parse. While we’re at it, let’s add our -1 example to ensure it always gets tested in future; this way, we reduce the risk of regressions when refactoring:

from hypothesis import given, example
from hypothesis.strategies import floats

@given(floats())
@example(-1.) # test_float = -1. will always be tested
def test_extract_float_better(test_float):
    s = f"My float is {test_float:f}"
    assert my_pkg.extract_float(s) == test_float

Hypothesis responds with:

Falsifying example: test_extract_float_better(
    test_float=0.5078125,
)

It’s kind of like a know-it-all kid that keeps poking holes in everything you say and makes you want to strangle it. It’s great. You should use it.

We fix that [7], and then it complains about inf, -inf, and nan. We don’t want our parser function to handle those cases – to stop Hypothesis nagging about them, we can change the call to the floats strategy:

@given(floats(allow_nan=False, allow_infinity=False))

With all that in place – our test passes!

It may seem that this was a lot of unnecessary hassle for a fairly straightforward function – our test code is longer than the function we’re testing! On the other hand, using Hypothesis made us confront our assumptions about which strings map to a valid float, as well as poking holes in the test we wrote to verify it. To fix it, we had to make design decisions about our function we hadn’t even thought of, and specify the valid range of inputs more precisely. All of this should increase our confidence that our code does what we think it does, and that our test actually tests that behaviour.

More complicated examples

If only every useful test took nothing but floats. Luckily for us, Hypothesis provides a wealth of more specific and complicated strategies, including a NumPy strategy called arrays.
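For example, here’s a sketch of an arrays-based test (it needs NumPy installed; the transpose property is just an easy one to check):

import numpy as np
from hypothesis import given
from hypothesis.extra.numpy import arrays
from hypothesis.strategies import floats

@given(arrays(np.float64, (3, 3),
              elements=floats(allow_nan=False, allow_infinity=False)))
def test_transpose_twice_is_identity(a):
    assert np.array_equal(a.T.T, a)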

Lists

First, let’s extend our extract_float function to make extract_floats:

# my_pkg.__init__
def extract_floats(s: str):
    return [float(m) for m in re.findall(r"-?\d+(?:\.\d+)?",s)]

To test this, we use the lists strategy, which takes another strategy as its first argument. We’ll set the max_size to 100, or our parser might prove a bit slow.

import math

import my_pkg
from hypothesis import given
from hypothesis.strategies import floats, lists

valid_floats = floats(allow_nan=False, allow_infinity=False)

@given(lists(valid_floats, max_size=100))
def test_extract_floats(test_floats):
    as_strings = map("{:f}".format, test_floats)
    input_string = "My test " + ",".join(as_strings)

    output = my_pkg.extract_floats(input_string)

    diffs = (math.fabs(x-y) for x,y in zip(output, test_floats))
    assert len(output) == len(test_floats) and all((d < 0.00001 for d in diffs))

The actual example generation is a bit uglier for this particular problem, but the principle hasn’t changed. We can now be fairly sure that, at least in simple strings, our function is working as we expect.

Foo’d for thought: builds and composites

Floats, lists of floats, strings, lists of dicts of strings and floats – all good. But what about purpose-built objects? How can we sample the state space of our own Foo(int, string) class?

The first way is to use builds:

from hypothesis import given
from hypothesis.strategies import builds, integers, text

foos = builds(Foo, integers(), text())

@given(foos)
def test_foo_property(foo): pass

In fact, if we use type annotations in our Foo definition, Foo(a: int, b: str), builds is smart enough to work out those strategies on its own from just builds(Foo).
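A sketch of that, with a minimal hand-rolled Foo (the property being tested is just a placeholder):

from hypothesis import given
from hypothesis.strategies import builds

class Foo:
    def __init__(self, a: int, b: str):
        self.a, self.b = a, b

foos = builds(Foo)  # strategies inferred from the annotations on __init__

@given(foos)
def test_foo_types(foo):
    assert isinstance(foo.a, int) and isinstance(foo.b, str)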

This works great if the arguments are independent, but falls apart when one constrains another. For example, what if the integer argument gives the maximum length of the string? In those cases, the most obvious way is to use a composite strategy. This sounds scary, but it isn’t:

from hypothesis import given
from hypothesis.strategies import composite, integers, text

@composite
def foos(draw):
    i = draw(integers(min_value=0))  # i is just an int (non-negative, so it can act as a length)
    s = draw(text(max_size=i))       # s is just a str with len(s) <= i
    return Foo(i, s)

@given(foos())  # note the call – @composite turns foos into a strategy factory
def test_foo_property(foo): pass

The magical draw parameter is passed to all @composite functions, and allows you to take a single example from a strategy. That lets you draw from several strategies in a row, building up dependencies between them. A real-world example from my own code was a Rectangle class specified by opposite corners; \(x_1\) and \(x_2\) are both integers, but we also know that \(x_2 > x_1\). In that case, we could also use hypothesis.assume(rect.x2 > rect.x1) directly in our test-case, but that leads Hypothesis to generate examples it then throws away, so composite is a better fit.
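A sketch of that Rectangle strategy might look like this (only the x coordinates are shown, and the Rectangle class is assumed to be defined elsewhere):

from hypothesis import given
from hypothesis.strategies import composite, integers

@composite
def rectangles(draw):
    x1 = draw(integers())
    x2 = draw(integers(min_value=x1 + 1))  # guarantees x2 > x1 by construction
    return Rectangle(x1, x2)

@given(rectangles())
def test_rectangle_has_positive_width(rect):
    assert rect.x2 - rect.x1 > 0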

tox

Ok: maybe you kind of understand everything above, or maybe you found it really obvious. In any case, you’d be forgiven for feeling like your brain’s going to fall out if you have to learn one more thing about testing today. Luckily, tox isn’t about testing – it’s about test automation. And it’s really easy to set up:

# tox.ini
[tox]
envlist = py27,py36,py38

[testenv]
deps =
    pytest
    hypothesis
commands = python -m pytest
# setup.py
from setuptools import setup, find_packages
setup(
    name="my_package",
    version="0.1",
    packages=find_packages(),
)

and run:

$ tox

On my computer, this builds each environment in turn, spews out a load of errors, and ends with a summary:

ERROR:   py27: commands failed
ERROR:  py36: InterpreterNotFound: python3.6
  py38: commands succeeded

If we look at the errors for py27, we see that we’ve used a fair amount of Python 3-specific syntax (including 3.8’s walrus operator). I don’t have a 3.6 interpreter, so tox has failed that environment by default. You must have the interpreters you want installed and on your PATH for tox to find them – alternatively, set skip_missing_interpreters = true in your [tox] config block.
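That setting slots into the same [tox] block we wrote earlier:

# tox.ini
[tox]
envlist = py27,py36,py38
skip_missing_interpreters = true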

tox achieves three things: it runs your tests in an isolated environment built according to your setup.py [8], which tells you if you have unspecified dependencies; it lets you test multiple Python versions and runtime environments, ensuring that your code runs in the environments you need it to; and it makes running your tests as simple as running the tox command, for other users or automated CI systems.

The basic tox.ini above will suit a wide range of Python projects, but more complicated configurations are possible. The full set of configuration options, along with many examples, is covered in the official tox documentation.

And that’s a wrap!


  1. There’s also nose2, which extends unittest, but I’ve never looked into it.

  2. python -m runs pytest as an executable module.

  3. Jargon: referentially transparent.

  4. "package" is experimental.

  5. If no match is found, this function will return None.

  6. In fact, the curious can see all the examples generated in a Hypothesis run by plonking a print statement in our test-case.

  7. Left as an exercise to the reader.

  8. tox also works with Poetry, which is superior to setup.py in almost every way.