Dealing with flaky tests
In my work I often deal with flaky tests caused by race conditions. Here are a few things I’ve picked up along the way that can help when tracking them down.
Identify
To fix flaky tests you first have to know about them. One approach that has worked well for me is to:
- Always run all tests on each PR
- Only merge if green
- Run all tests periodically on the main branch and track the failures. Any failure on the main branch is by definition a flaky test (assuming you pin dependencies).
Reproduce
If you’re dealing with a race condition, it most likely won’t be obvious from the failure alone what’s going wrong, and you need more information. You’ll also need a way to reasonably confirm that you’ve fixed the issue. That means it helps a lot if you can reproduce it.
One clumsy way to reproduce the issue is to run the test a lot. I have a little script like this:
#!/usr/bin/env zsh
set -Eeuo pipefail
start_time=$(date +%s)
# Run the command once, then keep re-running it until it fails;
# set -e makes the script exit as soon as a run returns non-zero.
"$@"
while [ $? -eq 0 ]; do
    duration=$(($(date +%s) - $start_time))
    # Show the elapsed run time and the command in the terminal title.
    echo -n -e "\033]0;$(( $duration / 60 )) min $(( $duration % 60 )) sec: $@\007"
    "$@"
done
And I often compose it with timeout and use it like this:
timeout 60m until-error ./mvnw test [...]; echo "test run finished" | speak
speak being a wrapper around piper-tts:
#!/usr/bin/env bash
set -Eeuo pipefail
model="/usr/share/piper-voices/en/en_GB/jenny_dioco/medium/en_GB-jenny_dioco-medium.onnx"
# Read text from stdin, synthesize it and play the raw audio.
piper-tts --model "$model" --output-raw 2> /dev/null | aplay -r 22050 -f S16_LE -t raw - 2> /dev/null
Make it worse
If the test fails once every 2-3 minutes, that’s not ideal but workable. If instead you get one failure every hour or worse, the feedback loop is going to be annoyingly slow.
One strategy is to try to make things worse so the test fails more often. Some ways to do that:
- Inject errors randomly at locations where you expect something to fail - this is mostly applicable to code paths that have retry logic for when errors do happen (see the sketch after this list).
- Add sleep statements to widen the windows in which races can happen.
- Increase or decrease the number of threads if the code under test uses threading.
- Pin CPUs using taskset, for example:
taskset -c 0,1 ./mvnw test
- Throttle CPU using systemd-run:
systemd-run --user --collect --pty --same-dir -p CPUQuota=200% ./mvnw test
- Throttle disk access:
systemd-run --user --collect --pty --same-dir -p IOReadBandwidthMax=20M ./mvnw test
(You need to set up cgroup delegation for your user for the resource constraints to work.)
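To illustrate the first two ideas, here is a rough Java sketch. The class, the chaos.enabled flag and the probabilities are made up for the example; the point is just to have a test-only hook that randomly throws inside a retried code path and randomly sleeps to widen a suspected race window:
import java.util.concurrent.ThreadLocalRandom;
// Test-only hooks for making races and retries more likely.
// The chaos.enabled system property is a made-up name for this example.
public final class Chaos {
    private static final boolean ENABLED = Boolean.getBoolean("chaos.enabled");
    private Chaos() {}
    // Call inside a code path that already has retry logic: failing randomly
    // exercises the retries (and the races around them) far more often.
    public static void maybeFail(String location) {
        if (ENABLED && ThreadLocalRandom.current().nextDouble() < 0.2) {
            throw new RuntimeException("injected failure at " + location);
        }
    }
    // Call right before or after the statements involved in a suspected race
    // to widen the window in which a bad interleaving can occur.
    public static void maybeSleep() {
        if (!ENABLED) {
            return;
        }
        try {
            Thread.sleep(ThreadLocalRandom.current().nextLong(0, 50));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
Keeping the hooks behind a flag means they cost nothing in normal runs and are easy to rip out once the race is understood.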
Logging
Given that stepping through your code with a debugger will likely stop you from hitting the race conditions, you’re left with debugging via logs.
There’s not much advice here. Add as much logging as you need to get an idea of what’s happening, while avoiding drowning in too much information. This takes practice, intuition and a good understanding of the code base.
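One concrete thing that helps is making sure every line carries the thread name and a fine-grained timestamp, so you can reconstruct the interleaving afterwards. Most logging frameworks can include the thread name via their output pattern; the plain-Java helper below is just a sketch of the idea:
public final class RaceLog {
    private RaceLog() {}
    // Prefix each event with a nanosecond timestamp and the current thread name
    // so the interleaving of threads can be reconstructed from the log.
    public static void log(String event) {
        System.err.printf("%d [%s] %s%n",
                System.nanoTime(), Thread.currentThread().getName(), event);
    }
}
Calling something like RaceLog.log("acquired job lock") around the suspicious statements is usually enough to see which ordering actually happened.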
Wrap up
That’s all so far.
If you have any other strategies, I’d be happy to learn about them. I’m especially interested in stories about post-mortem debugging - ideally for Java. Maybe there’s something similar to projects like rr or coredumpy?
I’ve also recently learned of fray but haven’t used it yet.