Dealing with flaky tests
In my work I often deal with flaky tests caused by race conditions. Here are a few things I’ve picked up along the way that can help when tracking them down.
Identify
To fix flaky tests you first have to know about them. One approach that has worked well for me is to:
- Always run all tests on each PR
- Only merge if green
- Run all tests periodically on the main branch and track the failures. Any failure on the main branch is by definition a flaky test (assuming you pin dependencies).
Reproduce
If you’re dealing with a race condition, it most likely won’t be obvious from the failure alone what’s going wrong, and you need more information. You’ll also need a way to reasonably confirm that you’ve fixed the issue. That means it helps a lot if you can reproduce it.
One clumsy way to reproduce the issue is to run the test a lot. I have a little script like this:
#!/usr/bin/env zsh
set -Eeuo pipefail
start_time=$(date +%s)
# Run the command once, then keep re-running it until it fails;
# set -e makes the script exit as soon as a run returns non-zero.
"$@"
while [ $? -eq 0 ]; do
    duration=$(($(date +%s) - $start_time))
    # Show the elapsed run time and the command in the terminal title.
    echo -n -e "\033]0;$(( $duration / 60 )) min $(( $duration % 60 )) sec: $@\007"
    "$@"
done
And I often compose it with timeout and use it like this:
timeout 60m until-error ./mvnw test [...]; echo "test run finished" | speak
speak being a wrapper around piper-tts:
#!/usr/bin/env bash
set -Eeuo pipefail
model="/usr/share/piper-voices/en/en_GB/jenny_dioco/medium/en_GB-jenny_dioco-medium.onnx"
# Read text from stdin, synthesize it and play the raw audio.
piper-tts --model "$model" --output-raw 2> /dev/null | aplay -r 22050 -f S16_LE -t raw - 2> /dev/null
Make it worse
If the test fails once every 2-3 minutes, that’s not ideal but workable. If instead you get one failure every hour or worse, the feedback loop is going to be annoyingly slow.
One strategy is to try to make things worse so the test fails more often. Some ways to do that:
- Inject errors randomly at locations where you expect something to fail - this is mostly applicable to code paths that have retry logic for when errors do happen (see the sketch after this list).
- Add sleep statements to widen the windows in which races can happen.
- Increase or decrease the number of threads if the code under test uses threading.
- Pin CPUs using taskset, for example:
taskset -c 0,1 ./mvnw test
- Throttle CPU using systemd-run:
systemd-run --user --collect --pty --same-dir -p CPUQuota=200% ./mvnw test
- Throttle disk access:
systemd-run --user --collect --pty --same-dir -p IOReadBandwidthMax=20M ./mvnw test
(You need to set up cgroup delegation for your user for the resource constraints to work.)
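To illustrate the first two ideas, here is a rough Java sketch. The class, the chaos.enabled flag and the probabilities are made up for the example; the point is just to have a test-only hook that randomly throws inside a retried code path and randomly sleeps to widen a suspected race window:
import java.util.concurrent.ThreadLocalRandom;
// Test-only hooks for making races and retries more likely.
// The chaos.enabled system property is a made-up name for this example.
public final class Chaos {
    private static final boolean ENABLED = Boolean.getBoolean("chaos.enabled");
    private Chaos() {}
    // Call inside a code path that already has retry logic: failing randomly
    // exercises the retries (and the races around them) far more often.
    public static void maybeFail(String location) {
        if (ENABLED && ThreadLocalRandom.current().nextDouble() < 0.2) {
            throw new RuntimeException("injected failure at " + location);
        }
    }
    // Call right before or after the statements involved in a suspected race
    // to widen the window in which a bad interleaving can occur.
    public static void maybeSleep() {
        if (!ENABLED) {
            return;
        }
        try {
            Thread.sleep(ThreadLocalRandom.current().nextLong(0, 50));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
Keeping the hooks behind a flag means they cost nothing in normal runs and are easy to rip out once the race is understood.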
Logging
Given that stepping through your code with a debugger will likely stop you from hitting the race conditions, you’re left with debugging via logs.
There’s not much advice here. Add as much logging as you need to get an idea of what’s happening, while avoiding drowning in too much information. This takes practice, intuition and a good understanding of the code base.
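One concrete thing that helps is making sure every line carries the thread name and a fine-grained timestamp, so you can reconstruct the interleaving afterwards. Most logging frameworks can include the thread name via their output pattern; the plain-Java helper below is just a sketch of the idea:
public final class RaceLog {
    private RaceLog() {}
    // Prefix each event with a nanosecond timestamp and the current thread name
    // so the interleaving of threads can be reconstructed from the log.
    public static void log(String event) {
        System.err.printf("%d [%s] %s%n",
                System.nanoTime(), Thread.currentThread().getName(), event);
    }
}
Calling something like RaceLog.log("acquired job lock") around the suspicious statements is usually enough to see which ordering actually happened.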
Wrap up
That’s all so far.
If you have any other strategies, I’d be happy to learn about them. I’m especially interested in stories about post-mortem debugging - ideally for Java. Maybe there’s something similar to projects like rr or coredumpy?
I’ve also recently learned of fray but haven’t used it yet.