Dealing with flaky tests

  Thursday, June 26, 2025

In my work I’m often dealing with flaky tests due to race conditions. Here are a few things I’ve picked up along the way that can help to deal with them.

Identify

To fix flaky tests you first have to know about them. One approach to finding them that I’ve found to work well is to:

Reproduce

If you’re dealing with a race condition, it most likely won’t be obvious from the failure alone what’s going wrong, and you’ll need more information. You’ll also need a way to reasonably confirm that you’ve fixed the issue. That means it helps a lot if you can reproduce it.

One clumsy way to reproduce the issue is to run the test a lot. I have a little script like this:

#!/usr/bin/env zsh
set -Eeuo pipefail

# Run the given command over and over until it fails. Because of
# set -e, the script stops as soon as the command exits non-zero.
# Between runs, the elapsed time and the command are written to the
# terminal title so you can see the loop is still going.
start_time=$(date +%s)
"$@"
while [ $? -eq 0 ]; do
    duration=$(($(date +%s) - $start_time))
    echo -n -e "\033]0;$(( $duration / 60 )) min $(( $duration % 60 )) sec: $@\007"
    "$@"
done

And I often compose it with timeout and use it like this:

timeout 60m until-error ./mvnw test [...]; echo "test run finished" | speak

speak being a wrapper around piper-tts:

#!/usr/bin/env bash
set -Eeuo pipefail

# Read text on stdin and speak it: piper-tts produces raw 22050 Hz
# 16-bit audio, which aplay plays directly.
model="/usr/share/piper-voices/en/en_GB/jenny_dioco/medium/en_GB-jenny_dioco-medium.onnx"
piper-tts --model "$model" --output-raw 2> /dev/null | aplay -r 22050 -f S16_LE -t raw - 2> /dev/null

Make it worse

If the test fails once every 2-3 minutes, that’s not ideal but workable. If instead you only get a failure every hour or worse, the feedback loop becomes annoyingly slow.

One strategy is to deliberately make things worse so that the test fails more often. Some ways to do that:

# Pin the test run to two CPU cores
taskset -c 0,1 ./mvnw test

# Cap CPU time at the equivalent of two full cores
systemd-run --user --collect --pty --same-dir \
    -p CPUQuota=200% \
    ./mvnw test

# Throttle read bandwidth; the property takes a device (or a path on
# the file system it backs) plus a limit
systemd-run --user --collect --pty --same-dir \
    -p IOReadBandwidthMax="/ 20M" \
    ./mvnw test

(You need to set up user delegation for the resource limits to take effect.)
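Roughly, that means delegating the relevant cgroup controllers to your user’s systemd instance. A minimal sketch of one way to do that; the drop-in location and the controller list are assumptions, so check systemd.resource-control(5) for your setup:

# Assumed drop-in location and controller list; adjust for your distro.
sudo mkdir -p /etc/systemd/system/user@.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/user@.service.d/delegate.conf
[Service]
Delegate=cpu cpuset io memory pids
EOF
sudo systemctl daemon-reload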

In addition to those, running the test scenario in parallel in dedicated workspaces can help; git’s worktree functionality is useful for that, as sketched below.
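A minimal sketch of that, with arbitrary paths and copy count:

# Create a few detached worktrees of the current commit
# (paths and the number of copies are arbitrary).
for i in 1 2 3; do
    git worktree add --detach "../flaky-$i"
done

# Then run the retry loop in each copy, e.g. one terminal per worktree:
#   (cd ../flaky-1 && until-error ./mvnw test [...])

# Clean up afterwards
for i in 1 2 3; do
    git worktree remove "../flaky-$i"
done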

Logging

Given that stepping through your code with a debugger will most likely prevent you from hitting the race condition, you’re left with debugging via logs.

There’s not much advice here. Add as much logging as you need to get an idea of what’s happening, while avoiding drowning in too much information. This takes practice, intuition and a good understanding of the code base.
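One small complementary trick, building on the until-error loop from above: keep every run’s output in its own file, so the failing run’s log is still there once the loop stops (the file naming here is arbitrary):

# Each run writes to its own timestamped log file; when until-error
# stops, the newest file belongs to the failing run.
mkdir -p test-logs
timeout 60m until-error sh -c './mvnw test > "test-logs/run-$(date +%s).log" 2>&1'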

Wrap up

That’s all so far.

If you have any other strategies, I’d be happy to learn about them. I’m especially interested in stories about post-mortem debugging, ideally for Java. Maybe there’s something similar to projects like rr or coredumpy?

I’ve also recently learned of fray but haven’t used it yet.