Dealing with flaky tests

  Thursday, June 26, 2025

In my work I’m often dealing with flaky tests due to race conditions. Here are a few things I’ve picked up along the way that can help deal with them.

Identify

To fix flaky tests you first have to know about them. One approach to finding them that I’ve found to work well is to:

Reproduce

If you’re dealing with a race condition, it most likely won’t be obvious from the failure alone what’s going wrong, and you’ll need more information. You’ll also need a way to reasonably confirm that you’ve fixed the issue. That means it helps a lot if you can reproduce it.

One clumsy way to reproduce the issue is to run the test a lot. I have a little script like this:

#!/usr/bin/env zsh
set -Eeuo pipefail

# Run the given command over and over until it fails.
# With `set -e` the script exits as soon as the command returns non-zero.
start_time=$(date +%s)
"$@"
while [ $? -eq 0 ]; do
    duration=$(($(date +%s) - $start_time))
    # show the elapsed time and the command in the terminal title
    echo -n -e "\033]0;$(( $duration / 60 )) min $(( $duration % 60 )) sec: $@\007"
    "$@"
done

And I often compose it with timeout and use it like this:

timeout 60m until-error ./mvnw test [...]; echo "test run finished" | speak

speak being a wrapper around piper-tts:

#!/usr/bin/env bash
set -Eeuo pipefail

# Read text from stdin and speak it via piper-tts
model="/usr/share/piper-voices/en/en_GB/jenny_dioco/medium/en_GB-jenny_dioco-medium.onnx"
piper-tts --model "$model" --output-raw 2> /dev/null | aplay -r 22050 -f S16_LE -t raw - 2> /dev/null

Make it worse

If the test fails once every 2-3 minutes, that’s not ideal but workable. If instead you get one failure every hour or worse, the feedback loop is going to be annoyingly slow.

One strategy is to deliberately make things worse so that the test fails more often. Some ways to do that:

# pin the run to two CPU cores
taskset -c 0,1 ./mvnw test

# cap the CPU time to the equivalent of two cores
systemd-run --user --collect --pty --same-dir \
    -p CPUQuota=200% \
    ./mvnw test

# throttle disk read bandwidth
systemd-run --user --collect --pty --same-dir \
    -p IOReadBandwidthMax=20M \
    ./mvnw test

(You need to set up user delegation for the resource constraints to work.)
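
Setting that up is, as far as I know, a matter of telling the user service manager which cgroup controllers it may delegate. A minimal sketch, assuming systemd with cgroup v2 (you may need to log out and back in for it to take effect):

# drop-in that lets the user manager delegate the cpu and io controllers
sudo mkdir -p /etc/systemd/system/user@.service.d
sudo tee /etc/systemd/system/user@.service.d/delegate.conf <<'EOF'
[Service]
Delegate=cpu cpuset io memory pids
EOF
sudo systemctl daemon-reload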

In addition to those, running the test scenario in parallel in dedicated workspaces can help. The git worktree functionality is useful for that.
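
A rough sketch of what that can look like, with made-up worktree paths:

# create two additional checkouts of the same repository
git worktree add ../myproject-flaky-1
git worktree add ../myproject-flaky-2

# run the reproduction loop in both of them in parallel
(cd ../myproject-flaky-1 && until-error ./mvnw test [...]) &
(cd ../myproject-flaky-2 && until-error ./mvnw test [...]) &
wait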

Logging

Given that stepping through your code with a debugger will likely stop you from hitting the race condition, you’re left with debugging via logs.

There’s not much general advice to give here. Add as much logging as you need to get an idea of what’s happening, while avoiding drowning in too much information. This takes practice, intuition, and a good understanding of the code base.
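
One thing that pairs well with the until-error loop is to keep the full output of every run, so the log of the run that eventually fails isn’t lost. A sketch, assuming zsh and the until-error script from above:

# each run writes to its own log file; the newest file holds the failing run
until-error zsh -o pipefail -c './mvnw test [...] |& tee "run-$(date +%s).log"'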

Debugger

A scenario where a debugger might be useful is if the failure case involves a unique code path that’s otherwise not hit. In that case you can set a breakpoint on it and know it will only be hit once the race happens.

You’ll likely still need a way to rerun the test case many times. Some test frameworks offer tools for that; randomizedtesting for example includes a @Repeat annotation. Another option is to combine the until-error script from above with a test framework’s ability to drop into a debugger.

For example, Maven via the Surefire plugin can use the JDWP agent to let you debug tests. Usually that’s used like this:

mvn -Dmaven.surefire.debug test

This causes it to pause before running the tests, awaiting a remote debugger on port 5005.
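
For example, you could attach manually with jdb, the command line debugger that ships with the JDK (on Linux the default transport is a socket, so the port alone is enough):

jdb -attach 5005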

One issue here is that you’d need to manually attach on each retry if you combine this with until-error. But there’s a way around that: the Java debugging agent has a launch= option to execute an arbitrary command once it’s ready and waiting for a remote debugger (docs).

You can use it like this:

until-error mvn test \
    -Dmaven.surefire.debug='-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=localhost:5005,launch=mvn-attach'

In this example, mvn-attach is a small Lua script:

#!/usr/bin/env -S nvim -l
-- vim: ft=lua

local parent = assert(os.getenv("NVIM"), "mvn-attach only works if $NVIM is set")
local conn = vim.fn.sockconnect("pipe", parent, { rpc = true })
vim.fn.rpcrequest(conn, "nvim_exec_lua", [[require("dap").run(...)]], {{
  name = "mvn-attach",
  type = "java",
  request = "attach",
  hostName = "127.0.0.1",
  port = 5005,
  cwd = "${workspaceFolder}/app",
  projectName = "crate-app",
  timeout = 60000,
}})

It communicates with a Neovim instance to start a debug session. I wrote about this approach in No-Config Python debugging using Neovim.

Wrap up

That’s all so far.

If you have any other strategies I’d be happy to learn about them. I’m especially interested in stories about post-mortem debugging, ideally for Java. Maybe there’s something similar to projects like rr or coredumpy?

I’ve also recently learned of fray but haven’t used it yet.