Isolation is Stability in Unit Tests
Unstable tests make life hard for the people who rely on them. An intermittently failing test can cost a team a great deal of effort over the long term. A test may pass 90% of the time for one person while failing far more often for another, or pull requests may be rejected at random until the tests are rerun.
The first step to fixing a problem is admitting you have one. When a test that previously passed is seen to fail, it should immediately raise concern.
Understanding why it fails
Intermittent failure can come from a few sources. Once a genuine logic error in the code is ruled out, the test can be confirmed as flaky. The question then becomes why it is flaky.
When reviewing the code again, ask these questions as you go:
- Is anything network dependent that is not being faked?
- Are there any asynchronous actions that are not being waited on?
- Could a lifecycle event be triggered that causes unintended operations to happen?
- Is the code you have actually deterministic? (This one is very tricky to spot.)
- If your test runner allows for concurrent operation, is there any global state that is shared across your app?
These starting questions help focus each pass through the code. Do not attempt to look for everything at once; there are far too many things to consider, even beyond this list.
Addressing Network Access
Retrieving data from a network is a critical task of web applications. It can be difficult to decide where exactly to handle the test data. Do you stub out the browser’s API that makes the request? Or do you have a layer that handles all requests, which you can stub out to return only the data you want?
Record and playback
In ages past, there were tools (and probably still are) that record network traffic once. Upon re-executing the tests, the saved data is swapped in as the response to each request. That way, the data from the first successful run of a test is reused until the recording is refreshed.
This methodology is brittle and leads to a bunch of needless data living in the codebase. The end result essentially says “The data in X file is what it is” rather than testing the actual functionality of the system. It falls especially short when you also want to vary the data a bit over time to try and find genuine issues.
This is one practice I strongly discourage in most cases. It can be useful in the right places, but on average it doesn’t provide much value.
The best way to address these scenarios is to fake the data at some layer of the system. I generally fake the outermost part that connects to the network, such as fetch requests. This way as much of the system as possible runs and gets tested. In some cases you may have a single layer in the system that makes requests and manipulates the incoming data. If so, mocking that layer and not stubbing browser APIs at all is valid as well.
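As a minimal sketch of faking at the outermost layer, the snippet below swaps the global fetch for a fake during one test and restores it afterward. It assumes a runtime that provides global `fetch` and `Response` (a browser or Node 18+). The names `loadUserName`, `withFakeFetch`, and `jsonFetch`, along with the URL, are hypothetical and only serve the illustration.

```typescript
type User = { id: number; name: string };

// Hypothetical code under test: it calls the global fetch directly.
async function loadUserName(id: number): Promise<string> {
  const response = await fetch(`https://example.test/api/users/${id}`);
  if (!response.ok) throw new Error(`Request failed with ${response.status}`);
  const user = (await response.json()) as User;
  return user.name.trim();
}

// Replace fetch with a fake for the duration of one callback, then restore
// the original so the stub cannot leak into other tests.
async function withFakeFetch<T>(
  fake: typeof fetch,
  run: () => Promise<T>,
): Promise<T> {
  const original = globalThis.fetch;
  globalThis.fetch = fake;
  try {
    return await run();
  } finally {
    globalThis.fetch = original;
  }
}

// Build a fake fetch that answers every request with the given JSON body.
function jsonFetch(body: unknown, status = 200): typeof fetch {
  return async () =>
    new Response(JSON.stringify(body), {
      status,
      headers: { "Content-Type": "application/json" },
    });
}
```

A test would then wrap its call, for example `await withFakeFetch(jsonFetch({ id: 1, name: " Ada " }), () => loadUserName(1))`. Everything from the request code to the parsing and trimming still runs; only the network itself is replaced.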
Fake data allows you to generate new data with every test. So long as the data stays within the acceptable limits of the system, it’s fine. In fact, it can be slightly better, since it doubles as a cheap form of tolerance testing: you see what happens when you start throwing unexpected data around. You may even observe a failure caused by characters in a string that should be valid but sometimes aren’t (regex -stare-).
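One way to generate fresh data without giving up reproducibility is to drive the generator from a seed and print that seed, so a failing run can be replayed. That seeding step is my own suggestion rather than something the approach requires. In this sketch, `seededRandom` (a small generator based on the common mulberry32 routine), `fakeName`, and the list of awkward characters are all hypothetical.

```typescript
// Small seeded pseudo-random generator. The same seed always yields the
// same sequence of numbers in the range [0, 1).
function seededRandom(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Characters that are valid in a name but tend to trip up regexes and escaping.
const AWKWARD_CHARS = ["'", '"', "\\", ".", "*", "(", "é", "ß", " "];

// Produce a name of 1..maxLength characters, mixing plain lowercase letters
// with the occasional awkward character.
function fakeName(random: () => number, maxLength = 12): string {
  const length = 1 + Math.floor(random() * maxLength);
  let out = "";
  for (let i = 0; i < length; i++) {
    out +=
      random() < 0.3
        ? AWKWARD_CHARS[Math.floor(random() * AWKWARD_CHARS.length)]
        : String.fromCharCode(97 + Math.floor(random() * 26));
  }
  return out;
}

// Pick a new seed on every run, but log it so a failure can be reproduced.
const seed = Date.now() >>> 0;
console.log(`fake data seed: ${seed}`);
const random = seededRandom(seed);
const sampleNames = Array.from({ length: 3 }, () => fakeName(random));
console.log(sampleNames);
```

Each run exercises different names, yet passing a logged seed back into `seededRandom` recreates the exact data that caused a failure.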
Determinism and Asynchronous Actions
The introduction of promises to the front-end has been a major achievement of JavaScript engines. The later introduction of async/await syntax brought a lot of confusion along with it. Getting the timing of asynchronous tasks right is always tricky. I generally look for ways to observe known completion rather than waiting for some time and hoping the work is done.
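The difference between hoping and observing can be sketched as follows. `Uploader` is a hypothetical stand-in for any asynchronous work whose duration varies; the random delay only models that variation.

```typescript
// Hypothetical unit under test: an upload that finishes after an
// unpredictable delay of up to 20 ms.
class Uploader {
  status: "idle" | "uploading" | "done" = "idle";

  start(): Promise<void> {
    this.status = "uploading";
    return new Promise<void>((resolve) => {
      setTimeout(() => {
        this.status = "done";
        resolve();
      }, Math.random() * 20);
    });
  }
}

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Fragile: guesses how long the work takes. With a 10 ms guess against a
// delay of up to 20 ms, this sometimes sees "uploading" instead of "done".
async function fragileCheck(): Promise<boolean> {
  const uploader = new Uploader();
  void uploader.start();
  await sleep(10);
  return uploader.status === "done";
}

// Stable: waits on the promise that signals completion, however long it takes.
async function stableCheck(): Promise<boolean> {
  const uploader = new Uploader();
  await uploader.start();
  return uploader.status === "done";
}
```

Raising the sleep to a "safe" value only makes the fragile version slower and rarer to fail, while the stable version takes exactly as long as the work does.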
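When it comes to awaiting, one of the biggest issues I have seen, far too often, is passing an async function to the Promise constructor. The constructor ignores whatever the executor returns, so an error thrown inside an async executor (for example by a failed await) never rejects the outer promise. Instead it surfaces as an unhandled rejection while the outer promise stays pending, and the test hangs, times out, or fails somewhere unrelated depending on how the runner treats unhandled rejections. No one reading the code can easily say what will happen each time it executes.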
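A small sketch of the async-executor pattern and a safer alternative follows. `parseLater`, `loadWithAsyncExecutor`, and `load` are hypothetical names used only for illustration.

```typescript
// Hypothetical async step that rejects when the text is not valid JSON.
function parseLater(text: string): Promise<unknown> {
  return Promise.resolve().then(() => JSON.parse(text) as unknown);
}

// Anti-pattern: the executor is async, and the constructor ignores the
// promise it returns. If parseLater rejects, resolve is never called,
// nothing rejects the outer promise, and the error is left unhandled.
function loadWithAsyncExecutor(text: string): Promise<unknown> {
  return new Promise<unknown>(async (resolve) => {
    resolve(await parseLater(text));
  });
}

// Safer: an async function already returns a promise, and a failure inside
// it rejects that promise, so the caller (and the test) can observe it.
async function load(text: string): Promise<unknown> {
  return await parseLater(text);
}
```

With valid input both versions behave the same, which is why the problem hides so well. With invalid input such as `"{"`, awaiting `load` rejects with the parse error, while the promise from `loadWithAsyncExecutor` never settles.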
It is important that all code be as deterministic as possible. That is the only way a human can sanely look at it and comprehend what the expectations are.
Running many tests at the same time is a great way to get some speed out of large test suites. There is one major pitfall if the concurrency method does not isolate every test: any global state the tests use can leak between them.
Imagine you are using local storage in some tests. Ideally, you would fake it and not even use the browser’s local storage. In a rush, some code is introduced that doesn’t fake that access. This then becomes a common pattern copied to other areas as new features are built. Eventually there are tests all around the system trying to modify the same keys.
The test suite is then found to be slow, taking 15 minutes to run a few thousand tests. In an attempt to speed things up, concurrency is added. A suite that worked perfectly well before can now fail at random: sometimes suites that conflict are run together, and other times they are not.
This can happen particularly when the concurrency comes from running multiple windows or tabs of the same browser profile. When running front-end tests concurrently, you want to ensure that each task is fully isolated from all the others.
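One way to get that isolation for storage is to give every test its own in-memory store instead of the shared browser one. The sketch below assumes the code under test can accept its store as a parameter; `KeyValueStore`, `MemoryStore`, `rememberTheme`, and `currentTheme` are hypothetical names.

```typescript
// The subset of the Web Storage API the code under test relies on. The real
// localStorage satisfies this shape, so production code can pass it in.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
  removeItem(key: string): void;
}

// In-memory fake: every instance owns its own data, so nothing is shared
// between tests, tabs, or workers.
class MemoryStore implements KeyValueStore {
  private readonly items = new Map<string, string>();

  getItem(key: string): string | null {
    return this.items.get(key) ?? null;
  }

  setItem(key: string, value: string): void {
    this.items.set(key, value);
  }

  removeItem(key: string): void {
    this.items.delete(key);
  }
}

// Hypothetical code under test, written against the interface.
function rememberTheme(store: KeyValueStore, theme: string): void {
  store.setItem("theme", theme);
}

function currentTheme(store: KeyValueStore): string {
  return store.getItem("theme") ?? "light";
}

// Two "tests" running side by side cannot see each other's writes.
const storeForTestA = new MemoryStore();
const storeForTestB = new MemoryStore();
rememberTheme(storeForTestA, "dark");
console.log(currentTheme(storeForTestA)); // "dark"
console.log(currentTheme(storeForTestB)); // "light"
```

Because each test constructs its own `MemoryStore`, the order and grouping of tests no longer matter for any state held in the store.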
Flaky tests are detrimental to the people running the tests and to whoever is paying for the resources to run them. Now you are equipped with at least some starting points to begin identifying and resolving these problems as they appear. Remember that these issues are rarely easy to spot, even when the impact area is identified. If it helps, diagram what the expectations are, then diagram the actual execution path. That exercise may help identify specific areas of concern with timing.
Now go make your test suites more reliable. Your future self and peers will thank you.