When working on a larger project, you may struggle with the problem of an increasingly growing set of tests, which over time begins to perform slower on your continuous integration (CI) server. I had this problem while working on a project in Ruby on Rails, where RSpec tests on CircleCI took about 15 minutes.
As it was bothering me, I decided to do something about it, which resulted in building an open-source Knapsack Ruby gem library (the name derives from the knapsack problem), which deals with distributing tests between parallel CI servers. In this article, you will learn about two approaches to split tests on parallel continuous integration servers - static and dynamic.
If you have in your project a test suite that takes to execute on a CI server a dozen or so minutes, or maybe even a few hours, you know how inconvenient this is for programmers. When you are working on some new feature and pushing a new git commit into the repository, you have to wait a long time for your CI server until it executes CI build.
Waiting a few minutes or an hour is delaying the feedback you can get from the CI server about tests that may have not been completed (red tests). After all, we all want to get information about whether our CI build is green or red as soon as possible so that the work of programmers is not blocked.
Problem with running parallel tests on the CI server
To speed up the execution of the CI build, you can use parallelism on the CI server, i.e. launching several parallel CI machines (CI containers, e.g. in Docker), where each parallel server will perform a part of the test set. However, there is a problem as to which tests should be run on which servers (CI nodes) so that their distribution is fairly even and you don’t have to wait for a CI node that is a bottleneck.
Below you can see an example of a non-optimal distribution of tests on 4 CI servers, where the second server marked in red is a bottleneck, so the waiting time for the completion of the entire CI build is up to 20 minutes.
Optimal distribution of tests on parallel CI servers
In an ideal scenario, the tests should be distributed in such a way that all parallel CI servers end operations at a similar time. In the following part, I will show how this can be achieved.
Below you can see an example of the optimal distribution of tests, where each parallel CI machine performs tests for 10 minutes, thanks to which the entire CI build lasts only 10 minutes, not 20 as in the previous example.
Static split of tests in a deterministic way
One way to determine how to divide tests between parallel machines on a CI server so that each server completes tests at a similar time is to use the measured runtime of the files in the test suite. This was the first approach I implemented in Knapsack Ruby gem.
After measuring the test execution time, we can assign individual test files between parallel CI servers to make sure that the CI build does not have a bottleneck.
With the help of the knapsack library, you can run tests for many test runners in Ruby, such as RSpec, Minitest, Cucumber, Spinach, and Turnip. Using test runtime, Knapsack gem can build a list of tests to be performed on a specific CI node.
I improved this way of dividing tests by measuring test files timing per git commit and branches. In the below video I show how Regular Mode a static split of tests in a deterministic way works in Knapsack Pro. In the next section, you will learn about some of the edge cases of this approach and how to solve it.
Problem with the static split of tests
While collecting information from users, I found out that the distribution of tests in a static way is not always a good solution. Sometimes some tests have a random execution time, which depends, for example, on how busy the CI server is or on the fact that the test does not pass due to a software error, quitting work faster than usual, etc.
The problem also grows depending on what CI server you use. Does each of the parallel CI machines have similar performance or does it share resources like a CPU or RAM? Does the CI container run in a shared environment? If the CI node is overloaded, then our tests may, of course, be slower.
Besides, there will be problems with whether all parallel machines start at a similar time or not. If you have purchased a pool of parallel CI servers, someone else might be using it too, e.g. another CI build from the current project or another project from your organization.
If not all CI nodes start at the same time or the boot time of certain steps in the middle of the CI node execution can take a random time, then we would like to be able to make sure that all CI machines finish their work at a similar moment. Slow CI machines or those which started work late should do fewer tests, and those machines that have started work earlier can easily do more.
All parallel CI nodes must stop working at a similar time to avoid a bottleneck, that is, overloading the machine with tests.
Dynamic tests split
The idea is simple. We have a set of tests that are queued on the Knapsack Pro server. Individual parallel CI machines consume the queue with Knapsack Pro API until the queue is over. Thanks to this, the tests are optimally distributed among CI servers, helping you to avoid a bottleneck in the form of an overloaded (too slow) CI server. Below you can see an example:
Dynamic test suite split solves our problem with random test execution time, with slow running CI servers or with servers that are overloaded which work is slower. No matter when they start or finish work - it’s important that they don’t take too many tests to execute until they finish their current work.
See how dynamic test suite split works in Queue Mode for Knapsack Pro.
Knapsack Pro has native support for many popular CI servers. It is also an agnostic CI tool, so you can use any CI server. All you have to do is configure the Knapsack Pro command for each parallel CI server running within one CI build. Below you can see a general example of how config YAML might look for a CI server with Knapsack Pro: