Continual testing of mainline kernels

It is not widely known that the SUSE Performance team runs continual testing of mainline kernels and collects data on machines that would otherwise be idle. Testing is a potential topic for Kernel Summit 2015, so now seems like a good time to introduce Marvin. Marvin is a system that continually runs performance-related tests and is named after another robot doomed to repetitive tasks. When tests are complete, it generates a performance comparison report that is publicly available but rarely linked. The primary responsibility of this system is to check SUSE Linux Enterprise kernels for performance regressions, but it is also configured to run tests against mainline releases. There are four primary components of Marvin that are of interest.

The first component is the test client, which is a copy of MMTests. The use of MMTests ensures that the tests can be independently replicated and the methodology examined. The second component is Bob, a builder that monitors git trees for new kernels to test, builds a kernel when it's released and schedules it to be tested. In practice this monitors the SLE kernel tree continually and checks the mainline git tree once a month for new releases. Bob only builds and queues released kernels and ignores -rc kernels in mainline. The reason for this is simple: time. The full battery of tests can take up to a month to complete in some cases and it's impractical to do that on every -rc release. There are times when a small subset of tests will be checked for a pre-release kernel, but only when someone on the performance team is checking a specific series of patches and it's urgent to get the results quickly. When tests complete, it's Bob that generates the report. The third component is Marvin itself, which runs on the server; one instance exists per test machine. It checks the queue, prepares the test machine and executes tests when the machine is ready. The final component is a configuration manager that is responsible for reserving machines for exclusive use, managing power, managing serial consoles and deploying distributions automatically. The inventory management does not have a specific name as it's different depending on where Marvin is set up.
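As a rough illustration of the builder's job, the sketch below shows the kind of polling loop Bob might use to spot newly released kernels while skipping -rc tags. It is only a guess at the logic; the paths, the queue representation and the tag-matching policy are invented for the example and are not Bob's actual code.

    # Hypothetical sketch of a Bob-like polling loop: list release tags in a
    # kernel git tree, skip -rc kernels, and queue anything not yet tested.
    # Paths and the queue representation are assumptions for illustration only.
    import re
    import subprocess

    RELEASE_TAG = re.compile(r"^v\d+\.\d+(\.\d+)*$")   # v3.12, v4.1.6; excludes v4.2-rc3

    def released_tags(git_dir):
        """Return released kernel tags in version order, ignoring -rc tags."""
        out = subprocess.check_output(
            ["git", "--git-dir", git_dir, "tag", "--list",
             "--sort=version:refname", "v*"],
            text=True)
        return [tag for tag in out.splitlines() if RELEASE_TAG.match(tag)]

    def queue_new_kernels(git_dir, already_tested, queue):
        """Queue any released kernel that has not been built and tested yet."""
        for tag in released_tags(git_dir):
            if tag not in already_tested:
                queue.append(tag)          # a real builder would build the kernel here
                already_tested.add(tag)

    if __name__ == "__main__":
        tested, queue = set(), []
        queue_new_kernels("/srv/git/linux.git", tested, queue)   # hypothetical mirror path
        print("kernels to test:", queue)

In practice such a loop would only need to run once a month against mainline and continually against the SLE tree, matching the schedule described above.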

There are two installations of Marvin: one that runs in my house and a second that runs within SUSE, and they have slightly different configurations. Technically Marvin supports testing on different distributions but only openSUSE and SLE are deployed. SLE kernels are tested on the corresponding SLE distribution. The Marvin instance in my house tests kernels 3.0 up to 3.12 on openSUSE 13.1 and then kernels 3.12 up to current mainline on openSUSE 13.2. In the SUSE instance, SLE 11 SP3 is used as the distribution for testing kernels 3.0 up to 3.12 and openSUSE 13.2 is used for 3.12 and later kernels. The kernel configuration used corresponds to the distribution. The raw results are not publicly available, but the reports are generated on private servers and mirrored once a week to the following locations:

Dashboard for kernels 3.0 to 3.12 on openSUSE 13.1 running on home machines

Dashboard for kernels 3.12 to current on openSUSE 13.2 running on home machines

Dashboard for kernels 3.0 to 3.12 on SLE 11 SP3 running on SUSE machines

Dashboard for kernels 3.12 to current on openSUSE 13.2 running on SUSE machines

The dashboard is intended to be a very high-level view detailing whether or not there are regressions in comparison to a baseline. For example, in the first report linked above, the baseline is always going to be a 3.0-based kernel. It needs a human to establish if a regression is real or if it's an acceptable trade-off. The top section makes a guess as to where the biggest regressions might be, but it's not perfect, so double-check. Each test that was conducted is then listed. The name of the test corresponds to an MMTests configuration file in configs/, with extensions naming the filesystem used if that is applicable. The remaining columns are machines, each with a number representing a performance delta. 1 means there is no difference, 1.02 would mean there is a 2% difference, and the colour indicates whether it is a performance regression or gain. Green is good, red is bad, gray or white is neutral. The report automatically guesses whether a result is significant, which is why a 0.98 might be a 2% performance regression in one test (red) and in the noise for another.
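To make the convention concrete, here is a small sketch of how such a delta and colour might be derived for one test on one machine. The two-sigma noise threshold and the inputs are assumptions chosen for illustration; they are not the dashboard's actual significance test.

    # Illustration of the ratio-and-colour convention described above. The
    # noise estimate and 2-sigma threshold are assumptions, not Marvin's logic.
    def classify_delta(baseline, result, noise_stddev, higher_is_better=True):
        """Return (ratio, colour) for one test on one machine."""
        ratio = result / baseline                   # 1.00 = no change, 1.02 = 2% difference
        improvement = (ratio > 1.0) == higher_is_better
        significant = abs(result - baseline) > 2 * noise_stddev
        if not significant:
            return round(ratio, 2), "gray"          # within the noise, neutral
        return round(ratio, 2), "green" if improvement else "red"

    # The same 0.98 can be a real 2% regression on a quiet machine and
    # mere noise on a noisy one, hence different colours for the same number.
    print(classify_delta(100.0, 98.0, noise_stddev=0.5))   # (0.98, 'red')
    print(classify_delta(100.0, 98.0, noise_stddev=3.0))   # (0.98, 'gray')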

It is important to note that the dashboard figure is a very rough estimate that often collapses multiple values into a single number. There is no substitute for reading the detailed report and making an assessment. It is also important to note that Marvin is not up to date and some machines have not started testing 4.1. It is known that the reports are very ugly, but making them prettier has yet to climb up the list of priorities. Where possible, we prefer to pick a regression and do something about it rather than make HTML pages look pretty.

The obvious question is what has been done with this data. When Marvin was first assembled, the intent was to identify and fix regressions between 2.6.32 (yes, really) and 3.12. This is one of the reasons why 3.12-stable contains so many performance-related fixes. When a regression is found there is generally one of three outcomes. The first, obviously, is that it gets fixed. The second is that it is identified as an apparent, but not real, regression. Usually this means the old kernel was buggy in a manner that happened to benefit a particular benchmark. Tiobench is an excellent example. On old kernels there was a bug that preserved old pages and reclaimed new pages in certain circumstances. For most workloads this is terrible, but for tiobench it meant that parts of the file were cached and the IO appeared to complete faster, even though it was a lie. The third possible outcome is that it's slower, but it's a trade-off to win somewhere else and the trade-off is acceptable. Some scheduler regressions fall under this heading, where a context-switch micro-benchmark might be hurt because the scheduler is making an intelligent placement decision.

The focus on 3.12 is also why Marvin is not widely advertised within the community. It is rare that mainline developers are concerned with performance in -stable kernels unless the most recent kernel is also discussed. In some cases the most recent kernel may have the same regression, but it is common to discover there is simply a different mix of problems in a recent kernel. Each problem must be identified and addressed in turn, and time is spent on that instead of adding volume to LKML. Advertising the existence of Marvin was also postponed because some of the tests or reporting were buggy and I wanted to fix each problem first. Very few tests are known to be problematic now, but it takes a surprising amount of time to address all the problems that crop up when running tests across large numbers of machines. There are still issues lurking in there, but if a particular issue is important to you then let me know and I'll see if it can be examined faster.

An obvious question is how this compares to other performance-based automated testing such as Intel's 0-day kernel test infrastructure. The answer is that they are complementary. The 0-day infrastructure tests every commit to quickly identify both performance gains and regressions. Its tests are short-lived by necessity and are invaluable at quickly catching some classes of problems. The tests run by Marvin are much longer-lived and there is only overlap in a small number of places. The two systems are simply looking for different problems. While I was tempted in 2012 to try integrating parts of what became Marvin with 0-day, ultimately it was unnecessary and there is value in both. The other system worth looking at is the Phoronix Test Suite and the results reported with it. In that case, it's relatively rare that the data needed to debug a problem is included in the reports, which complicates matters. In the few cases I examined in detail, I had problems with the testing methodology. As MMTests already supported a large amount of what I was looking for, there was no benefit to discarding it, starting again with Phoronix and addressing any perceived problems there. Finally, the site that reports the results places a frequent emphasis on graphics performance or the relative performance between different hardware configurations. It is relatively rare that this is the type of comparison my team is interested in.

The next obvious question is how recent releases are performing. At this time I do not want to make a general statement as I have not examined all the data in sufficient detail and am currently developing a series aimed at one of the problems. When I work on mainline patches, it's usually with reference to a problem I picked out after browsing through reports, targeting a particular subsystem area or responding to a bug report. I'm not applying a systematic process to identify all regressions at this point, and it's still a manual process to determine if a reported regression is real, apparent or a trade-off. When a real regression is found, Marvin can optionally conduct an automated bisection, but that process is usually "invisible" and is only reported indirectly in a changelog if the regression gets fixed.
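For what it's worth, an automated bisection of this kind can be driven almost entirely by git. The sketch below is not Marvin's implementation; it only shows the general shape, assuming a hypothetical test script that builds, boots and benchmarks a commit and exits non-zero when the regression is present.

    # General shape of an automated bisection using 'git bisect run'. The repo
    # path, tags and test script are hypothetical; this is not Marvin's code.
    import subprocess

    def bisect_regression(repo, good, bad, test_script):
        """Find the first bad commit between a known-good and known-bad kernel."""
        def git(*args):
            return subprocess.run(["git", "-C", repo, *args],
                                  check=True, capture_output=True, text=True).stdout
        git("bisect", "start", bad, good)
        output = git("bisect", "run", test_script)   # git re-runs the script per step
        git("bisect", "reset")
        return [line for line in output.splitlines()
                if "is the first bad commit" in line]

    # Hypothetical usage: bisect a regression somewhere between v3.12 and v4.1.
    # print(bisect_regression("/srv/git/linux", "v3.12", "v4.1",
    #                         "/srv/marvin/test-kernel.sh"))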

So what's next? The first item is that more attention is going to be paid to recent kernels, checking whether regressions were introduced since 3.12 that need addressing. The second is identifying any bottlenecks that exist in mainline that are not regressions but still should be addressed. The last, of course, is coverage. The first generation of Marvin focused on some common workloads and for a long time it was very useful. The number of problems it is finding is now declining, so other workloads will be added over time. Each time a new configuration is added, Marvin will go back through all the old kernels and collect data. This is probably not a task that will ever finish. There will always be some new issue, be it due to a hardware change, a new class of workload as the usage of computers evolves, or a modification that fixed one problem and introduced another. Fun times!