Automated kernel testing and QA (again)

Automated testing and QA came up yet again at kernel summit. This is a recurring topic discussed by various people at different summits. I think my first time talking about it was at KS in 2007, shortly after I had started kernel development. At a past summit I also talked about MMTests, which I use to drive the bulk of my own testing, but I did not speak about it this year. That was not because I think everything is fine but because I did not see what I could change by talking about it again. That is a corrosive attitude, so I will talk a bit about how I currently see things and what my own testing currently looks like.

Kernel testing does happen, but no one ever has been or ever will be happy with the state of it, as there will never be consensus on what constitutes a representative set of benchmarks or adequate hardware coverage. Any such exercise is doomed to miss something, but the objective is to catch the bulk of the problems, particularly the obvious ones that crop up over and over again, rather than to be perfect. Developers allegedly do their own testing, although it can be limited by hardware availability. Different organisations do their own internal testing that is sometimes publicly visible: SUSE and other distributions report through their developers and QA departments, and Intel routinely tests within its performance team, with Fengguang Wu's work on continual kernel testing being particularly visible. There are also some people that do public benchmarking, such as Phoronix, regardless of what people think of the data and how it is presented. This coverage is not unified and some workloads will never be adequately simulated by a benchmark, but in combination it means there is a hopeful chance that regressions will be caught. Overall, as a community, I think we still rely on enterprise distributions and their partners doing a complete round of testing on major releases as a backstop. It is not an ideal situation, but we are never going to have enough dedicated physical, financial or human resources to do a complete job.

I continue to use MMTests for the bulk of my own testing and its capabilities and coverage continue to grow even though I no longer advertise it. Over the last two years I have occasionally run tests on the latest mainline kernels, but it was very sporadic. Basically, if I was going away for a week travelling and thought of it, then I would queue up tests for recent mainline kernels. If I had some free time later then I might look through the results. I was never focused on catching regressions before mainline releases and I probably never will be, due to the amount of time it consumes and because I am not testing-orientated per se. More often I would use the data to correlate bug reports with the closest equivalent mmtests and see whether a window could be identified in which the regression was introduced and why. This is reactive, driven by bug reports, and to combat it there are times when I am otherwise idle that I would like to preemptively look for regressions. Unfortunately, when this happens my test data is rarely up to date, so the regression has to be reverified against the latest kernel. By the time that test completes the free time is gone and the opportunity missed. It would be nice to always have recent data to work with.

SUSE runs Hackweeks during which developers can work on whatever they want; the last one was October 7-11th, 2013. I took the opportunity to write “melbot” (the name was a joke), which is meant to do all the grunt automation work that the real Mel should be doing but never has enough time for. There are a lot of components but none of them are particularly complex. It has a number of basic responsibilities, and a rough sketch of one of them follows the list.

  • Manage remote power and serial consoles
  • Reserve, prepare and release machines from supported test grids
  • Deploy distribution images
  • Install packages
  • Build kernels as rpms or from source, install them and reboot into the new kernel
  • Monitor git trees for new kernels it should be testing
  • Install and configure mmtests
  • Run mmtests, log results, generate (not a very pretty) report
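
None of the individual steps are clever. The sketch below shows roughly what the "build, install and reboot" responsibility amounts to. It is a simplified illustration, not melbot's actual code: the power_cycle and wait_for_ssh helpers stand in for the remote power and serial console management, and the paths and addresses are invented.

#!/bin/bash
# Simplified sketch of one melbot responsibility: build a kernel from
# source on a test client, install it and reboot into it. power_cycle
# and wait_for_ssh are hypothetical helpers standing in for the remote
# power/serial console management; paths and addresses are invented.
set -e
MACHINE=$1      # e.g. MachinaA
GIT_TREE=$2     # e.g. /usr/src/linux
GIT_TAG=$3      # e.g. v3.12-rc6

ssh root@$MACHINE "cd $GIT_TREE &&
	git checkout $GIT_TAG &&
	yes '' | make oldconfig &&
	make -j\$(nproc) &&
	make modules_install install"

# Reboot into the new kernel. If the machine does not come back within
# ten minutes, power cycle it and notify a human.
ssh root@$MACHINE reboot || true
if ! wait_for_ssh $MACHINE 600; then
	power_cycle $MACHINE
	echo "$MACHINE failed to boot $GIT_TAG" | \
		mail -s "melbot: boot failure on $MACHINE" me@example.com
	exit 1
fi
echo "$MACHINE is now running $(ssh root@$MACHINE uname -r)"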

There is a test co-ordinator server and a number of test clients that are part of a grid; both a local grid and a grid within SUSE are supported. A Bob The Builder script watches git trees, builds rpms if necessary, queues jobs and reports on them. Kernel deployment and test execution are the responsibility of melbot. Starting the whole thing is easy and looks something like

$ screen -S bob-the-builder -d -m /srv/melbot/bob-the-builder/bob-loop.sh
$ screen -S melbot-MachinaA -d -m /srv/melbot/melbot/melbot-loop.sh MachinaA
$ screen -S melbot-MachinaB -d -m /srv/melbot/melbot/melbot-loop.sh MachinaB

Once melbot is running it’ll check the job queue, run any necessary tests and record the results. If any problems are encountered that it cannot handle automatically, including tests taking longer than expected, melbot emails me a notification.
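
For what it is worth, the client loop itself is nothing more exotic than something along these lines. This is a hedged sketch, not the real script: the queue layout, the deploy_kernel and run_remote_mmtests helper scripts and the timeouts are all invented for illustration.

#!/bin/bash
# Hypothetical melbot-loop.sh: take the next job for a machine from a
# queue directory, run it with a timeout and email a human when
# something goes wrong. All names, paths and timeouts are illustrative.
MACHINE=$1
QUEUE=/srv/melbot/queue/$MACHINE
RESULTS=/srv/melbot/results/$MACHINE
NOTIFY=me@example.com

while true; do
	JOB=$(ls $QUEUE 2>/dev/null | head -n1)
	if [ -z "$JOB" ]; then
		sleep 300
		continue
	fi

	# A job file is assumed to define KERNEL and CONFIG. Move it out
	# of the queue first so a failing job is not retried forever.
	mv $QUEUE/$JOB $RESULTS/$JOB
	. $RESULTS/$JOB

	if ! timeout 12h deploy_kernel $MACHINE $KERNEL; then
		echo "Deploying $KERNEL on $MACHINE failed" | \
			mail -s "melbot: deploy failure" $NOTIFY
		continue
	fi

	if ! timeout 48h run_remote_mmtests $MACHINE $CONFIG \
			> $RESULTS/$KERNEL-$CONFIG.log 2>&1; then
		echo "$CONFIG on $KERNEL failed or took too long" | \
			mail -s "melbot: test problem on $MACHINE" $NOTIFY
	fi
done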

None of the source for melbot is released because it is a giant hack and requires undocumented manual installation. I doubt it would be of use anyway as companies with enough resources probably have their own automation already. SUSE internally has automation for hardware management and test deployment, large chunks of which melbot reuses. If I can crank out basic server-side automation in a week then a dedicated team can do it, and probably a lot better. The key for me is that there is now a web page containing recent mainline kernel comparisons for mmtests. Each directory there corresponds to a similarly named configuration file in the top-level directory configs/ in mmtests. As I write this, the results are not fully up to date yet as melbot has only been running 12 days on this small test grid and will take another 5-10 days to fully catch up. Once it has caught up, it will check for recent kernels to test on the 17th of every month and will continually update for as long as it is running. As I am paying the electricity bill on this, it might not always be running!
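
For anyone who wants to reproduce a result by hand, running a single configuration against the currently booted kernel looks roughly like the following. The config file named here is just an example; pick a real file from configs/, and treat the switches as approximate as they have changed between mmtests releases.

$ git clone https://github.com/gormanm/mmtests.git
$ cd mmtests
$ ./run-mmtests.sh --no-monitor --config configs/config-global-dhp__pagealloc-performance `uname -r`

Repeating the run with the same config after rebooting into another kernel gives results that can be compared in the same style as the reports linked above.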

These test machines are new as I lost most of my test grid over the last two months due to hardware failure, and all my previous results were discarded. I have not looked through these results in detail as I am not long back from kernel summit, but let's look through a few now and see what sort of hilarity might be waiting. The two test machines are ivor and ivy and I'm not going to describe what type of machines they are. FWIW, they are low-end machines with single-disk rotary storage.

Page allocator performance (ivor)
Page allocator performance (ivy)

kernbench is a simple kernel compilation benchmark. Ivor is up to 3.10-stable and is not showing any regressions there; 3.10.0 was a mess but it got fixed in -stable. Ivy is only up as far as 3.4.66, but it looks like elapsed time was regressing at that point when it did not during 3.4, implying that a regression may have been introduced to -stable there. Worth keeping an eye on to see what more recent kernels look like in a week or so.

aim9 is a microbenchmark that is not very reliable but can be an indicator of greater problems. It is unreliable because it is almost impossible to bisect with and is sensitive to a lot of factors. Regardless, ivor and ivy are both seeing a number of regressions and the system CPU time is negligible, so something weird is going on. There are small amounts of IO going on in the background, probably from the monitoring, so it could be interference, but it seems too low to affect the type of tests that are running. Interrupts are surprisingly high in recent kernels; no idea why.

vmr-stream catches nasty regressions related to cache coloring. It is showing nothing interesting today.

The page allocator microbenchmark shows little of interest. It shows that 3.0 sucked in comparison to 3.0-stable, but the reason for that is known. 3.10 also sucked but got fixed, according to ivor.

pft is a page fault microbenchmark. Ivor shows that 3.10 once again completely sucked and, while 3.10.17 was better, it is still regressing slightly and the system CPU usage is higher, so something screwy is going on there. Ivy looks ok currently but it has a long way to go.

So there are some minor things there. pft is the greatest concern: it needs to be pinned down why the system CPU usage is higher and whether it got fixed in a later mainline kernel. If it did get fixed later then there may be a case for backporting the fix, but that would depend on who is using 3.10 longterm kernels.

Local network performance (ivor)
Local network performance (ivy)

netperf-udp runs the UDP_STREAM test from netperf on localhost. Performance is completely mutilated: it went to hell somewhere between 3.2 and 3.4 and has been worse to varying degrees ever since, according to both ivor and ivy.
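
For context, what this configuration exercises is essentially a loopback netperf run over a range of message sizes, along the lines of the commands below. The sizes and duration shown are illustrative rather than the exact mmtests settings.

$ netserver
$ for SIZE in 64 256 1024 2048 4096 8192; do netperf -t UDP_STREAM -H 127.0.0.1 -l 60 -- -m $SIZE; done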

netperf-tcp tells a different tale. On ivor it regressed between 3.0 and 3.2, but 3.4 was not noticeably worse and it was massively improved in 3.10-stable by something. This possibly indicates that the network people are paying closer attention to TCP than UDP, but it could also indicate that loopback testing of the network stack is not common as it is not usually that interesting. Ivy has not progressed far enough, but it looked like it saw similar regressions to ivor for a while.

tbench4 generally looks ok: not great, but not bad enough to care about, although it is interesting to note that 3.10.17 is more variable than earlier kernels, on ivor at least.

Of that group, the netperf-udp performance is the greatest concern. Fortunately, given infinite free time and if that machine were free, these problems are at least trivial to bisect. There is a bisection script that uses the same infrastructure to bisect, build, install and test kernels; it just has to be tuned to pick the value that counts as “bad”. If the results for netperf-udp are still bad when the 3.12 tests complete then I'll bisect what happened in 3.4 and report accordingly. I'm guessing that the report will be dropped as loopback is just not the common case.
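
The mechanics are standard git bisect run fare. Below is a minimal sketch of the sort of test script involved, assuming hypothetical deploy_and_boot and run_netperf_udp helpers that reuse the same infrastructure; the threshold is an invented number that in practice would be taken from the known-good 3.2 results.

#!/bin/bash
# test-netperf-udp.sh: hypothetical script for git bisect run. Exit 0
# if throughput is acceptable (good), 1 if it regressed (bad) and 125
# to skip commits that fail to build or boot. deploy_and_boot and
# run_netperf_udp stand in for the melbot infrastructure and the
# threshold is an illustrative value, not a real measurement.
THRESHOLD_MBITS=4000
MACHINE=MachinaA

deploy_and_boot $MACHINE $(git rev-parse HEAD) || exit 125
RESULT=$(run_netperf_udp $MACHINE)	# reports throughput in Mbits/sec

if [ $(echo "$RESULT >= $THRESHOLD_MBITS" | bc) -eq 1 ]; then
	exit 0
else
	exit 1
fi

Driving it is then just a matter of “git bisect start v3.4 v3.2” followed by “git bisect run ./test-netperf-udp.sh”, after which git will land on the offending commit without further supervision.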

There are probably a lot of surprises in there and at some point I should spend a day reading through them all, bisecting any problems and filing bugs as appropriate. I have no idea when I will find the time to do that. There is always the temptation, when that free time arrives, to instead extend melbot to find those bugs and bisect them for me if the class of regressions continues to be fairly obvious. By rights I should coordinate with Fengguang Wu to run many of the short-lived tests with his automation and automatically identify basic regressions that way. As always, there is no shortage of solutions, just of time to execute and maintain them.