Stabilising performance after a major kernel revision

A topic on upstreaming patches from the kernel forks used for embedded platforms is currently being discussed for Kernel Summit 2016. This is an age-old debate about whether it is better to work upstream and backport, or to apply patches to a product-specific kernel and worry about forward-porting later. The points being raised have not changed over the years and still come down to getting something out the door quickly versus long-term maintenance overhead. I’m not directly affected so had nothing new to add to the thread.

However, I’ve had recent experience stabilising the performance of an upstream kernel after a major kernel revision in the context of a distribution kernel. The kernel in question follows an upstream-first-and-then-backport policy with very rare exceptions. The backports are almost always related to hardware enablement, but performance-related patches are also cherry-picked, which is my primary concern as Performance Team Lead. The difficulty we face is that the distribution kernel is faster than both the baseline upstream stable kernel and the mainline kernel we rebase to for a new release. There are usually multiple root causes and, because of the cherry-picking, it’s not a simple case of bisecting.

Performance is always workload and hardware specific so I’m not going to get into the performance figures and profiles used to make decisions, but the patches in question are on a public git tree if someone were sufficiently motivated. There may also be an attempt to get some of them into the -stable kernel, without any guarantee they will be picked up. Right now, it’s still a work in progress but this list gives an idea of the number of patches involved;

  • 6 months stabilisation effort spread across 8 people
  • 89 patches related to performance that could be in -stable
  • More patches already merged to -stable
  • +5 patches reducing debugging overhead
  • +4 patches related to vmstat handling
  • +2 patches related to workqueues
  • +8 patches related to Transparent Huge Page overhead
  • +3 patches related to NUMA balancing
  • +30 patches related to scheduler
  • +70 patches related to locking
  • Over 4000 patches related to feature and hardware enablement

This is an incomplete list and it’s a single case that may or may not apply to other people and products. I do have anecdotal evidence that other companies carry far fewer patches when stabilising performance, but in many cases those same companies have a fixed set of well-known workloads whereas this is a distribution kernel for general use.

This is unrelated to the difficulties embedded vendors have when shipping a product, but let’s just say that I have a certain degree of sympathy when a major kernel revision is required. That said, my experience suggests that the effort required to stabilise a major release periodically is lower than carrying an ever-increasing number of backports that get harder and harder to apply.

Continual testing of mainline kernels

It is not widely known that the SUSE Performance team runs continual testing of mainline kernels and collects data on machines that would otherwise be idle. Testing is a potential topic for Kernel Summit 2015, so now seems as good a time as any to introduce Marvin. Marvin is a system that continually runs performance-related tests and is named after another robot doomed to repetitive tasks. When tests are complete it generates a performance comparison report that is publicly available but rarely linked. The primary responsibility of this system is to check SUSE Linux Enterprise kernels for performance regressions but it is also configured to run tests against mainline releases. There are four primary components of Marvin that are of interest.

The first component is the test client, which is a copy of MMTests. The use of MMTests ensures that the tests can be independently replicated and the methodology examined. The second component is Bob, a builder that monitors git trees for new kernels to test, builds the kernel when it’s released and schedules it to be tested. In practice this monitors the SLE kernel tree continually and checks the mainline git tree once a month for new releases. Bob only builds and queues released kernels and ignores -rc kernels in mainline. The reason for this is simple: time. The full battery of tests can take up to a month to complete in some cases and it’s impractical to do that on every -rc release. There are times when a small subset of tests will be checked for a pre-release kernel, but only when someone on the performance team is checking a specific series of patches and it’s urgent to get the results quickly. When tests complete, it’s Bob that generates the report. The third component is Marvin itself, which runs on the server with one instance per test machine. It checks the queue, prepares the test machine and executes tests when the machine is ready. The final component is a configuration manager that is responsible for reserving machines for exclusive use, managing power, managing serial consoles and deploying distributions automatically. The inventory management does not have a specific name as it’s different depending on where Marvin is set up.
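
To make the division of labour concrete, the git-monitoring side of Bob boils down to something like the following. This is purely an illustrative sketch and not Bob’s actual code; the tree and queue paths are made up and the real thing also has to cope with building, packaging and error handling.

        #!/bin/bash
        # Sketch of a release-watching loop. Released kernels are queued for
        # testing; -rc tags are deliberately ignored because a full battery of
        # tests can take weeks to complete.
        TREE=/srv/marvin/linux.git          # hypothetical local clone
        QUEUE=/srv/marvin/queue             # hypothetical queue directory

        cd "$TREE" || exit 1
        while true; do
                git fetch --tags --quiet origin
                for tag in $(git tag --list 'v[0-9]*' | grep -v -e '-rc'); do
                        [ -e "$QUEUE/$tag" ] && continue
                        echo "Queueing build and test of $tag"
                        touch "$QUEUE/$tag"
                done
                sleep 86400
        done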

There are two installations of Marvin: one that runs in my house and a second that runs within SUSE, and they have slightly different configurations. Technically Marvin supports testing on different distributions but only openSUSE and SLE are deployed. SLE kernels are tested on the corresponding SLE distribution. The Marvin instance in my house tests kernels 3.0 up to 3.12 on openSUSE 13.1 and then kernels 3.12 up to current mainline on openSUSE 13.2. In the SUSE instance, SLE 11 SP3 is used as the distribution for testing kernels 3.0 up to 3.12 and openSUSE 13.2 is used for 3.12 and later kernels. The kernel configuration used corresponds to the distribution. The raw results are not publicly available but the reports are generated on private servers and mirrored once a week to the following locations;

Dashboard for kernels 3.0 to 3.12 on openSUSE 13.1 running on home machines

Dashboard for kernels 3.12 to current on openSUSE 13.2 running on home machines

Dashboard for kernels 3.0 to 3.12 on SLE 11 SP3 running on SUSE machines

Dashboard for kernels 3.12 to current on openSUSE 13.2 running on SUSE machines

The dashboard is intended to be a very high-level view detailing whether there are regressions in comparison to a baseline. For example, in the first report linked above, the baseline is always going to be a 3.0-based kernel. It needs a human to establish if the regression is real or if it’s an acceptable trade-off. The top section makes a guess as to where the biggest regressions might be but it’s not perfect so double-check. Each test that was conducted is then listed. The name of the test corresponds to an MMTests configuration file in configs/ with extensions naming the filesystem used if that is applicable. The remaining columns are machines, with a number representing a performance delta. 1 means there is no difference. 1.02 would mean there is a 2% difference and the colour indicates whether it is a performance regression or gain. Green is good, red is bad, gray or white is neutral. The report automatically guesses whether the result is significant, which is why 0.98 might be a 2% performance regression (red) in one test and in the noise for another.
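
As a worked example with made-up numbers, if the baseline kernel managed 3450 operations per second on some test and the kernel being compared managed 3381, the cell would read 0.98 and would only be coloured red if that difference is judged to be outside the noise:

        $ echo "3450 3381" | awk '{ printf "%.2f (%+.1f%%)\n", $2/$1, ($2/$1 - 1) * 100 }'
        0.98 (-2.0%)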

It is important to note that the dashboard figure is a very rough estimate that often collapses multiple values into a single number. There is no substitute for reading the detailed report and making an assessment. It is also important to note that Marvin is not fully up to date and some machines have not started testing 4.1. It is known that the reports are very ugly, but making them prettier has yet to climb up the list of priorities. Where possible we pick a regression and do something about it rather than make HTML pages look pretty.

The obvious question is what has been done with this data. When Marvin was first assembled, the intent was to identify and fix regressions between 2.6.32 (yes, really) and 3.12. This is one of the reasons why 3.12-stable contains so many performance related fixes. When a regression is found there is generally one of three outcomes. The first, obviously, is that it gets fixed. The second is that it is identified as an apparent, but not real, regression. Usually this means the old kernel had a bug that happened to benefit a particular benchmark. Tiobench is an excellent example. On old kernels there was a bug that preserved old pages and reclaimed new pages in certain circumstances. For most workloads this is terrible, but for tiobench it meant that parts of the file were cached and the IO appeared to complete faster, although it was a lie. The third possible outcome is that it’s slower, but it’s a tradeoff to win somewhere else and the tradeoff is acceptable. Some scheduler regressions fall under this heading where a context-switch micro-benchmark might be hurt but it’s because the scheduler is making an intelligent placement decision.

The focus on 3.12 is also why Marvin is not widely advertised within the community. It is rare that mainline developers are concerned with performance in -stable kernels unless the most recent kernel is also discussed. In some cases the most recent kernel may have the same regression, but it is common to discover there is simply a different mix of problems in a recent kernel. Each problem must be identified and addressed in turn, and time is spent on that instead of adding volume to LKML. Advertising the existence of Marvin was also postponed because some of the tests or reporting were buggy and each time I wanted to fix the problem first. There are very few that are known to be problematic now, but it takes a surprising amount of time to address all the problems that crop up when running tests across large numbers of machines. There are still issues lurking in there, but if a particular issue is important to you then let me know and I’ll see if it can be examined faster.

An obvious question is how this compares to other performance-based automated testing such as Intel’s 0-day kernel test infrastructure. The answer is that they are complementary. The 0-day infrastructure tests every commit to quickly identify both performance gains and regressions. Its tests are short-lived by necessity and are invaluable at quickly catching some classes of problems. The tests run by Marvin are much longer-lived and there is only overlap in a small number of places; the two systems are simply looking for different problems. Back in 2012 I was tempted to try integrating parts of what became Marvin with 0-day, but ultimately it was unnecessary as there is value in both. The other effort worth looking at is the results reported by the Phoronix Test Suite. In that case, it’s relatively rare that the data needed to debug a problem is included in the reports, which complicates matters, and in the few cases I examined in detail I had problems with the testing methodology. As MMTests already supported a large amount of what I was looking for, there was no benefit to discarding it and starting again with Phoronix while addressing any perceived problems there. Finally, on the site that reports the results there is a frequent emphasis on graphics performance or the relative performance between different hardware configurations. It is relatively rare that this is the type of comparison my team is interested in.

The next obvious question is how recent releases are performing. At this time I do not want to make a general statement as I have not examined all the data in sufficient detail and am currently developing a series aimed at one of the problems. When I work on mainline patches, it’s usually with reference to a problem I picked out after browsing through reports, targeting a particular subsystem or in response to a bug report. I’m not applying a systematic process to identify all regressions at this point and it’s still a manual process to determine whether a reported regression is real, apparent or a tradeoff. When a real regression is found then Marvin can optionally conduct an automated bisection but that process is usually “invisible” and is only reported indirectly in a changelog if the regression gets fixed.

So what’s next? The first item is that more attention is going to be paid to recent kernels, checking whether regressions were introduced since 3.12 that need addressing. The second is identifying any bottlenecks that exist in mainline that are not regressions but should still be addressed. The last, of course, is coverage. The first generation of Marvin focused on some common workloads and for a long time it was very useful. The number of problems it is finding is now declining, so other workloads will be added over time. Each time a new configuration is added, Marvin will go back through all the old kernels and collect data. This is probably not a task that will ever finish. There will always be some new issue, be it due to a hardware change, a new class of workload as the usage of computers evolves, or a modification that fixed one problem and introduced another. Fun times!

One week to go for LSF/MM

Linux Storage Filesystem and Memory Management Summit 2014 (LSF/MM 2014) is almost here with just under a week to go. The schedule is almost filled but there are still some gaps for last-minute topics if something crops up or attendees feel that something is missing. There will be no printed schedule to accommodate this, but there should be a QR code link to the live schedule on conference badges. As always, attendees are strongly encouraged to actively participate and post on the lists in advance if they want changes. The range of topics is very large, although personally I’m very much looking forward to the database/kernel topics and the large block interfaces because of my personal bias towards VM and VM/FS topics.

For those that will not be attending LSF/MM there will be the usual writeups and some follow-ups at Collaboration Summit. There will be a panel on the Thursday morning, consisting of at least one person from each track, to cover any topics missed by the writeups or to take additional comments from attendees. The database topic at LSF/MM has a limited number of attendees from the database community and there was more interest than anticipated, so we’ll also have a database/kernel interlock topic at Collaboration Summit on Thursday at 3pm. There was an announcement with a summary of the discussion to date and I’m hoping as many database and kernel community developers as possible will attend.

It’s looking like this will be a great LSF/MM, which continues to grow strongly from its humble origins when Nick Piggin organised the first FS/MM meeting (that I’m aware of at least). It’s appreciated that companies continue to fund developers to attend LSF/MM, which I personally consider to be the most important event that I attend each year. While companies and foundations fund people to attend, it would not be possible to run the event without the support of the sponsors as well. We have some great sponsors this year to help cover our costs and a few more have come on board since I wrote the first update. Facebook, NetApp and Western Digital are showing great support by sponsoring us at the platinum level. Google, IBM, Samsung and Seagate are supporting us as Gold Sponsors and we appreciate the additional support of our Silver Sponsors, Dataera, Pure Storage and SUSE. Thanks to all who are going to make LSF/MM 2014 an exceptional event.

Basic workload analysis

I recently ran into a problem where ebizzy performance was far lower than expected and bisection was identifying patches that did not make much sense. I suspected there was something wrong with the machine but I was not sure what. I’m not going into the details of what patches were involved or why it mattered, but I was told by some people that it is not always obvious how to start analysing a workload.

The problem is not a lack of tools. There are a lot of different ways that a workload can be analysed and which one is needed depends on the situation. This is not the only way ebizzy could have been analysed to reach the conclusions I reached and it is not even necessarily the best way. This is just what I did this time around as a case study.

This is how I was running ebizzy on a machine with 48 logical CPUs.

        # ./ebizzy -S 300 -t 48

The workload is expected to be CPU intensive, so in a second window I checked what the utilisation looked like with mpstat.

        Linux 3.12.0-vanilla (compass)  28/01/14        x86_64        (48 CPU)

        12:11:32     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
        12:11:33     all    1.01    0.00   11.24    0.00    0.00    0.00    0.00    0.00   87.75
        12:11:34     all    0.76    0.00   11.25    0.00    0.00    0.00    0.00    0.00   87.99
        12:11:35     all    0.86    0.00   10.71    0.00    0.00    0.00    0.00    0.00   88.43
        12:11:36     all    0.78    0.00    9.84    0.00    0.00    0.00    0.00    0.00   89.38
        12:11:37     all    0.80    0.00   11.53    0.00    0.00    0.00    0.00    0.00   87.67
        12:11:38     all    0.86    0.00   12.25    0.00    0.00    0.00    0.00    0.00   86.89
        12:11:39     all    0.86    0.00   10.35    0.00    0.00    0.00    0.00    0.00   88.78

That is extremely high idle and system CPU time for a workload that is allegedly CPU intensive. Given the high system CPU usage, it made sense to see where in the kernel the threads were right now. Using ps -eL I found the thread pids and checked each thread’s kernel stack via /proc.

compass:~ # cat /proc/4112/stack
[] futex_wait_queue_me+0xd6/0x140
[] futex_wait+0x177/0x280
[] do_futex+0xd6/0x640
[] SyS_futex+0x6c/0x150
[] system_call_fastpath+0x1a/0x1f
[] 0xffffffffffffffff
compass:~ # cat /proc/4113/stack
[] futex_wait_queue_me+0xd6/0x140
[] futex_wait+0x177/0x280
[] do_futex+0xd6/0x640
[] SyS_futex+0x6c/0x150
[] system_call_fastpath+0x1a/0x1f
[] 0xffffffffffffffff
compass:~ # cat /proc/4114/stack
[] futex_wait_queue_me+0xd6/0x140
[] futex_wait+0x177/0x280
[] do_futex+0xd6/0x640
[] SyS_futex+0x6c/0x150
[] system_call_fastpath+0x1a/0x1f
[] 0xffffffffffffffff
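
Checking these by hand gets old quickly. A small convenience loop like the following, assuming the process is still running and the binary is still called ebizzy, dumps the kernel stack of every thread in one go:

        compass:~ # for tid in $(ps -eL -o lwp= -o comm= | awk '/ebizzy/ {print $1}'); do
        >       echo "== thread $tid =="
        >       cat /proc/$tid/stack
        > done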

There is a definite pattern here. Threads are blocked on a userspace lock or locks of some description. There are tools that identify hot locks in userspace but it never hurts to see if it is something obvious first. gdb can be used to attach to a pid to see. It’s disruptive but in this case we do not really care so

        compass:~ # gdb -p 4114
        (gdb) bt
        #0  __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
        #1  0x00007f2332d4a598 in _L_lock_8036 () at malloc.c:5143
        #2  0x00007f2332d46f07 in malloc_check (sz=139789156697632, caller=) at hooks.c:260
        #3  0x0000000000401320 in alloc_mem (size=524288) at ebizzy.c:254
        #4  0x000000000040175a in search_mem () at ebizzy.c:402
        #5  0x00000000004018ff in thread_run (arg=0xa650ec) at ebizzy.c:448
        #6  0x00007f23330820db in start_thread (arg=0x7f2323124700) at pthread_create.c:309
        #7  0x00007f2332db290d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

I had the glibc debugging symbols installed and had compiled ebizzy with debugging symbols, so I was able to get useful backtraces like this. The malloc_check set off alarm bells because in this day and age malloc is not going to be using a global lock unless something is screwed. As I was running openSUSE, it was easy to check what that was with

        compass:~ # sed -n '255,265p' /usr/src/debug/glibc-2.18/malloc/hooks.c
          if (sz+1 == 0) {
            __set_errno (ENOMEM);
            return NULL;
          }

          (void)mutex_lock(&main_arena.mutex);
          victim = (top_check() >= 0) ? _int_malloc(&main_arena, sz+1) : NULL;
          (void)mutex_unlock(&main_arena.mutex);
          return mem2mem_check(victim, sz);
        }

There is an obvious large lock right in the middle of the allocation path. A quick bit of digging around found that this is debugging code that is only enabled for malloc debugging and lo and behold

        compass:~ # echo $MALLOC_CHECK_
        3

This first problem was simple really. I had installed a beta version of openSUSE that installed the aaa_base-malloccheck package by default to catch bugs early. All the threads were sleeping contending on a process-wide lock as a result. Remove that package, start a new shell and

        compass:~ # mpstat 1
        Linux 3.12.0-vanilla (compass)  28/01/14        x86_64        (48 CPU)

        13:32:41     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
        13:32:42     all    7.04    0.00   92.96    0.00    0.00    0.00    0.00    0.00    0.00
        13:32:43     all    7.54    0.00   92.46    0.00    0.00    0.00    0.00    0.00    0.00
        13:32:44     all    7.04    0.00   92.96    0.00    0.00    0.00    0.00    0.00    0.00
        13:32:45     all    7.17    0.00   92.83    0.00    0.00    0.00    0.00    0.00    0.00

Ok, the CPU is no longer idle but that system time is stupidly high. The easiest way to get a quick view of the system is with “perf top”. Optionally add -G if you want callgraph data, although in that case I would also recommend using --freq to reduce the sample rate and avoid excessive system disruption.

        compass:~ # perf top --freq 100 -G
        -  24.99%  [kernel]            [k] down_read_trylock
           - down_read_trylock
              + 99.96% __do_page_fault
        +  19.33%  [kernel]            [k] handle_mm_fault
        +  17.38%  [kernel]            [k] up_read
        +   7.29%  [kernel]            [k] find_vma
        +   6.76%  libc-2.18.so        [.] __memcpy_sse2_unaligned
        +   3.75%  [kernel]            [k] __mem_cgroup_commit_charge
        +   3.11%  [kernel]            [k] __do_page_fault
        +   2.97%  [kernel]            [k] clear_page_c
        +   1.25%  [kernel]            [k] call_function_interrupt
        +   1.15%  [kernel]            [k] __mem_cgroup_count_vm_event
        +   1.02%  [kernel]            [k] smp_call_function_many
        +   0.86%  [kernel]            [k] _raw_spin_lock_irqsave
        +   0.84%  [kernel]            [k] get_page_from_freelist
        +   0.66%  [kernel]            [k] mem_cgroup_charge_common
        +   0.57%  [kernel]            [k] page_fault
        +   0.51%  [kernel]            [k] release_pages
        +   0.51%  [kernel]            [k] unmap_page_range
        +   0.44%  [kernel]            [k] __mem_cgroup_uncharge_common
        +   0.40%  [kernel]            [k] __pagevec_lru_add_fn
        +   0.39%  [kernel]            [k] page_add_new_anon_rmap
        +   0.38%  [kernel]            [k] generic_smp_call_function_single_interrupt
        +   0.36%  [kernel]            [k] zone_statistics
        +   0.32%  [kernel]            [k] _raw_spin_lock
        +   0.29%  [kernel]            [k] __lru_cache_add
        +   0.29%  [kernel]            [k] native_write_msr_safe
        +   0.20%  [kernel]            [k] page_remove_rmap
        +   0.20%  [kernel]            [k] flush_tlb_func

There are two interesting points to note here. First, you do not need to be a kernel expert to see that there is a lot of fault activity going on. Most of these functions are allocating pages, clearing them, updating page tables, doing LRU manipulations and so on. This could indicate there is a lot of buffer allocation/free activity in the workload. The second is that even though the faults are parallelised, they are still hitting heavily on a lock in the fault path, almost certainly mmap_sem in this case. Even though it’s taken for read, it must be bouncing like crazy to feature that high in the profile.

Anyway, the interesting question at this point is what is causing all the page faults. Sure, it’s probably ebizzy as the machine is otherwise idle and it’s small enough to figure out from a code inspection. However, the kernel has tracing and probing infrastructure for a reason and it should be possible to be lazy about it. In kernel 3.13 there is even a dedicated tracepoint for dealing with faults although right now this is on a 3.12 kernel. The options include dynamic probing, userspace stack tracing and a host of others but in this case the easiest by far is a systemtap script to hold it all together.

        global fault_address%, fault_eip%, user_trace%
        global nr_samples

        function userspace_eip:long () %{
                THIS->__retvalue = task_pt_regs(current)->ip;
        %}

        probe vm.pagefault {
                p = pid()
                key_address = sprintf("%d:%s:%p", p, execname(), address);
                key_eip     = sprintf("%d:%s:%p", p, execname(), userspace_eip());

                fault_address[key_address]++
                fault_eip[key_eip]++
                if (fault_eip[key_eip] == 100) {
                        user_trace[key_eip] = sprint_usyms(ubacktrace())
                }
                nr_samples++
        }

        probe timer.ms(5000) {
                printf("ndata address samplesn")
                foreach (key in fault_address- limit 5) {
                        printf ("%-30s%6dn", key, fault_address[key])
                }
                printf("eip samplesn")
                foreach (key in fault_eip- limit 5) {
                        printf ("%-30s%6dn", key, fault_eip[key])
                        if (fault_eip[key] >= 100) {
                                printf ("%sn", user_trace[key])
                        }
                }
                delete fault_address
                delete fault_eip
                delete user_trace
        }

Straight-forward enough, fire it up while ebizzy is running and after a few samples we get

        compass:~ # stap -d /lib64/libc-2.18.so -d /lib64/libpthread-2.18.so -d /lib64/ld-2.18.so -d ./ebizzy -DSTP_NO_OVERLOAD -DMAXSKIPPED=1000000 -g ./top-page-faults.stp
        eip samples
        6927:ebizzy:0x7f0772253006    118733
         0x7f0772253006 : __memcpy_sse2_unaligned+0xe6/0x220 [/lib64/libc-2.18.so]
         0x4017ba : search_mem+0xf5/0x1fe [/root/ebizzy]
         0x4018ff : thread_run+0x3c/0x96 [/root/ebizzy]
         0x7f077257a0db : start_thread+0xcb/0x300 [/lib64/libpthread-2.18.so]
         0x7f07722aa90d : clone+0x6d/0x90 [/lib64/libc-2.18.so]

        6927:ebizzy:0x7f077223e669      1245
         0x7f077223e669 : _int_malloc+0xaf9/0x1360 [/lib64/libc-2.18.so]
         0x7f077223fff9 : __libc_malloc+0x69/0x170 [/lib64/libc-2.18.so]
         0x401320 : alloc_mem+0x64/0xb8 [/root/ebizzy]
         0x40175a : search_mem+0x95/0x1fe [/root/ebizzy]
         0x4018ff : thread_run+0x3c/0x96 [/root/ebizzy]
         0x7f077257a0db : start_thread+0xcb/0x300 [/lib64/libpthread-2.18.so]
         0x7f07722aa90d : clone+0x6d/0x90 [/lib64/libc-2.18.so]

        7046:nis:0x7f4dc43844b7          104
         0x7f4dc43844b7 : _int_malloc+0x947/0x1360 [/lib64/libc-2.18.so]
         0x7f4dc4385ff9 : __libc_malloc+0x69/0x170 [/lib64/libc-2.18.so]
         0x475008 : 0x475008 [/bin/bash+0x75008/0x9c000]

        7047:nis:0x7f4dc43830da           33
        7049:systemctl:0x7f28d6af813f     33

        # Translate the top addresses in ebizzy of interest
        compass:~ # addr2line -e ebizzy 0x4017ba
        ebizzy-0.3/ebizzy.c:413
        compass:~ # addr2line -e ebizzy 0x401320
        ebizzy-0.3/ebizzy.c:254
        compass:~ # addr2line -e ebizzy 0x40175a
        ebizzy-0.3/ebizzy.c:402

        compass:~ # sed -n '410,415p' /root/git-private/mmtests-test/work/testdisk/sources/ebizzy-0.3/ebizzy.c
                        else
                                memcpy(copy, src, copy_size);

                        key = rand_num(copy_size / record_size, state);

                        if (verbose > 2)
        compass:~ # sed -n '395,405p' /root/git-private/mmtests-test/work/testdisk/sources/ebizzy-0.3/ebizzy.c
                /*
                 * If we're doing random sizes, we need a non-zero
                 * multiple of record size.
                 */
                if (random_size)
                        copy_size = (rand_num(chunk_size / record_size, state)
                                     + 1) * record_size;
                copy = alloc_mem(copy_size);

                if ( touch_pages ) {
                        touch_mem((char *) copy, copy_size);

The cause of the faults is memory copies to a newly allocated short-lived buffer that gets freed immediately, all of which takes place in search_mem(). It’s obvious when you have the tools to point you in the right direction. Working around the problem after that is up to you really. strace would reveal that the workload is dominated by calls to madvise, as this very crude example illustrates

        compass:~ # strace -f -e mmap,madvise ./ebizzy -S 2 -t 48 2>&1 | grep '(' | cut -d'(' -f1 | awk '{print $NF}' | sort | uniq -c
          17733 madvise
            161 mmap

Given infinite free time, the options would be to alter ebizzy to be less allocator intensive to avoid these paths in the kernel altogether, or to modify madvise(MADV_DONTNEED) to not suck, but how to do so is beyond the scope of this blog.
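
For a quick experiment rather than a fix, the glibc malloc trim and mmap thresholds can be raised so that the allocator holds onto the 512K buffers instead of handing the memory back to the kernel on every free. The MALLOC_TRIM_THRESHOLD_ and MALLOC_MMAP_THRESHOLD_ environment variables are real glibc tunables, but whether they make a measurable difference for this particular workload is left as an exercise:

        compass:~ # MALLOC_TRIM_THRESHOLD_=$((64*1024*1024)) \
        >           MALLOC_MMAP_THRESHOLD_=$((64*1024*1024)) \
        >           ./ebizzy -S 300 -t 48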

The key is that workload analysis is not impossible if you just spend the time to look at the available tools and then ask the questions the tools can answer for you.

LSF/MM 2014 so far

The planning for the Linux Storage Filesystem and Memory Management Summit 2014 is progressing nicely. There has been a solid response to the Call For Proposals sent out in December, with a wide range of topics being sent to the lists. These include high-level issues such as the kernel APIs currently exposed to userspace and their suitability for efficient IO, storage technology of different types with Shingled Magnetic Recording (SMR) being of particular interest, persistent memory and how it should be supported and utilised, supporting block sizes larger than the MMU page size, filesystem layouts for new storage technologies, numerous device mapper topics, VM scalability issues, memory compression; the list goes on. I’m looking forward very much to discussing topics related to database software requirements and what the kernel currently gets right and wrong. There was an exceptional response from the PostgreSQL community when the topic was raised and it looks very promising so far. As well as discussing the issue at LSF/MM, there should be a follow-up meeting at Collaboration Summit. Details about the Collaboration Summit meeting will be sent out when I manage to book a room to hold it in. Whatever else LSF/MM may be this year, it will not be boring or stuck for topics to talk about.

I am pleased to note that there are a number of new people sending in attend and topic mails. The long-term health of the community depends on new people getting involved and breaking through any perceived barrier to entry. It has certainly been the case for some time that there is more work to do in the memory manager than there are people available to do it, so it helps to know that there are new people on the way.

None of this would be possible without the people making the proposals and I’m glad that LSF/MM is looking strong again. If you have been planning on sending in an attend or topic request then there is still time to do it before the January 31st deadline. Get moving!

It would also not be possible without the continued support of sponsors without whom LSF/MM would be homeless or held in a temporary shed like the first FS/MM meeting was (cold, damp, biscuits tasted a little of despair). NetApp are continuing with Platinum support and IBM are our first Gold sponsor, thanks very much to both companies for helping us out. Other companies are expected to join us soon when the paperwork goes through and I look forward to welcoming them. If any companies are interested in sponsoring us then please feel free to get in touch with me (mgorman at suse.de) and I’ll see what I can do to help.

Here’s to a solid LSF/MM 2014!

Research reproducibility and software availability

I just finished reading the Economist’s article Trouble in the Lab. Nothing in the article was new to me as such, but I am pleased to see such concerns being raised in a mainstream publication. One reason mmtests exists is because it is important that any results I use to justify a patch’s inclusion in the Linux kernel can be reproduced. There is no guarantee the results can be reproduced for a variety of reasons, such as being dependent on the machine configuration, but the attempt can be made and anomalies, including bugs in the test itself, reported. That aside, what amused me was the link to Nature’s methodology checklist. The intent of the checklist is to ensure that “all technical and statistical information that is crucial to an experiment’s reproducibility” is available. It’s a great idea (even if there is no guarantee data remains available forever) and point 19 covers software availability. What amused me was that the checklist itself requires proprietary software (Adobe Acrobat in this case) to read what could have been a simple PDF. Clearly the intent was to have a standard form when submitting documents for publication, but it is ironic that a document about reproducibility is itself inconvenient to reproduce.

Automated kernel testing and QA (again)

Automated testing and QA came up yet again at kernel summit. This is a recurring topic discussed by various people at different summits. I think my own first time talking about it was at KS in 2007 shortly after I had started kernel development. At some summit in the past I also talked about MMTests which I use to drive the bulk of my own testing but I did not speak about it this year. This was not because I think everything is fine but because I did not see what I could change by talking about it again. This is a corrosive attitude so I will talk a bit about how I currently see things and what my own testing currently looks like.

Kernel testing does happen, but no one ever has or ever will be happy with the state of it as there never will be consensus on what constitutes a representative set of benchmarks or adequate hardware coverage. Any such exercise is doomed to miss something, but the objective is not to be perfect; it is to catch the bulk of the problems, particularly the obvious ones that crop up over and over again. Developers allegedly do their own testing, although it can be limited by hardware availability. Different organisations do their own internal testing that is sometimes publicly visible, such as SUSE and other distributions reporting through their various developers and QA departments, and Intel routinely testing within their performance team, with the work of Fengguang Wu on continual kernel testing being particularly visible. There are also some people that do public benchmarking, such as Phoronix, regardless of what people think of the data and how it is presented. This coverage is not unified and some workloads will never be adequately simulated by a benchmark, but in combination it means that there is a hopeful chance that regressions will be caught. Overall, as a community I think we still rely on enterprise distributions and their partners doing a complete round of testing on major releases as a backstop. It’s not an ideal situation but we are never going to have enough dedicated physical, financial or human resources to do a complete job.

I continue to use MMTests for the bulk of my own testing and its capabilities and coverage continue to grow even though I do not advertise it any more. Over the last two years I have occasionally run tests on the latest mainline kernels but it was very sporadic. Basically, if I was going away for a week travelling and thought of it, then I would queue up tests for recent mainline kernels. If I had some free time later then I might look through the results. I was never focused on catching regressions before mainline releases and I probably never will be, due to the amount of time it consumes, and I’m not testing-orientated per se. More often I would use the data to correlate bug reports with the closest equivalent mmtests and see whether a window could be identified where the regression was introduced and why. This is reactive, based on bug reports, and to combat this there are times when I am otherwise idle that I would like to preemptively look for regressions. Unfortunately, when this happens my test data is rarely up to date so the regression has to be reverified against the latest kernel. By the time that test completes the free time is gone and the opportunity missed. It would be nice to always have recent data to work with.

SUSE runs Hackweeks during which developers can work on whatever they want and the last one was October 7-11th, 2013. I took the opportunity to write “melbot” (name was a joke) which is meant to do all the grunt automation work that the real Mel should be doing but never has enough time for. There are a lot of components but none of them are particularly complex. It has a number of basic responsibilities.

  • Manage remote power and serial consoles
  • Reserve, prepare and release machines from supported test grids
  • Deploy distribution images
  • Install packages
  • Build kernel rpms or from source, install and reboot to the new kernel
  • Monitor git trees for new kernels it should be testing
  • Install and configure mmtests
  • Run mmtests, log results, generate (not a very pretty) report

There is a test co-ordinator server and a number of test clients that are part of a grid, where both a local grid and a grid within SUSE are supported. To watch git trees, build rpms if necessary, queue jobs and report on jobs, there is a Bob The Builder script. Kernel deployment and test execution are the responsibility of melbot. Getting the whole thing going is easy and looks something like

$ screen -S bob-the-builder -d -m /srv/melbot/bob-the-builder/bob-loop.sh
$ screen -S melbot-MachinaA -d -m /srv/melbot/melbot/melbot-loop.sh MachinaA
$ screen -S melbot-MachinaB -d -m /srv/melbot/melbot/melbot-loop.sh MachinaB

Once melbot is running it’ll check the job queue, run any necessary tests and record the results. If any problems are encountered that it cannot handle automatically, including tests taking longer than expected, melbot emails me a notification.

None of the source for melbot is released because it is a giant hack and requires undocumented manual installation. I doubt it would be of use anyway as companies with enough resources probably have their own automation already. SUSE internally has automation for hardware management and test deployment that melbot reuses large chunks of. If I can crank out basic server-side automation in a week then a dedicated team can do it and probably a lot better. The key for me is that there is now a web page containing recent mainline kernel comparisons for mmtests. Each directory there corresponds to a similarly named configuration file in the top-level directory configs/ in mmtests. As I write this, the results are not fully up to date yet as Melbot has only been running 12 days on this small test grid and will take another 5-10 days to fully catch up. Once it has caught up, it’ll check for recent kernels to test on the 17th of every month and will continually update as long as it is running. As I am paying the electricity bill on this, it might not always be running!
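
For anyone who wants to poke at the same comparisons locally, running one of those configurations is straightforward. This is a sketch based on the current mmtests layout; the exact configuration file name here is only indicative as names drift over time, so check configs/ for the one matching the report directory of interest:

        $ git clone https://github.com/gormanm/mmtests.git
        $ cd mmtests
        $ ls configs/
        $ ./run-mmtests.sh --config configs/config-global-dhp__pagealloc-performance 3.12-vanilla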

These test machines are new as I lost most of my test grid over the last two months due to hardware failure and all my previous results were discarded. I have not looked through these results in detail as I’m not long back from kernel summit, but let’s look through a few now and see what sort of hilarity might be waiting. The two test machines are ivor and ivy and I’m not going to describe what type of machines they are. FWIW, they are low-end machines with single-disk rotary storage.

Page allocator performance (ivor)
Page allocator performance (ivy)

kernbench is a simple kernel compilation benchmark. Ivor is up to 3.10-stable and is not generally showing any regressions; 3.10.0 was a mess but it got fixed in -stable. Ivy is only up as far as 3.4.66, but it looks like elapsed time was regressing at that point when it did not during 3.4, implying that a regression may have been introduced in -stable there. Worth keeping an eye on to see what more recent kernels look like in a week or so.

aim9 is a microbench that is not very reliable but can be an indicator of greater problems. It’s not reliable as it is almost impossible to bisect with and is sensitive to a lot of factors. Regardless, ivor and ivy are both seeing a number of regressions and the system CPU time is negligible, so something weird is going on. There are small amounts of IO going on in the background, probably from the monitoring, so it could be interference but it seems too low to affect the type of tests that are running. Interrupts are surprisingly high in recent kernels, no idea why.

vmr-stream catches nasty regressions related to cache coloring. It is showing nothing interesting today.

page allocator microbench shows little of interest. It shows 3.0 sucked in comparison to 3.0-stable but it’s known why. 3.10 also sucked but got fixed according to ivor.

pft is a page fault microbench. Ivor shows that 3.10 once again completely sucked and, while 3.10.17 was better, it’s still regressing slightly and the system CPU is higher so something screwy is going on there. Ivy looks ok currently but it has a long way to go.

So, some minor things there. The greatest concern is pft: pinning down why the system CPU usage is higher and whether it got fixed in a later mainline kernel. If it got fixed later then there may be a case for backporting it, but it would depend on who is using 3.10 longterm kernels.

Local network performance (ivor)
Local network performance (ivy)

netperf-udp runs the UDP_STREAM test from netperf on localhost. Performance is completely mutilated; it went to hell somewhere between 3.2 and 3.4 and has been worse to varying degrees ever since, according to both ivor and ivy.

netperf-tcp tells a different tale. On ivor it regressed between 3.0 and 3.2, but 3.4 was not noticeably worse and it was massively improved in 3.10-stable by something. This possibly indicates that the network people are paying closer attention to TCP than UDP, but it could also indicate that loopback testing of networking is not common as it is not usually that interesting. Ivy has not progressed far enough but looked like it saw similar regressions to ivor for a while.

tbench4 generally looks ok, not great, but not bad enough to care although it is interesting to note that 3.10.17 is more variable than earlier kernels on ivor at least.

Of that, the netperf-udp performance would be of greatest concern. Given infinite free time, and if that machine were free, it is fortunately trivial to bisect these problems at least. There is a bisection script that uses the same infrastructure to bisect, build, install and test kernels. It just has to be tuned to pick the value that is “bad”. If the results for netperf-udp are still bad when the 3.12 tests complete then I’ll bisect what happened in 3.4 and report accordingly. I’m guessing that it’ll be dropped as loopback is just not the common case.
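
For anyone who has not driven an automated performance bisection before, the shape of it is roughly the following. This is only a sketch: the test script is a hypothetical stand-in, and the real bisection script also handles installing the candidate kernel and rebooting the test machine, which is glossed over entirely here.

        $ git bisect start v3.4 v3.2            # bad release first, then the known-good one
        $ git bisect run ./check-netperf-udp.sh

where check-netperf-udp.sh would be something along the lines of

        #!/bin/sh
        # Hypothetical stand-in for the real test script
        make -j"$(nproc)" > /dev/null || exit 125   # 125 tells git bisect to skip commits that do not build
        TPUT=$(run-the-benchmark-somehow)           # placeholder, not a real command
        # Exit 0 (good) only if throughput is close to the figure seen on the good kernel
        [ "${TPUT%.*}" -ge 8000 ]                   # 8000 is a made-up threshold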

There are probably a lot of surprises in there and at some point I should spend a day reading through them all, bisecting any problems and filing bugs as appropriate. I have no idea when I will find the time to do that. There is always the temptation that when I have that free time I’ll extend melbot to find those bugs and bisect them for me, if the class of regressions continues to be fairly obvious. By rights I should coordinate with Fengguang Wu to run many of the short-lived tests with his automation and automatically identify basic regressions that way. As always, there is no shortage of solutions, just of time to execute and maintain them.

Swapping over NBD/NFS and settling debts

A large number of patches were merged in 3.6-rc1 to support activating a swapfile on NFS and to improve the reliability of swapping over a network block device. While it has been possible to activate swap against an NBD device for a long time, it was not exactly a great idea as significant memory pressure would likely cause the machine to freeze, requiring a reboot. I expect that this particular problem is in the past but if not, no doubt I’ll get cc’d on a bug report. This feature was something I pushed for a while as I knew there were a number of use cases that people cared about, and it’s something that was carried in SLES for quite a while as a result.

However, this blog entry was unexpected and it gave me a swelled head, particularly coming from someone like Wouter. That beer also sounds mighty tempting but the thing is, and he almost certainly does not remember this, we have met before, somewhere between six and eight years ago, drinking beer outside one of the buildings at FOSDEM. I’m not exactly sure when, it was a hell of a long time ago. He bought a few spares after a trip to a bar and gave me one but left before I had the chance to buy one back. This effectively broke the Round System and for some stupid reason I remembered it pretty much every time I read a post of his on Planet Debian. For now, assuming Wouter does not file any bug reports after 3.6 comes out, then maybe I can just call us even and let it go :-)

MM Tests

At LSF/MM a long time ago, there were a few people looking for an MM-orientated set of benchmarks. At the time I talked about some of the benchmarks I used and explained that I had some scripts lying around. I was meant to release a collection quickly but until a few months ago it was a very low priority. Recently I needed a testing framework, so I finally put something together that would last longer than a single patch series. To shake out problems related to it, I ran it against a number of mainline kernels and recorded the results, although I have not actually read through the results properly. Rather than repeating myself, the details are posted on the mailing list.

My signature gets to change

It took a while and the last 12 months have been extremely packed but now I get to update my mail signature.

 

Mel Gorman
Part-time Phd Student                   Linux Technology Center
University of Limerick                  IBM Dublin Software Lab

As of Wednesday 19th, I’ve finished in UL as I graduated with my PhD. I took the last week off to relax hence a total lack of responsiveness to mail or IRC for those that were trying to get in touch. I expect the next few weeks to be heavily disrupted as the next long-term plan is put together but I’ll get back on track eventually.