Basic workload analysis

I recently ran into a problem where ebizzy performance was far lower than expected and bisection was identifying patches that did not make much sense. I suspected there was something wrong with the
machine but I was not sure what. I’m not going into the details of what patches were involved or why it mattered but I was told by some people that it is not always obvious how to start analysing a workload.

The problem is not a lack of tools. There are a lot of different ways that a workload can be analysed and which one is needed depends on the situation. This is not the only way ebizzy could have been analysed
to reach the conclusions I reached and it is not even necessarily the best way. This is just what I did this time around, as a case study.

This is how I was running ebizzy on a machine with 48 logical CPUs: one thread per CPU for a 300 second run.

        # ./ebizzy -S 300 -t 48

The workload is expected to be CPU intensive so in a second window I checked with mpstat what the utilisation looked like.

        Linux 3.12.0-vanilla (compass)  28/01/14        x86_64        (48 CPU)

        12:11:32     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
        12:11:33     all    1.01    0.00   11.24    0.00    0.00    0.00    0.00    0.00   87.75
        12:11:34     all    0.76    0.00   11.25    0.00    0.00    0.00    0.00    0.00   87.99
        12:11:35     all    0.86    0.00   10.71    0.00    0.00    0.00    0.00    0.00   88.43
        12:11:36     all    0.78    0.00    9.84    0.00    0.00    0.00    0.00    0.00   89.38
        12:11:37     all    0.80    0.00   11.53    0.00    0.00    0.00    0.00    0.00   87.67
        12:11:38     all    0.86    0.00   12.25    0.00    0.00    0.00    0.00    0.00   86.89
        12:11:39     all    0.86    0.00   10.35    0.00    0.00    0.00    0.00    0.00   88.78

That is extremely high idle time, and a lot of system CPU time, for a workload that is allegedly CPU intensive. Given the high system CPU usage it made sense to see where in the kernel the threads were right now. Using ps -eL I found the thread PIDs and

        compass:~ # cat /proc/4112/stack
        [] futex_wait_queue_me+0xd6/0x140
        [] futex_wait+0x177/0x280
        [] do_futex+0xd6/0x640
        [] SyS_futex+0x6c/0x150
        [] system_call_fastpath+0x1a/0x1f
        [] 0xffffffffffffffff
        compass:~ # cat /proc/4113/stack
        [] futex_wait_queue_me+0xd6/0x140
        [] futex_wait+0x177/0x280
        [] do_futex+0xd6/0x640
        [] SyS_futex+0x6c/0x150
        [] system_call_fastpath+0x1a/0x1f
        [] 0xffffffffffffffff
        compass:~ # cat /proc/4114/stack
        [] futex_wait_queue_me+0xd6/0x140
        [] futex_wait+0x177/0x280
        [] do_futex+0xd6/0x640
        [] SyS_futex+0x6c/0x150
        [] system_call_fastpath+0x1a/0x1f
        [] 0xffffffffffffffff
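
With 48 threads, running cat against each stack by hand gets old quickly. A small shell loop over /proc/<pid>/task covers every thread in one go; a rough sketch, assuming a single ebizzy process so that pidof returns one pid:

        compass:~ # for t in /proc/$(pidof ebizzy)/task/*; do echo "== ${t##*/} =="; cat $t/stack; done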

There is a definite pattern here. The threads are blocked on a userspace lock, or locks, of some description. There are tools that identify hot locks in userspace but it never hurts to see if it is something obvious first. gdb can be used to attach to a pid and see where it is. It's disruptive, but in this case we do not really care, so

        compass:~ # gdb -p 4114
        (gdb) bt
        #0  __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
        #1  0x00007f2332d4a598 in _L_lock_8036 () at malloc.c:5143
        #2  0x00007f2332d46f07 in malloc_check (sz=139789156697632, caller=) at hooks.c:260
        #3  0x0000000000401320 in alloc_mem (size=524288) at ebizzy.c:254
        #4  0x000000000040175a in search_mem () at ebizzy.c:402
        #5  0x00000000004018ff in thread_run (arg=0xa650ec) at ebizzy.c:448
        #6  0x00007f23330820db in start_thread (arg=0x7f2323124700) at pthread_create.c:309
        #7  0x00007f2332db290d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
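
Attaching and stepping through one thread at a time is fine for a spot check. If every thread needed examining, gdb's batch mode can dump all of the userspace backtraces in a single attach; a sketch reusing the thread pid from above:

        compass:~ # gdb -p 4114 -batch -ex "thread apply all bt"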

I had the glibc debugging symbols installed and had compiled ebizzy with debugging symbols so I'd be able to get useful backtraces like the one above. The malloc_check frame set off alarm bells because in this day and age malloc is not going to be using a global lock unless something is screwed. As I was running openSUSE, it was easy to check what that code was with

        compass:~ # sed -n '255,265p' /usr/src/debug/glibc-2.18/malloc/hooks.c
          if (sz+1 == 0) {
            __set_errno (ENOMEM);
            return NULL;
          }

          (void)mutex_lock(&main_arena.mutex);
          victim = (top_check() >= 0) ? _int_malloc(&main_arena, sz+1) : NULL;
          (void)mutex_unlock(&main_arena.mutex);
          return mem2mem_check(victim, sz);
        }

There is an obvious large lock right in the middle of the allocation path. A quick bit of digging around found that this code is only used when malloc debugging is enabled, and lo and behold

        compass:~ # echo $MALLOC_CHECK_
        3
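
With MALLOC_CHECK_ set, glibc routes every allocation through the checking hooks seen in the backtrace and those serialise on the main arena mutex. A quick way to confirm, before changing anything on the system, is to clear the variable in the shell running the benchmark; a minimal sketch, assuming nothing else re-exports it:

        compass:~ # unset MALLOC_CHECK_
        compass:~ # ./ebizzy -S 300 -t 48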

This first problem was simple really. I had installed a beta version of openSUSE that installed the aaa_base-malloccheck package by default to catch bugs early. All the threads were sleeping contending on a process-wide lock as a result. Remove that package, start a new shell and

        compass:~ # mpstat 1
        Linux 3.12.0-vanilla (compass)  28/01/14        x86_64        (48 CPU)

        13:32:41     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
        13:32:42     all    7.04    0.00   92.96    0.00    0.00    0.00    0.00    0.00    0.00
        13:32:43     all    7.54    0.00   92.46    0.00    0.00    0.00    0.00    0.00    0.00
        13:32:44     all    7.04    0.00   92.96    0.00    0.00    0.00    0.00    0.00    0.00
        13:32:45     all    7.17    0.00   92.83    0.00    0.00    0.00    0.00    0.00    0.00

Ok, the CPU is no longer idle but that system time is stupidly high. The easiest way to get a quick view of the system is with “perf top”, optionally with -G if you want callgraph data, although in that case I would also recommend using --freq to reduce the sample rate and avoid excessive system disruption

        compass:~ # perf top --freq 100 -G
        -  24.99%  [kernel]            [k] down_read_trylock
           - down_read_trylock
              + 99.96% __do_page_fault
        +  19.33%  [kernel]            [k] handle_mm_fault
        +  17.38%  [kernel]            [k] up_read
        +   7.29%  [kernel]            [k] find_vma
        +   6.76%  libc-2.18.so        [.] __memcpy_sse2_unaligned
        +   3.75%  [kernel]            [k] __mem_cgroup_commit_charge
        +   3.11%  [kernel]            [k] __do_page_fault
        +   2.97%  [kernel]            [k] clear_page_c
        +   1.25%  [kernel]            [k] call_function_interrupt
        +   1.15%  [kernel]            [k] __mem_cgroup_count_vm_event
        +   1.02%  [kernel]            [k] smp_call_function_many
        +   0.86%  [kernel]            [k] _raw_spin_lock_irqsave
        +   0.84%  [kernel]            [k] get_page_from_freelist
        +   0.66%  [kernel]            [k] mem_cgroup_charge_common
        +   0.57%  [kernel]            [k] page_fault
        +   0.51%  [kernel]            [k] release_pages
        +   0.51%  [kernel]            [k] unmap_page_range
        +   0.44%  [kernel]            [k] __mem_cgroup_uncharge_common
        +   0.40%  [kernel]            [k] __pagevec_lru_add_fn
        +   0.39%  [kernel]            [k] page_add_new_anon_rmap
        +   0.38%  [kernel]            [k] generic_smp_call_function_single_interrupt
        +   0.36%  [kernel]            [k] zone_statistics
        +   0.32%  [kernel]            [k] _raw_spin_lock
        +   0.29%  [kernel]            [k] __lru_cache_add
        +   0.29%  [kernel]            [k] native_write_msr_safe
        +   0.20%  [kernel]            [k] page_remove_rmap
        +   0.20%  [kernel]            [k] flush_tlb_func

There are two interesting points to note here. First, you do not need to be a kernel expert to see that there is a lot of fault activity going on. Most of these functions are allocating pages, clearing them, updating page tables, doing LRU manipulations and so on, which could indicate the workload is doing a lot of buffer allocation and freeing. The second is that even though the faults are parallelised, they still hit very heavily on a lock in the fault path, almost certainly mmap_sem in this case given that down_read_trylock is being called from __do_page_fault. Even though it is taken for read, it must be bouncing like crazy to feature that high in the profile.
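
If the mmap_sem guess needed confirming, a kernel built with CONFIG_LOCK_STAT can name the contended lock directly; a rough sketch, assuming lock statistics are available on the running kernel and that the lock appears under its usual &mm->mmap_sem name:

        compass:~ # echo 1 > /proc/sys/kernel/lock_stat
        compass:~ # sleep 10; grep -A 2 mmap_sem /proc/lock_stat
        compass:~ # echo 0 > /proc/sys/kernel/lock_stat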

Anyway, the interesting question at this point is what is causing all the page faults. Sure, it's probably ebizzy, as the machine is otherwise idle and ebizzy is small enough to figure it out from code inspection. However, the kernel has tracing and probing infrastructure for a reason and it should be possible to be lazy about it. In kernel 3.13 there is even a dedicated tracepoint for page faults, although right now this machine is running a 3.12 kernel. The options include dynamic probing, userspace stack tracing and a host of others, but in this case the easiest by far is a systemtap script to hold it all together.
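
As an aside, on a 3.13 or later kernel much the same question could probably be answered with perf alone using that tracepoint; a rough sketch, assuming the exceptions:page_fault_user event name that was merged for 3.13:

        compass:~ # perf record -e exceptions:page_fault_user -a -g -- sleep 10
        compass:~ # perf report

That was not available on this 3.12 kernel, hence the script below.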

        global fault_address%, fault_eip%, user_trace%
        global nr_samples

        function userspace_eip:long () %{
                THIS->__retvalue = task_pt_regs(current)->ip;
        %}

        probe vm.pagefault {
                p = pid()
                key_address = sprintf("%d:%s:%p", p, execname(), address);
                key_eip     = sprintf("%d:%s:%p", p, execname(), userspace_eip());

                fault_address[key_address]++
                fault_eip[key_eip]++
                if (fault_eip[key_eip] == 100) {
                        user_trace[key_eip] = sprint_usyms(ubacktrace())
                }
                nr_samples++
        }

        probe timer.ms(5000) {
                printf("ndata address samplesn")
                foreach (key in fault_address- limit 5) {
                        printf ("%-30s%6dn", key, fault_address[key])
                }
                printf("eip samplesn")
                foreach (key in fault_eip- limit 5) {
                        printf ("%-30s%6dn", key, fault_eip[key])
                        if (fault_eip[key] >= 100) {
                                printf ("%sn", user_trace[key])
                        }
                }
                delete fault_address
                delete fault_eip
                delete user_trace
        }

Straightforward enough: fire it up while ebizzy is running and after a few samples we get

        compass:~ # stap -d /lib64/libc-2.18.so -d /lib64/libpthread-2.18.so -d /lib64/ld-2.18.so -d ./ebizzy -DSTP_NO_OVERLOAD -DMAXSKIPPED=1000000 -g ./top-page-faults.stp
        eip samples
        6927:ebizzy:0x7f0772253006    118733
         0x7f0772253006 : __memcpy_sse2_unaligned+0xe6/0x220 [/lib64/libc-2.18.so]
         0x4017ba : search_mem+0xf5/0x1fe [/root/ebizzy]
         0x4018ff : thread_run+0x3c/0x96 [/root/ebizzy]
         0x7f077257a0db : start_thread+0xcb/0x300 [/lib64/libpthread-2.18.so]
         0x7f07722aa90d : clone+0x6d/0x90 [/lib64/libc-2.18.so]

        6927:ebizzy:0x7f077223e669      1245
         0x7f077223e669 : _int_malloc+0xaf9/0x1360 [/lib64/libc-2.18.so]
         0x7f077223fff9 : __libc_malloc+0x69/0x170 [/lib64/libc-2.18.so]
         0x401320 : alloc_mem+0x64/0xb8 [/root/ebizzy]
         0x40175a : search_mem+0x95/0x1fe [/root/ebizzy]
         0x4018ff : thread_run+0x3c/0x96 [/root/ebizzy]
         0x7f077257a0db : start_thread+0xcb/0x300 [/lib64/libpthread-2.18.so]
         0x7f07722aa90d : clone+0x6d/0x90 [/lib64/libc-2.18.so]

        7046:nis:0x7f4dc43844b7          104
         0x7f4dc43844b7 : _int_malloc+0x947/0x1360 [/lib64/libc-2.18.so]
         0x7f4dc4385ff9 : __libc_malloc+0x69/0x170 [/lib64/libc-2.18.so]
         0x475008 : 0x475008 [/bin/bash+0x75008/0x9c000]

        7047:nis:0x7f4dc43830da           33
        7049:systemctl:0x7f28d6af813f     33

        # Translate the top addresses in ebizzy of interest
        compass:~ # addr2line -e ebizzy 0x4017ba
        ebizzy-0.3/ebizzy.c:413
        compass:~ # addr2line -e ebizzy 0x401320
        ebizzy-0.3/ebizzy.c:254
        compass:~ # addr2line -e ebizzy 0x40175a
        ebizzy-0.3/ebizzy.c:402

        compass:~ # sed -n '410,415p' /root/git-private/mmtests-test/work/testdisk/sources/ebizzy-0.3/ebizzy.c
                        else
                                memcpy(copy, src, copy_size);

                        key = rand_num(copy_size / record_size, state);

                        if (verbose > 2)
        compass:~ # sed -n '395,405p' /root/git-private/mmtests-test/work/testdisk/sources/ebizzy-0.3/ebizzy.c
                /*
                 * If we're doing random sizes, we need a non-zero
                 * multiple of record size.
                 */
                if (random_size)
                        copy_size = (rand_num(chunk_size / record_size, state)
                                     + 1) * record_size;
                copy = alloc_mem(copy_size);

                if ( touch_pages ) {
                        touch_mem((char *) copy, copy_size);

The cause of the faults is memory copies to a newly allocated, short-lived buffer that gets freed immediately, all of which takes place in search_mem(). It's obvious when you have the tools to point you in the right direction. Working around the problem after that is up to you really. strace reveals that the workload is dominated by calls to madvise, as this very crude example illustrates

        compass:~ # strace -f -e mmap,madvise ./ebizzy -S 2 -t 48 2>&1 | grep '(' | cut -d'(' -f1 | awk '{print $NF}' | sort | uniq -c
          17733 madvise
            161 mmap

Given infinite free time the options would be to alter ebizzy to be less allocator intensive so as to avoid these paths in the kernel altogether, or to modify madvise(MADV_DONTNEED) to not suck, but how to do either is beyond the scope of this blog.
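
A cheaper experiment, assuming it is glibc's arena trimming that issues the madvise calls, would be to tell the allocator to hang onto freed memory instead of returning it to the kernel. glibc's standard malloc environment tunables are enough for a quick test; a sketch only, not a recommendation:

        compass:~ # MALLOC_TRIM_THRESHOLD_=-1 MALLOC_TOP_PAD_=536870912 MALLOC_MMAP_MAX_=0 ./ebizzy -S 300 -t 48

Whether that actually removes the madvise traffic would need to be confirmed with the same strace one-liner as above.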

The key is that workload analysis is not impossible if you just spend the time to look for the available tools and then ask questions the tools can answer for you.

LSF/MM 2014 so far

The planning for the Linux Storage Filesystem and Memory Management Summit 2014 is progressing nicely. There has been a solid response to the Call For Proposals sent out in December, with a wide range of topics being sent to the lists. These include high-level issues such as the kernel APIs currently exposed to userspace and their suitability for efficient IO, storage technologies of different types with Shingled Magnetic Recording (SMR) being of particular interest, persistent memory and how it should be supported and utilised, supporting block sizes larger than the MMU page size, filesystem layouts for new storage technologies, numerous device mapper topics, VM scalability issues, memory compression; the list goes on. I'm looking forward very much to discussing topics related to database software requirements and what the kernel currently gets right and wrong. There was an exceptional response from the PostgreSQL community when the topic was raised and it looks very promising so far. As well as discussing the issue at LSF/MM, there should be a follow-up meeting at Collaboration Summit. Details about that meeting will be sent out when I manage to book a room to hold it in. Whatever else LSF/MM may be this year, it will not be boring or stuck for topics to talk about.

I am pleased to note that there are a number of new people sending in attend and topic mails. The long-term health of the community depends on new people getting involved and breaking through any perceived barrier to entry. It has certainly been the case for some time that there is more work to do in the memory manager than there are people available to do it, so it helps to know that there are new people on the way.

None of this would be possible without the people making the proposals and I’m glad that LSF/MM is looking strong again. If you have been planning on sending in an attend or topic request then there is still time to do it before the January 31st deadline. Get moving!

It would also not be possible without the continued support of sponsors without whom LSF/MM would be homeless or held in a temporary shed like the first FS/MM meeting was (cold, damp, biscuits tasted a little of despair). NetApp are continuing with Platinum support and IBM are our first Gold sponsor, thanks very much to both companies for helping us out. Other companies are expected to join us soon when the paperwork goes through and I look forward to welcoming them. If any companies are interested in sponsoring us then please feel free to get in touch with me (mgorman at suse.de) and I’ll see what I can do to help.

Here’s to a solid LSF/MM 2014!