Basic workload analysis

I recently ran into a problem where ebizzy performance was far lower than expected and bisection was identifying patches that did not make much sense. I suspected there was something wrong with the
machine but I was not sure what. I’m not going into the details of what patches were involved or why it mattered but I was told by some people that it is not always obvious how to start analysing a workload.

The problem is not lack of tools. There are a lot different ways that a workload can be analysed and which one is needed depends on the situation. This is not the only way ebizzy could have been analysed
to reach the conclusions I reached and it is not even necessarily the best way. This is just what I did this time around as a case study.

This is how I was running ebizzy on a machine with 48 logical CPUs.

        # ./ebizzy -S 300 -t 48

The workload is expected to be CPU intensive so in a second window I checked what the utilisation looked like.

        Linux 3.12.0-vanilla (compass)  28/01/14        x86_64        (48 CPU)

        12:11:32     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
        12:11:33     all    1.01    0.00   11.24    0.00    0.00    0.00    0.00    0.00   87.75
        12:11:34     all    0.76    0.00   11.25    0.00    0.00    0.00    0.00    0.00   87.99
        12:11:35     all    0.86    0.00   10.71    0.00    0.00    0.00    0.00    0.00   88.43
        12:11:36     all    0.78    0.00    9.84    0.00    0.00    0.00    0.00    0.00   89.38
        12:11:37     all    0.80    0.00   11.53    0.00    0.00    0.00    0.00    0.00   87.67
        12:11:38     all    0.86    0.00   12.25    0.00    0.00    0.00    0.00    0.00   86.89
        12:11:39     all    0.86    0.00   10.35    0.00    0.00    0.00    0.00    0.00   88.78

That is extremely high idle and system CPU time for a workload that is allegedly CPU intensive. Due to the high system CPU usage it made sense to see where in the kernel it was right now. Using ps -eL I found the thread pids and

compass:~ # cat /proc/4112/stack
[] futex_wait_queue_me+0xd6/0x140
[] futex_wait+0x177/0x280
[] do_futex+0xd6/0x640
[] SyS_futex+0x6c/0x150
[] system_call_fastpath+0x1a/0x1f
[] 0xffffffffffffffff
compass:~ # cat /proc/4113/stack
[] futex_wait_queue_me+0xd6/0x140
[] futex_wait+0x177/0x280
[] do_futex+0xd6/0x640
[] SyS_futex+0x6c/0x150
[] system_call_fastpath+0x1a/0x1f
[] 0xffffffffffffffff
compass:~ # cat /proc/4114/stack
[] futex_wait_queue_me+0xd6/0x140
[] futex_wait+0x177/0x280
[] do_futex+0xd6/0x640
[] SyS_futex+0x6c/0x150
[] system_call_fastpath+0x1a/0x1f
[] 0xffffffffffffffff

There is a definite pattern here. Threads are blocked on a userspace lock or locks of some description. There are tools that identify hot locks in userspace but it never hurts to see if it is something obvious first. gdb can be used to attach to a pid to see. It’s disruptive but in this case we do not really care so

        compass:~ # gdb -p 4114
        (gdb) bt
        #0  __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
        #1  0x00007f2332d4a598 in _L_lock_8036 () at malloc.c:5143
        #2  0x00007f2332d46f07 in malloc_check (sz=139789156697632, caller=) at hooks.c:260
        #3  0x0000000000401320 in alloc_mem (size=524288) at ebizzy.c:254
        #4  0x000000000040175a in search_mem () at ebizzy.c:402
        #5  0x00000000004018ff in thread_run (arg=0xa650ec) at ebizzy.c:448
        #6  0x00007f23330820db in start_thread (arg=0x7f2323124700) at pthread_create.c:309
        #7  0x00007f2332db290d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

I had the debugging symbols installed and compiled ebizzy with debugging symbols so I’d be able to get useful backtraces like this. The malloc_check set of alarm bells because in this day and age malloc is not going to be using a global lock unless something is screwed. As I was running OpenSUSE, it was easy to check what that was with

        compass:~ # sed -n '255,265p' /usr/src/debug/glibc-2.18/malloc/hooks.c
          if (sz+1 == 0) {
            __set_errno (ENOMEM);
            return NULL;
          }

          (void)mutex_lock(&main_arena.mutex);
          victim = (top_check() >= 0) ? _int_malloc(&main_arena, sz+1) : NULL;
          (void)mutex_unlock(&main_arena.mutex);
          return mem2mem_check(victim, sz);
        }

There is an obvious large lock right in the middle of the allocation path. A quick bit of digging around found that this is debugging code that is only enabled for malloc debugging and lo and behold

        compass:~ # echo $MALLOC_CHECK_
        3

This first problem was simple really. I had installed a beta version of openSUSE that installed the aaa_base-malloccheck package by default to catch bugs early. All the threads were sleeping contending on a process-wide lock as a result. Remove that package, start a new shell and

        compass:~ # mpstat 1
        Linux 3.12.0-vanilla (compass)  28/01/14        x86_64        (48 CPU)

        13:32:41     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
        13:32:42     all    7.04    0.00   92.96    0.00    0.00    0.00    0.00    0.00    0.00
        13:32:43     all    7.54    0.00   92.46    0.00    0.00    0.00    0.00    0.00    0.00
        13:32:44     all    7.04    0.00   92.96    0.00    0.00    0.00    0.00    0.00    0.00
        13:32:45     all    7.17    0.00   92.83    0.00    0.00    0.00    0.00    0.00    0.00

Ok, the CPU is no longer idle but that system time is stupidly high. Easiest way to get a quick view of the system is with “perf top”. Optionally with -G if you want to get callgraph data although in that case I would also recommend you use –freq to reduce the sample rate to avoid excessive system disruption

        compass:~ # perf top --freq 100 -G
        -  24.99%  [kernel]            [k] down_read_trylock
           - down_read_trylock
              + 99.96% __do_page_fault
        +  19.33%  [kernel]            [k] handle_mm_fault
        +  17.38%  [kernel]            [k] up_read
        +   7.29%  [kernel]            [k] find_vma
        +   6.76%  libc-2.18.so        [.] __memcpy_sse2_unaligned
        +   3.75%  [kernel]            [k] __mem_cgroup_commit_charge
        +   3.11%  [kernel]            [k] __do_page_fault
        +   2.97%  [kernel]            [k] clear_page_c
        +   1.25%  [kernel]            [k] call_function_interrupt
        +   1.15%  [kernel]            [k] __mem_cgroup_count_vm_event
        +   1.02%  [kernel]            [k] smp_call_function_many
        +   0.86%  [kernel]            [k] _raw_spin_lock_irqsave
        +   0.84%  [kernel]            [k] get_page_from_freelist
        +   0.66%  [kernel]            [k] mem_cgroup_charge_common
        +   0.57%  [kernel]            [k] page_fault
        +   0.51%  [kernel]            [k] release_pages
        +   0.51%  [kernel]            [k] unmap_page_range
        +   0.44%  [kernel]            [k] __mem_cgroup_uncharge_common
        +   0.40%  [kernel]            [k] __pagevec_lru_add_fn
        +   0.39%  [kernel]            [k] page_add_new_anon_rmap
        +   0.38%  [kernel]            [k] generic_smp_call_function_single_interrupt
        +   0.36%  [kernel]            [k] zone_statistics
        +   0.32%  [kernel]            [k] _raw_spin_lock
        +   0.29%  [kernel]            [k] __lru_cache_add
        +   0.29%  [kernel]            [k] native_write_msr_safe
        +   0.20%  [kernel]            [k] page_remove_rmap
        +   0.20%  [kernel]            [k] flush_tlb_func

There are two interesting points to note here. First, you do not need to be a kernel expert to see that there is a lot of fault activity going on here. Most of these functions are allocating pages, clearing them, updating page tables, LRU manipulations and so on. This could indicate there is a lot of buffer allocation/free activity going on with the workload. The second is that even though the faults are parallelised, they still hit very heavy on a mutex in the fault path — almost certainly mmap_sem in this case. Even though it’s taken for read, it must be bouncing like crazy to feature that high in the profile.

Anyway, the interesting question at this point is what is causing all the page faults. Sure, it’s probably ebizzy as the machine is otherwise idle and it’s small enough to figure out from a code inspection. However, the kernel has tracing and probing infrastructure for a reason and it should be possible to be lazy about it. In kernel 3.13 there is even a dedicated tracepoint for dealing with faults although right now this is on a 3.12 kernel. The options include dynamic probing, userspace stack tracing and a host of others but in this case the easiest by far is a systemtap script to hold it all together.

        global fault_address%, fault_eip%, user_trace%
        global nr_samples

        function userspace_eip:long () %{
                THIS->__retvalue = task_pt_regs(current)->ip;
        %}

        probe vm.pagefault {
                p = pid()
                key_address = sprintf("%d:%s:%p", p, execname(), address);
                key_eip     = sprintf("%d:%s:%p", p, execname(), userspace_eip());

                fault_address[key_address]++
                fault_eip[key_eip]++
                if (fault_eip[key_eip] == 100) {
                        user_trace[key_eip] = sprint_usyms(ubacktrace())
                }
                nr_samples++
        }

        probe timer.ms(5000) {
                printf("ndata address samplesn")
                foreach (key in fault_address- limit 5) {
                        printf ("%-30s%6dn", key, fault_address[key])
                }
                printf("eip samplesn")
                foreach (key in fault_eip- limit 5) {
                        printf ("%-30s%6dn", key, fault_eip[key])
                        if (fault_eip[key] >= 100) {
                                printf ("%sn", user_trace[key])
                        }
                }
                delete fault_address
                delete fault_eip
                delete user_trace
        }

Straight-forward enough, fire it up while ebizzy is running and after a few samples we get

        compass:~ # stap -d /lib64/libc-2.18.so -d /lib64/libpthread-2.18.so -d /lib64/ld-2.18.so -d ./ebizzy -DSTP_NO_OVERLOAD -DMAXSKIPPED=1000000 -g ./top-page-faults.stp
        eip samples
        6927:ebizzy:0x7f0772253006    118733
         0x7f0772253006 : __memcpy_sse2_unaligned+0xe6/0x220 [/lib64/libc-2.18.so]
         0x4017ba : search_mem+0xf5/0x1fe [/root/ebizzy]
         0x4018ff : thread_run+0x3c/0x96 [/root/ebizzy]
         0x7f077257a0db : start_thread+0xcb/0x300 [/lib64/libpthread-2.18.so]
         0x7f07722aa90d : clone+0x6d/0x90 [/lib64/libc-2.18.so]

        6927:ebizzy:0x7f077223e669      1245
         0x7f077223e669 : _int_malloc+0xaf9/0x1360 [/lib64/libc-2.18.so]
         0x7f077223fff9 : __libc_malloc+0x69/0x170 [/lib64/libc-2.18.so]
         0x401320 : alloc_mem+0x64/0xb8 [/root/ebizzy]
         0x40175a : search_mem+0x95/0x1fe [/root/ebizzy]
         0x4018ff : thread_run+0x3c/0x96 [/root/ebizzy]
         0x7f077257a0db : start_thread+0xcb/0x300 [/lib64/libpthread-2.18.so]
         0x7f07722aa90d : clone+0x6d/0x90 [/lib64/libc-2.18.so]

        7046:nis:0x7f4dc43844b7          104
         0x7f4dc43844b7 : _int_malloc+0x947/0x1360 [/lib64/libc-2.18.so]
         0x7f4dc4385ff9 : __libc_malloc+0x69/0x170 [/lib64/libc-2.18.so]
         0x475008 : 0x475008 [/bin/bash+0x75008/0x9c000]

        7047:nis:0x7f4dc43830da           33
        7049:systemctl:0x7f28d6af813f     33

        # Translate the top addresses in ebizzy of interest
        compass:~ # addr2line -e ebizzy 0x4017ba
        ebizzy-0.3/ebizzy.c:413
        compass:~ # addr2line -e ebizzy 0x401320
        ebizzy-0.3/ebizzy.c:254
        compass:~ # addr2line -e ebizzy 0x40175a
        ebizzy-0.3/ebizzy.c:402

        compass:~ # sed -n '410,415p' /root/git-private/mmtests-test/work/testdisk/sources/ebizzy-0.3/ebizzy.c
                        else
                                memcpy(copy, src, copy_size);

                        key = rand_num(copy_size / record_size, state);

                        if (verbose > 2)
        compass:~ # sed -n '395,405p' /root/git-private/mmtests-test/work/testdisk/sources/ebizzy-0.3/ebizzy.c
                /
                  If we're doing random sizes, we need a non-zero
                  multiple of record size.
                 /
                if (random_size)
                        copy_size = (rand_num(chunk_size / record_size, state)
                                     + 1)  record_size;
                copy = alloc_mem(copy_size);

                if ( touch_pages ) {
                        touch_mem((char ) copy, copy_size);

The causes of faults are memory copies to a newly allocation short-lived buffer that gets freed immediately and all this takes place in search_mem(). It’s obvious when you have the tools to point you in the right direction. Working around the problem after that is up to you really. strace would reveal that the workload is dominated by calls to madvise as this very crude example illustrates

        compass:~ # strace -f -e mmap,madvise ./ebizzy -S 2 -t 48 2>&1 | grep ( | cut -d( -f1 | awk '{print $NF}' | sort | uniq -c
          17733 madvise
            161 mmap

Given infinite free time the options would be to alter ebizzy to be less allocator intensive to avoid this paths in the kernel altogether or modify madvise(MADV_DONTNEED) to not suck but how to do so is beyond the scope of the block.

The key is that workload analysis is not impossible if you just spend time to look for the available tools and then ask questions the tools can answer then for you.