blog


:(){ :|:& } ;:

The Inner Thoughts of a C Developer.


“Testing Memory I/O Bandwidth”

2013-10-16

Very frequently at my company, we find ourselves pushing our hardware to its limit. Usually we are able to dig in and find some optimizations that we may have missed before to squeeze some extra performance out of our products. This time started out a little different though. The issue we were seeing did not seem to be caused by CPU speed or memory capacity, but I/O bandwidth.

In an attempt to quadruple the output of one of our products, we hit a hard wall when running at the peak stress level. The developer on the project Stephen and I began brain storming on what the issue might be and why we were hitting a cap. All of the specs on paper seem to indicate we had more then enough machine to get the job done.

In order to start diagnosing our problem, we looked to a program called mbw which is a memory bandwidth benchmark tool. We installed it from the Ubuntu repository using sudo apt-get install mbw. As we found out later, this installed version 1.1.1 of this software (and yes this is important… keep reading). Running the software is easy. The simplest option is to just pass in an array size (in MB). For brevity I am only showing the average results, instead of all results.

$ mbw 32 | grep AVG
AVG     Method: MEMCPY  Elapsed: 0.00600  MiB: 32.00000   Copy: 5332.889 MiB/s
AVG     Method: DUMB    Elapsed: 0.00422   MiB: 32.00000   Copy: 7589.413 MiB/s
AVG     Method: MCBLOCK Elapsed: 0.00164  MiB: 32.00000   Copy: 19465.904 MiB/s

Another useful option is to specify the size of the “block” to use in the MCBLOCK test. To specify this option you can use the -b flag.

$ mbw -b 4096 32 | grep AVG
AVG     Method: MEMCPY  Elapsed: 0.00589  MiB: 32.00000   Copy: 5428.421 MiB/s
AVG     Method: DUMB    Elapsed: 0.00421   MiB: 32.00000   Copy: 7598.062 MiB/s
AVG     Method: MCBLOCK Elapsed: 0.00064  MiB: 32.00000   Copy: 50172.468 MiB/s

Woah! Hold the phone! Do you notice something about these results? Why is the MCBLOCK test a whole order of magnitude faster then the MEMCPY test? This made Stephen and my jaws drop. What was being done to get this much throughput? This is where the story really begins.

This first thing we did was grab a copy of the source. The first source that we found was for version 1.2.2 (At this time we didn’t know it was different then the version we had installed). We started digging through the source and found the worker function that performed the three tests.

if(type==1) { /* memcpy test */
  /* timer starts */
  gettimeofday(&starttime, NULL);
  memcpy(b, a, array_bytes);
  /* timer stops */
  gettimeofday(&endtime, NULL);
} 
else if(type==2) { /* memcpy block test */
  gettimeofday(&starttime, NULL);
  for(t=0; t<array_bytes; t+=block_size) {
      b=mempcpy(b, a, block_size);
  }
  if(t>array_bytes) {
      b=mempcpy(b, a, t-array_bytes);
  }
  gettimeofday(&endtime, NULL);
} 
else { /* dumb test */
  gettimeofday(&starttime, NULL);
  for(t=0; t<asize; t++) {
      b[t]=a[t];
  }
  gettimeofday(&endtime, NULL);
}

This is the code snippet from worker() in mbw.c on line 92. The first thing we discovered that that the MCBLOCK test was using the mempcpy() function. I had never used the mempcpy function before so I was intrigued! Of course the mystery only deepened when we looked at the mempcpy man page.

The mempcpy() function is nearly identical to the memcpy(3) function. It copies n bytes from the object beginning at src into the object pointed to by dest. But instead of returning the value of dest it returns a pointer to the byte following the last written byte.

mempcpy(3) man page

They really weren’t kidding with the “nearly identical” part either. As soon as I dug into the glibc source code, it became very apparent something strange was going on.

void *
__mempcpy (void *dest, const void *src, size_t len)
{
  return memcpy (dest, src, len) + len;
}
libc_hidden_def (__mempcpy)
weak_alias (__mempcpy, mempcpy)
libc_hidden_builtin_def (mempcpy)

So why was the mempcpy() code running so much faster than the memcpy() code if one is simply calling the other? The answer would soon surface! The next thing we did is compile the 1.2.2 source that we downloaded and ran it. To our amazement we were getting much lower bandwidth for what seemed like no reason.

$ ./mbw-1.2.2 -b 4096 8 | grep AVG
AVG     Method: MEMCPY  Elapsed: 0.00292  MiB: 8.00000    Copy: 2743.861 MiB/s
AVG     Method: DUMB    Elapsed: 0.00116  MiB: 8.00000    Copy: 6871.081 MiB/s
AVG     Method: MCBLOCK Elapsed: 0.00098  MiB: 8.00000    Copy: 8145.810 MiB/s

We didn’t understand, we didn’t change anything, we simply compiled and executed the code, and yet the Ubuntu package version was reporting huge bandwidths, and this version was not. I started to suspect that the version in the repo was different somehow, and I was right! We ran apt-get source mbw, and sure enough we got version 1.1.1. Running a diff between these two files showed that the MCBLOCK test was updated.

/* in version 1.1.1 */
for(t=0; t<array_bytes; t+=block_size) {
   c=mempcpy(b,a,block_size); 
}

/* in version 1.2.2 */
for(t=0; t<array_bytes; t+=block_size) {
   b=mempcpy(b,a,block_size); 
}

Well, that solves that mystery! The issue was that, in version 1.1.1 (installed by apt-get) , the program was writing the same block_size chunk of memory over and over causing heavy cache hits and speed up. This new version properly advances the destination pointer, thus eliminating the cache hits and lowering the bandwidth measurements.

Now, does anything else stand out about the 1.2.2 code? Well if you guessed that the source pointer was not being advanced, you would be correct! So these numbers were still a bit off. After making the correction we got much more consistent measurements.

/* in the corrected version (now 1.3.0) */
char* aa = (char*)a;
char* bb = (char*)b;
gettimeofday(&starttime, NULL);
for (t=array_bytes; t >= block_size; t-=block_size, aa+=block_size){
   bb=mempcpy(bb, aa, block_size);
}
if(t) {
   bb=mempcpy(bb, aa, t);
}
gettimeofday(&endtime, NULL);
$ ./mbw-1.3.0 -b 4096 8 | grep AVG
AVG     Method: MEMCPY  Elapsed: 0.00288  MiB: 8.00000    Copy: 2778.067 MiB/s
AVG     Method: DUMB    Elapsed: 0.00113  MiB: 8.00000    Copy: 7107.952 MiB/s
AVG     Method: MCBLOCK Elapsed: 0.00166  MiB: 8.00000    Copy: 4817.246 MiB/s

I am happy to report that these changes were merged into the main line release at the raas/mbw github page. So if you are going to use mbw for benchmarking your memory throughput I highly recommend you use the new 1.3.0 version.

If you have a multi-cpu system and want to see what your total average throughput is you can use this script below. It will detect how many processors you have and spawn the matching number of mbw instances. It will then sum up the average measurements. Feel free to modify it as needed.

#! /usr/bin/env bash
# This will run an mbw instance for each core on the machine

NUMCORES=$(grep "processor" /proc/cpuinfo | wc -l)
TMP="/tmp/mbw_result_tmp"

echo "Starting test on $NUMCORES cores"
for (( i=0; i<$NUMCORES; i++ )); do
   mbw -b 4096 32 -n 100 > ${TMP}_${i} & 
done

echo "Waiting for tests to finish"
wait

MEMCPY_RESULTS=()
DUMB_RESULTS=()
MCBLOCK_RESULTS=()

for (( i=0; i<$NUMCORES; i++ )); do
   MEMCPY_RESULTS[$i]=`grep -E "AVG.*MEMCPY" ${TMP}_${i} | \
      tr "[:blank:]" " " | cut -d " " -f 9`

   DUMB_RESULTS[$i]=`grep -E "AVG.*DUMB" ${TMP}_${i} | \
      tr "[:blank:]" " " | cut -d " " -f 9`

   MCBLOCK_RESULTS[$i]=`grep -E "AVG.*MCBLOCK" ${TMP}_${i} | \
      tr "[:blank:]" " " | cut -d " " -f 9`
done

MEMCPY_SUM=0
DUMB_SUM=0
MCBLOCK_SUM=0

# Need to use `bc` because of floating point numbers
for (( i=0; i<$NUMCORES; i++ )); do
   MEMCPY_SUM=`echo "$MEMCPY_SUM + ${MEMCPY_RESULTS[$i]}" | bc -q`
   DUMB_SUM=`echo "$DUMB_SUM + ${DUMB_RESULTS[$i]}" | bc -q`
   MCBLOCK_SUM=`echo "$MCBLOCK_SUM + ${MCBLOCK_RESULTS[$i]}" | bc -q`
done

echo "MEMCPY Total AVG: $MEMCPY_SUM MiB/s"
echo "DUMB Total AVG: $DUMB_SUM MiB/s"
echo "MCBLOCK Total AVG: $MCBLOCK_SUM MiB/s"

Using the 1.3.0 version of mbw, as well as some good old fashion detective work, we were able to find the perfect combination of software and hardware optimizations to push our product to the next level. It is still in early beta, but hopefully in a few months it will be finalized and I can release more details!


comments powered by Disqus