Benchmarking (python vs. c++ using BLAS) and (numpy)
<p>I would like to write a program that makes extensive use of BLAS and LAPACK linear algebra functionality. Since performance is an issue, I did some benchmarking and would like to know whether the approach I took is legitimate.</p>
<p>I have, so to speak, three contestants and want to test their performance with a simple matrix-matrix multiplication. The contestants are:</p>
<ol>
<li>Numpy, making use only of the functionality of <code>dot</code>.</li>
<li>Python, calling the BLAS functionality through a shared object.</li>
<li>C++, calling the BLAS functionality through a shared object.</li>
</ol>
<h2>Scenario</h2>
<p>I implemented a matrix-matrix multiplication for different dimensions <code>i</code>. <code>i</code> runs from 5 to 500 with an increment of 5, and the matrices <code>m1</code> and <code>m2</code> are set up like this:</p>
<pre><code>m1 = numpy.random.rand(i,i).astype(numpy.float32)
m2 = numpy.random.rand(i,i).astype(numpy.float32)
</code></pre>
<h2>1. Numpy</h2>
<p>The code used looks like this:</p>
<pre><code>tNumpy = timeit.Timer("numpy.dot(m1, m2)", "import numpy; from __main__ import m1, m2")
rNumpy.append((i, tNumpy.repeat(20, 1)))
</code></pre>
<h2>2. Python, calling BLAS through a shared object</h2>
<p>With the function</p>
<pre><code>import ctypes
from ctypes import byref, c_char, c_int, c_float

_blaslib = ctypes.cdll.LoadLibrary("libblas.so")

def Mul(m1, m2, i, r):
    no_trans = c_char(b"n")  # bytes literal, needed on Python 3
    n = c_int(i)
    one = c_float(1.0)
    zero = c_float(0.0)

    # Fortran calling convention: every argument is passed by reference
    _blaslib.sgemm_(byref(no_trans), byref(no_trans), byref(n), byref(n), byref(n),
                    byref(one), m1.ctypes.data_as(ctypes.c_void_p), byref(n),
                    m2.ctypes.data_as(ctypes.c_void_p), byref(n), byref(zero),
                    r.ctypes.data_as(ctypes.c_void_p), byref(n))
</code></pre>
<p>the test code looks like this:</p>
<pre><code>r = numpy.zeros((i,i), numpy.float32)
tBlas = timeit.Timer("Mul(m1, m2, i, r)", "import numpy; from __main__ import i, m1, m2, r, Mul")
rBlas.append((i, tBlas.repeat(20, 1)))
</code></pre>
<h2>3. c++, calling BLAS through a shared object</h2>
<p>The c++ code is naturally a little longer, so I reduce the information to a minimum.<br> I load the function with</p>
<pre><code>void* handle = dlopen("libblas.so", RTLD_LAZY);
void* Func = dlsym(handle, "sgemm_");
</code></pre>
<p>I measure the time with <code>gettimeofday</code> like this:</p>
<pre><code>gettimeofday(&amp;start, NULL);
f(&amp;no_trans, &amp;no_trans, &amp;dim, &amp;dim, &amp;dim, &amp;one, A, &amp;dim, B, &amp;dim, &amp;zero, Return, &amp;dim);
gettimeofday(&amp;end, NULL);
dTimes[j] = CalcTime(start, end);
</code></pre>
<p>where <code>j</code> is the index of a loop running 20 times. I calculate the time passed with</p>
<pre><code>double CalcTime(timeval start, timeval end)
{
    double factor = 1000000;
    return (((double)end.tv_sec) * factor + ((double)end.tv_usec) - (((double)start.tv_sec) * factor + ((double)start.tv_usec))) / factor;
}
</code></pre>
<h2>Results</h2>
<p>The result is shown in the plot below:</p>
<p><img src="https://i.stack.imgur.com/6Yauw.png" alt="enter image description here"></p>
<h2>Questions</h2>
<ol>
<li>Do you think my approach is fair, or are there some unnecessary overheads I can avoid?</li>
<li>Would you expect the result to show such a huge discrepancy between the c++ and python approaches? Both are using shared objects for their calculations.</li>
<li>Since I would rather use python for my program, what could I do to increase the performance when calling BLAS or LAPACK routines?</li>
</ol>
<h2>Download</h2>
<p>The complete benchmark can be downloaded <a href="https://github.com/zed/woltan-benchmark/" rel="noreferrer">here</a>. (J.F. Sebastian made that link possible^^)</p>
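<p>For comparing the contestants, the lists collected above (one <code>(i, timings)</code> pair per dimension, as returned by <code>Timer.repeat</code>) could be reduced to one number per size. The following is only a sketch, not part of the original benchmark: the helper name <code>summarize</code> and the sample timings are hypothetical, and it assumes the usual estimate of roughly 2·i³ floating-point operations for an i×i matrix product. It takes the minimum of the repeats, which the <code>timeit</code> documentation suggests is the most meaningful figure.</p>

```python
def summarize(results):
    """Reduce (size, [timings]) pairs to (size, best_time, gflops).

    `results` is assumed to have the same shape as rNumpy / rBlas:
    one entry per matrix dimension, each holding the list of
    timings (in seconds) returned by timeit.Timer.repeat().
    """
    summary = []
    for n, timings in results:
        best = min(timings)       # run least disturbed by other processes
        flops = 2.0 * n ** 3      # approximate operation count for an n x n product
        gflops = flops / best / 1e9 if best > 0 else float("inf")
        summary.append((n, best, gflops))
    return summary


if __name__ == "__main__":
    # Made-up timings, for illustration only (not measured data).
    sample = [(100, [0.004, 0.002, 0.003]), (200, [0.020, 0.016])]
    for n, best, gflops in summarize(sample):
        print(n, best, round(gflops, 3))
```

<p>Reporting GFLOP/s rather than raw seconds makes small and large dimensions comparable on one axis, so a fixed per-call overhead shows up as a low rate at small <code>i</code>.</p>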