StackOverflow2013

plurals

POBenchmarking (python vs. c++ using BLAS) and (numpy)
primarykey
Id
7596612
data
AcceptedAnswerId
7614252
AnswerCount
4
ClosedDate
CommentCount
10
CommunityOwnedDate
CreationDate
2011-09-29T11:23:25.333
FavoriteCount
54
LastActivityDate
2014-07-31T06:46:33.997
LastEditDate
2011-10-21T11:42:06.297
LastEditorUserId
572616
OwnerUserId
572616
ParentId
0
PostTypeId
1
Score
99
ViewCount
31108
LastEditorDisplayName
text
Body
<p>I would like to write a program that makes extensive use of BLAS and LAPACK linear algebra functionality. Since performance is an issue, I did some benchmarking and would like to know whether the approach I took is legitimate.</p>
<p>I have, so to speak, three contestants and want to test their performance with a simple matrix-matrix multiplication. The contestants are:</p>
<ol>
<li>Numpy, making use only of the functionality of <code>dot</code>.</li>
<li>Python, calling the BLAS functionality through a shared object.</li>
<li>C++, calling the BLAS functionality through a shared object.</li>
</ol>
<h2>Scenario</h2>
<p>I implemented a matrix-matrix multiplication for different dimensions <code>i</code>. <code>i</code> runs from 5 to 500 with an increment of 5, and the matrices <code>m1</code> and <code>m2</code> are set up like this:</p>
<pre><code>m1 = numpy.random.rand(i,i).astype(numpy.float32)
m2 = numpy.random.rand(i,i).astype(numpy.float32)
</code></pre>
<h2>1. Numpy</h2>
<p>The code used looks like this:</p>
<pre><code>tNumpy = timeit.Timer("numpy.dot(m1, m2)", "import numpy; from __main__ import m1, m2")
rNumpy.append((i, tNumpy.repeat(20, 1)))
</code></pre>
<h2>2. Python, calling BLAS through a shared object</h2>
<p>With the function</p>
<pre><code>import ctypes
from ctypes import byref, c_char, c_int, c_float

_blaslib = ctypes.cdll.LoadLibrary("libblas.so")

def Mul(m1, m2, i, r):
    # Fortran BLAS takes every argument by reference, hence the byref() calls.
    no_trans = c_char(b"n")  # a bytes literal works on both Python 2.6+ and Python 3
    n = c_int(i)
    one = c_float(1.0)
    zero = c_float(0.0)

    _blaslib.sgemm_(byref(no_trans), byref(no_trans), byref(n), byref(n), byref(n),
                    byref(one), m1.ctypes.data_as(ctypes.c_void_p), byref(n),
                    m2.ctypes.data_as(ctypes.c_void_p), byref(n),
                    byref(zero), r.ctypes.data_as(ctypes.c_void_p), byref(n))
</code></pre>
<p>the test code looks like this:</p>
<pre><code>r = numpy.zeros((i,i), numpy.float32)
tBlas = timeit.Timer("Mul(m1, m2, i, r)", "import numpy; from __main__ import i, m1, m2, r, Mul")
rBlas.append((i, tBlas.repeat(20, 1)))
</code></pre>
<h2>3. C++, calling BLAS through a shared object</h2>
<p>The C++ code is naturally a little longer, so I reduce the information to a minimum.<br> I load the function with</p>
<pre><code>void* handle = dlopen("libblas.so", RTLD_LAZY);
void* Func = dlsym(handle, "sgemm_");
</code></pre>
<p>I measure the time with <code>gettimeofday</code> like this:</p>
<pre><code>gettimeofday(&start, NULL);
f(&no_trans, &no_trans, &dim, &dim, &dim, &one, A, &dim, B, &dim, &zero, Return, &dim);
gettimeofday(&end, NULL);
dTimes[j] = CalcTime(start, end);
</code></pre>
<p>where <code>j</code> is the index of a loop running 20 times. I calculate the elapsed time in seconds with</p>
<pre><code>double CalcTime(timeval start, timeval end)
{
    double factor = 1000000;
    return (((double)end.tv_sec) * factor + ((double)end.tv_usec) - (((double)start.tv_sec) * factor + ((double)start.tv_usec))) / factor;
}
</code></pre>
<h2>Results</h2>
<p>The result is shown in the plot below: </p>
<p><img src="https://i.stack.imgur.com/6Yauw.png" alt="enter image description here"></p>
<h2>Questions</h2>
<ol>
<li>Do you think my approach is fair, or are there some unnecessary overheads I can avoid?</li>
<li>Would you expect the result to show such a huge discrepancy between the C++ and Python approaches? Both use shared objects for their calculations.</li>
<li>Since I would rather use Python for my program, what could I do to increase the performance when calling BLAS or LAPACK routines?</li>
</ol>
<h2>Download</h2>
<p>The complete benchmark can be downloaded <a href="https://github.com/zed/woltan-benchmark/" rel="noreferrer">here</a>. (J.F. Sebastian made that link possible^^)</p>
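<p>The Python tests above store all 20 repeated timings per dimension as <code>(i, times)</code> pairs. A minimal sketch of one way to reduce such lists to a single figure per dimension follows. It assumes the minimum is the figure of interest, since the smallest of the repeats is the one least disturbed by other activity on the machine. The helper names <code>collect</code> and <code>best_times</code> and the pure-Python placeholder workload are hypothetical and stand in for the real statements such as <code>numpy.dot(m1, m2)</code>.</p>

```python
import timeit


def collect(stmt_template, sizes, repeat=20, number=1):
    """Time a statement for each size.

    Returns [(size, [t_1, ..., t_repeat]), ...], the same shape as the
    rNumpy / rBlas lists built in the question.
    """
    results = []
    for i in sizes:
        timer = timeit.Timer(stmt_template.format(i=i))
        results.append((i, timer.repeat(repeat, number)))
    return results


def best_times(results):
    """Reduce each list of repeated timings to its minimum."""
    return [(i, min(times)) for i, times in results]


# Placeholder workload (an assumption for illustration only): a pure-Python
# sum instead of a matrix product, so the sketch runs without numpy or BLAS.
sizes = range(5, 55, 5)
raw = collect("sum(x * x for x in range({i}))", sizes, repeat=5)
best = best_times(raw)

for i, t in best:
    print("i = %3d   best of 5: %.3e s" % (i, t))
```

<p>Applied to the benchmark, the same reduction would be run over <code>rNumpy</code> and <code>rBlas</code>, and over <code>dTimes</code> on the C++ side, so that all three contestants are summarised the same way before plotting.</p>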
Tags
<c++><python><numpy><benchmarking><blas>
Title
Benchmarking (python vs. c++ using BLAS) and (numpy)
singulars
PostAcceptedAnswerId
1. PO
  singulars
  PostTypePostTypeId
  PTAnswer
PostParentId
1. This table or related slice is empty.
PostTypePostTypeId
1. PTQuestion
UserLastEditorUserId
1. USWoltan
UserOwnerUserId
1. USWoltan
plurals
PostLinksPostIdRelatedPostId
1. This table or related slice is empty.
PostLinksRelatedPostIdPostId
1. PL
  singulars
  LinkTypeLinkTypeId
  LTLinked
2. PL
  singulars
  LinkTypeLinkTypeId
  LTLinked
3. PL
  singulars
  LinkTypeLinkTypeId
  LTLinked
PostsAcceptedAnswerId
1. This table or related slice is empty.
PostsParentIdCreationDate
1. PO
  singulars
  PostTypePostTypeId
  PTAnswer
2. PO
  singulars
  PostTypePostTypeId
  PTAnswer
3. PO
  singulars
  PostTypePostTypeId
  PTAnswer
VotesPostIdCreationDate
1. VO
  singulars
  PostPostId
  POBenchmarking (python vs. c++ using BLAS) and (numpy)
  UserUserId
  This table or related slice is empty.
  VoteTypeVoteTypeId
  VTUpMod
2. VO
  singulars
  PostPostId
  POBenchmarking (python vs. c++ using BLAS) and (numpy)
  UserUserId
  This table or related slice is empty.
  VoteTypeVoteTypeId
  VTUpMod
3. VO
  singulars
  PostPostId
  POBenchmarking (python vs. c++ using BLAS) and (numpy)
  UserUserId
  This table or related slice is empty.
  VoteTypeVoteTypeId
  VTUpMod
CommentsPostId
