Following the other suggestions, using `p=np.asfortranarray(p)` before the timer indeed puts the performance on par with numpy when I tested it. I extended the range for the bivariate bench to `n_bi = np.array([2**i for i in xrange(1, 15)])`, so that the p matrix would be larger than my L3 cache.

To optimize this further, I don't think automatic compiler options will be much help, since the inner loop carries a dependency: only if you unroll it manually does `ifort` vectorize the innermost loop. With `gfortran`, `-O3` and `-ffast-math` were needed. For matrix sizes limited by main memory bandwidth, this increases the performance benefit over numpy from a factor of 1 to a factor of 3.
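To make that dependency concrete, here is a minimal illustration (my sketch, not part of the benchmarked code) of plain Horner evaluation with the same low-order-first coefficient convention. Each iteration needs the `pval` produced by the one before it, which is exactly what blocks automatic vectorization:

```fortran
! Illustration only: plain Horner evaluation.
! The recurrence pval = pval*x + p(i) depends on the previous
! iteration's pval, so the compiler cannot vectorize this loop.
subroutine polyval_plain(p, x, pval, nx)
    implicit none
    real*8, dimension(nx), intent(in) :: p
    real*8, intent(in) :: x
    real*8, intent(out) :: pval
    integer, intent(in) :: nx
    integer :: i

    pval = 0.0d0
    do i = nx, 1, -1  ! highest-order coefficient first
        pval = pval*x + p(i)
    end do
end subroutine polyval_plain
```

The unrolled version below breaks this chain by keeping `simd` independent partial sums, which the compiler can map to vector registers.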
**Update**: After applying this also to the univariate code and compiling with `f2py --opt='-O3 -ffast-math' -c -m polynomial polynomial.f90`, I get the following source, with the results from benchmark.py plotted below:

```fortran
subroutine polyval(p, x, pval, nx)
    implicit none
    real*8, dimension(nx), intent(in) :: p
    real*8, intent(in) :: x
    real*8, intent(out) :: pval
    integer, intent(in) :: nx
    integer, parameter :: simd = 8
    real*8 :: tmp(simd), vecx(simd), xfactor
    integer :: i, j, k

    ! precompute factors
    do i = 1, simd
        vecx(i) = x**(i-1)
    end do
    xfactor = x**simd

    tmp = 0.0d0
    do i = 1, nx, simd
        do k = 1, simd
            tmp(k) = tmp(k)*xfactor + p(nx-(i+k-1)+1)*vecx(simd-k+1)
        end do
    end do
    pval = sum(tmp)
end subroutine polyval

subroutine polyval2(p, x, y, pval, nx, ny)
    implicit none
    real*8, dimension(nx, ny), intent(in) :: p
    real*8, intent(in) :: x, y
    real*8, intent(out) :: pval
    integer, intent(in) :: nx, ny
    integer, parameter :: simd = 8
    real*8 :: tmp(simd), vecx(simd), xfactor
    integer :: i, j, k

    ! precompute factors
    do i = 1, simd
        vecx(i) = x**(i-1)
    end do
    xfactor = x**simd

    ! horner
    pval = 0.0d0
    do i = 1, ny
        tmp = 0.0d0
        do j = 1, nx, simd
            ! inner vectorizable loop
            do k = 1, simd
                tmp(k) = tmp(k)*xfactor + p(nx-(j+k-1)+1, ny-i+1)*vecx(simd-k+1)
            end do
        end do
        pval = pval*y + sum(tmp)
    end do
end subroutine polyval2
```

**Update 2**: As pointed out, this code is not correct, at least when the sizes are not divisible by `simd`. It only demonstrates the concept of manually helping the compiler, so don't use it as is. If the sizes are not powers of two, a small remainder loop has to take care of the dangling indices. That is not difficult to do; here is the corrected procedure for the univariate case, and extending it to the bivariate case should be straightforward (a sketch of that extension is at the end of this answer):

```fortran
subroutine polyval(p, x, pval, nx)
    implicit none
    real*8, dimension(nx), intent(in) :: p
    real*8, intent(in) :: x
    real*8, intent(out) :: pval
    integer, intent(in) :: nx
    integer, parameter :: simd = 4
    real*8 :: tmp(simd), vecx(simd), xfactor
    integer :: i, j, k, nr

    ! precompute factors
    do i = 1, simd
        vecx(i) = x**(i-1)
    end do
    xfactor = x**simd

    ! check remainder
    nr = mod(nx, simd)

    ! horner
    tmp = 0.0d0
    do i = 1, nx-nr, simd
        do k = 1, simd
            tmp(k) = tmp(k)*xfactor + p(nx-(i+k-1)+1)*vecx(simd-k+1)
        end do
    end do
    pval = sum(tmp)

    ! do remainder
    pval = pval * x**nr
    do i = 1, nr
        pval = pval + p(i) * vecx(i)
    end do
end subroutine polyval
```

![univariate](https://i.stack.imgur.com/QBN6R.png)

![bivariate](https://i.stack.imgur.com/rVYkn.png)

One should also be careful with very small sizes, as the time will be too small to give an accurate performance profile, and relative times with respect to numpy can be deceiving, since the absolute time with numpy may be very bad. So here are the timings for the largest case: for univariate with `nx=2**20`, the time is 1.21 s for numpy and 1.69e-3 s for the custom Fortran version; for bivariate with `nx*ny=2**20`, the time is 8e-3 s for numpy and 1.68e-3 s for the custom version. The fact that the univariate and bivariate times are the same for the same total size `nx*ny` is very important, as it supports the claim that the code is performing near the memory bandwidth limit.

**Update 3**: With the new python script for smaller sizes and `simd=4`, I get the following performance:

![univariate](https://i.stack.imgur.com/0Zz2t.png)

![bivariate](https://i.stack.imgur.com/Xdxkf.png)

**Update 4**: As for correctness, the results agree to within double precision accuracy, which you can see by running this python code for the univariate example:

```python
import polynomial as P
import numpy.polynomial.polynomial as PP
import numpy as np

for n in xrange(2, 100):
    poly1n = np.random.rand(n)
    poly1f = np.asfortranarray(poly1n)
    x = 2
    print "%18.14e" % P.polyval(poly1f, x)
    print "%18.14e" % PP.polyval(x, poly1n)
    print (P.polyval(poly1f, x) - PP.polyval(x, poly1n)) / PP.polyval(x, poly1n), '\n'
```
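As mentioned in Update 2, here is one way the remainder handling could carry over to the bivariate case. Treat it as an untested sketch following the same column-major layout and low-order-first coefficient convention as the routines above, not as a benchmarked drop-in replacement:

```fortran
! Sketch: bivariate Horner with manual unrolling plus a remainder
! loop, so nx no longer has to be divisible by simd.
subroutine polyval2(p, x, y, pval, nx, ny)
    implicit none
    real*8, dimension(nx, ny), intent(in) :: p
    real*8, intent(in) :: x, y
    real*8, intent(out) :: pval
    integer, intent(in) :: nx, ny
    integer, parameter :: simd = 4
    real*8 :: tmp(simd), vecx(simd), xfactor, xnr, colval
    integer :: i, j, k, nr

    ! precompute factors
    do i = 1, simd
        vecx(i) = x**(i-1)
    end do
    xfactor = x**simd

    ! remainder of the x-direction that does not fill a simd chunk
    nr = mod(nx, simd)
    xnr = x**nr

    ! horner in y; each step evaluates one x-polynomial (column ny-i+1)
    pval = 0.0d0
    do i = 1, ny
        ! unrolled, vectorizable main part of the column
        tmp = 0.0d0
        do j = 1, nx-nr, simd
            do k = 1, simd
                tmp(k) = tmp(k)*xfactor + p(nx-(j+k-1)+1, ny-i+1)*vecx(simd-k+1)
            end do
        end do
        colval = sum(tmp) * xnr
        ! remainder loop for the dangling low-order coefficients
        do j = 1, nr
            colval = colval + p(j, ny-i+1) * vecx(j)
        end do
        pval = pval*y + colval
    end do
end subroutine polyval2
```

The extra work per column is only `nr < simd` multiply-adds, so for large `nx` this should not change the bandwidth-limited behaviour discussed above.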