StackOverflow2013

plurals

POPerformance Optimization for Matrix Rotation
primarykey
Id
2966723
data
AcceptedAnswerId
2966755
AnswerCount
3
ClosedDate
CommentCount
2
CommunityOwnedDate
2010-06-03T14:09:47.087
CreationDate
2010-06-03T14:09:20.283
FavoriteCount
1
LastActivityDate
2010-06-04T01:13:56.193
LastEditDate
2010-06-04T01:13:56.193
LastEditorUserId
258355
OwnerUserId
258355
ParentId
0
PostTypeId
1
Score
2
ViewCount
7132
LastEditorDisplayName
text
Body
I'm stuck on a performance optimization lab from the book "Computer Systems: A Programmer's Perspective". It is described as follows:

In an N*N matrix M, where N is a multiple of 32, the rotate operation can be represented as:
Transpose: interchange elements M(i,j) and M(j,i)
Exchange rows: Row i is exchanged with row N-1-i

An example of matrix rotation (N is 3 instead of 32 for simplicity):
<pre><code>-------                 -------
|1|2|3|                 |3|6|9|
-------                 -------
|4|5|6| after rotate is |2|5|8|
-------                 -------
|7|8|9|                 |1|4|7|
-------                 -------
</code></pre>
A naive implementation is:
<pre><code>#define RIDX(i,j,n) ((i)*(n)+(j))

void naive_rotate(int dim, pixel *src, pixel *dst)
{
    int i, j;

    for (i = 0; i < dim; i++)
        for (j = 0; j < dim; j++)
            dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
}
</code></pre>
My first idea was to unroll the inner loop. The results are:
<pre><code>Code Version      Speed Up
original          1x
unrolled by 2     1.33x
unrolled by 4     1.33x
unrolled by 8     1.55x
unrolled by 16    1.67x
unrolled by 32    1.61x
</code></pre>
I also found a code snippet on pastebin.com that seems to solve this problem:
<pre><code>void rotate(int dim, pixel *src, pixel *dst)
{
    int stride = 32;
    int count = dim >> 5;
    src += dim - 1;
    int a1 = count;
    do {
        int a2 = dim;
        do {
            int a3 = stride;
            do {
                *dst++ = *src;
                src += dim;
            } while(--a3);
            src -= dim * stride + 1;
            dst += dim - stride;
        } while(--a2);
        src += dim * (stride + 1);
        dst -= dim * dim - stride;
    } while(--a1);
}
</code></pre>
After reading the code carefully, I think the main idea of this solution is to treat each group of 32 rows as a data zone and to rotate each zone separately. The speedup of this version is 1.85x, which beats every loop-unrolled version.

Here are my questions:
<ol>
<li>In the inner-loop-unrolled version, why does the incremental gain shrink as the unrolling factor increases? In particular, going from 8 to 16 helps less than going from 4 to 8. Is this result related to the depth of the CPU pipeline? If so, could the diminishing gain reflect the pipeline length?</li>
<li>What is the probable reason the data-zone version is faster? It does not seem to differ in any essential way from the original naive version.</li>
</ol>
EDIT: My test environment is an Intel Centrino Duo, and the gcc version is 4.4.

Any advice will be highly appreciated! Kind regards!
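To make the comparison between the two versions easier to follow, here is a minimal sketch that rewrites the pastebin snippet with explicit indices instead of pointer arithmetic. It is only one reading of that code, not its author's formulation. The `pixel` struct is an assumption (the post never shows its definition), and `banded_rotate`, `BLOCK` and `images_equal` are hypothetical names introduced for illustration. Written this way, both versions perform the same assignments. The banded one only changes the order: for each band of 32 source rows it walks the columns from right to left, and its innermost loop fills 32 consecutive elements of a `dst` row.

```c
#include <assert.h>

/* Assumption: the post does not show the pixel type; a three-channel
 * struct is used here purely for illustration. */
typedef struct {
    unsigned short red, green, blue;
} pixel;

#define RIDX(i, j, n) ((i) * (n) + (j))
#define BLOCK 32 /* band height; plays the role of `stride` in the snippet */

/* Baseline from the post: reads src row by row, writes dst column by column. */
void naive_rotate(int dim, const pixel *src, pixel *dst)
{
    for (int i = 0; i < dim; i++)
        for (int j = 0; j < dim; j++)
            dst[RIDX(dim - 1 - j, i, dim)] = src[RIDX(i, j, dim)];
}

/* Index-based reading of the pastebin snippet (hypothetical name).
 * Same assignments as naive_rotate, visited in a different order. */
void banded_rotate(int dim, const pixel *src, pixel *dst)
{
    assert(dim > 0 && dim % BLOCK == 0);

    for (int i0 = 0; i0 < dim; i0 += BLOCK)          /* one band of BLOCK source rows */
        for (int j = dim - 1; j >= 0; j--)           /* source columns, right to left */
            for (int i = i0; i < i0 + BLOCK; i++)    /* consecutive slots in one dst row */
                dst[RIDX(dim - 1 - j, i, dim)] = src[RIDX(i, j, dim)];
}

/* Helper for checking results (hypothetical): 1 if both images match. */
int images_equal(int dim, const pixel *a, const pixel *b)
{
    for (int k = 0; k < dim * dim; k++)
        if (a[k].red != b[k].red || a[k].green != b[k].green || a[k].blue != b[k].blue)
            return 0;
    return 1;
}
```

Running both functions on the same input and comparing the outputs with `images_equal` is a quick way to confirm that a reordered version still computes the same rotation before timing it.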
Tags
<c><performance><optimization>
Title
Performance Optimization for Matrix Rotation
singulars
PostAcceptedAnswerId
1. PO
 singulars
 PostTypePostTypeId
 PTAnswer
PostParentId
1. This table or related slice is empty.
PostTypePostTypeId
1. PTQuestion
UserLastEditorUserId
1. USSummer_More_More_Tea
UserOwnerUserId
1. USSummer_More_More_Tea
plurals
PostLinksPostIdRelatedPostId
1. This table or related slice is empty.
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. This table or related slice is empty.
PostsParentIdCreationDate
1. PO
 singulars
 PostTypePostTypeId
 PTAnswer
2. PO
 singulars
 PostTypePostTypeId
 PTAnswer
3. PO
 singulars
 PostTypePostTypeId
 PTAnswer
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 POPerformance Optimization for Matrix Rotation
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
2. VO
 singulars
 PostPostId
 POPerformance Optimization for Matrix Rotation
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
3. VO
 singulars
 PostPostId
 POPerformance Optimization for Matrix Rotation
 UserUserId
 USzzyzif
 VoteTypeVoteTypeId
 VTFavorite
CommentsPostId
1. CORealistically it also helps to think about why you are doing this. Math optimizations start at simplifying the math expressions. Why is this operation needed?
 singulars
 PostPostId
 POPerformance Optimization for Matrix Rotation
 UserUserId
 USHamish Grubijan
2. COYou're right. But this problem, I think, has more to do with the system architecture than with math simplification.
 singulars
 PostPostId
 POPerformance Optimization for Matrix Rotation
 UserUserId
 USSummer_More_More_Tea
