Understand whether code sample is CPU bound or Memory bound
<p>As a general question to those working on optimization and performance tuning of programs: how do you figure out whether your code is CPU bound or memory bound? I understand these concepts in general, but if I have, say, 'y' loads and stores and '2y' computations, how does one go about finding the bottleneck?</p>
<p>Also, can you figure out exactly where you are spending most of your time and say that, if the code is memory bound, loading 'x' amount of data into cache in every loop iteration will make it run faster? Is there any precise way to determine this 'x', other than trial and error?</p>
<p>Are there any tools you would use, say on the IA-32 or IA-64 architecture? Does VTune help?</p>
<p>For example, I'm currently doing the following:</p>
<p>I have 26 8*8 matrices of complex doubles and I have to perform an MVM (matrix-vector multiplication) with ~4000 vectors of length 8 for each of these 26 matrices. I use SSE to perform the complex multiplication.</p>
<pre><code>/* Copy 26 matrices to temporary storage */
for (int i = 0; i &lt; 4000; i += 2) {      // Loop over the 4000 vectors
    for (int k = 0; k &lt; 26; k++) {       // Loop over the 26 matrices
        /* Perform MVM in blocks of '2' between kth matrix and 'i' and 'i+1' vector */
    }
}
</code></pre>
<p>The 26 matrices take 26 KB (the L1 cache is 32 KB), and I have laid the vectors out in memory so that I have stride-1 accesses. Once I have performed the MVM on a vector with the 26 matrices, I don't visit that vector again, so I don't think cache blocking will help. I have used vectorization, but I'm still stuck at 60% of peak performance.</p>
<p>I tried copying, say, 64 vectors into temporary storage on every iteration of the outer loop, thinking they would be in cache and help, but it has only decreased performance. I also tried using _mm_prefetch() in the following way: when I am done with about half the matrices, I prefetch the next 'i' and 'i+1' vectors, but that hasn't helped either.</p>
<p>I have done all this assuming it's memory bound, but I want to know for sure. Is there a way?</p>