
plurals
  1. USnjuffa
    1. Use of FMA may increase register pressure slightly, because three source operands must be available at the same time. Turning FMA generation on or off can therefore lead to small differences in instruction scheduling and register allocation, which in turn can lead to small performance differences. For a compute-bound kernel with many multiply-add idioms, -fmad=true should make a significant performance difference, but as you say, your kernel is dominated by multiplies and thus will benefit little from use of FMA, and any gains may be offset by the register pressure and instruction scheduling aspects.
    2. CUDA 4.2 Programming Guide, section 5.3.2.1 Global Memory: Global memory resides in device memory and device memory is accessed via 32-, 64-, or 128-byte memory transactions. These memory transactions must be naturally aligned: Only the 32-, 64-, or 128-byte segments of device memory that are aligned to their size (i.e. whose first address is a multiple of their size) can be read or written by memory transactions.
    3. Note that an occupancy of 0.33 is not bad; there should be enough threads to cover memory latency. To achieve higher occupancy, it may be a good idea to reduce the block size to 128 threads, which should enable 640 threads (5 blocks of 128 threads each) to run concurrently on each SM. Once the basic latencies are covered, there is no strong correlation between occupancy and performance.
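
The occupancy arithmetic in the comment above can be sketched as follows. This is only an illustration: the limit of 1536 resident threads per SM is an assumption (it is the limit for compute capability 2.x devices, but the comment does not say which GPU is in use), and the 512-thread baseline is a hypothetical configuration chosen because it reproduces the quoted 0.33.

```python
# Assumption: 1536 resident threads per SM (compute capability 2.x).
# The source comment does not state the actual device.
MAX_THREADS_PER_SM = 1536


def occupancy(blocks_per_sm, threads_per_block,
              max_threads_per_sm=MAX_THREADS_PER_SM):
    """Return (resident threads, occupancy) for one SM."""
    resident = blocks_per_sm * threads_per_block
    return resident, resident / max_threads_per_sm


# Hypothetical baseline: one 512-thread block per SM gives roughly 0.33.
base_resident, base_occ = occupancy(blocks_per_sm=1, threads_per_block=512)

# Configuration suggested in the comment: 5 blocks of 128 threads each.
resident, occ = occupancy(blocks_per_sm=5, threads_per_block=128)

print(base_resident, round(base_occ, 2))  # 512 0.33
print(resident, round(occ, 2))            # 640 0.42
```

Under these assumptions the change raises occupancy from about 0.33 to about 0.42, a modest gain that is consistent with the remark that occupancy matters little once latency is covered.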
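
The natural-alignment rule quoted from the Programming Guide in the second comment can be illustrated with a little address arithmetic. This is a minimal sketch: the function names and the example addresses are made up for illustration, and it only counts segments of one chosen size, whereas real hardware selects among the 32-, 64-, and 128-byte sizes depending on the architecture and access pattern.

```python
def is_naturally_aligned(address, size):
    """A segment is naturally aligned when its first address is a multiple of its size."""
    return address % size == 0


def segments_touched(start, num_bytes, segment_size):
    """Base addresses of the aligned segments overlapping [start, start + num_bytes)."""
    first = (start // segment_size) * segment_size                    # round start down
    last = ((start + num_bytes - 1) // segment_size) * segment_size   # segment of last byte
    return list(range(first, last + segment_size, segment_size))


# A warp of 32 threads, each reading one 4-byte word: 128 contiguous bytes.
aligned = segments_touched(start=0, num_bytes=128, segment_size=128)
misaligned = segments_touched(start=4, num_bytes=128, segment_size=128)

print(aligned)     # [0]
print(misaligned)  # [0, 128]
```

Shifting the same 128-byte access by only 4 bytes makes it straddle two 128-byte segments, which is why the starting address of an array access pattern can matter for memory throughput.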
 
