Note that there are some explanatory texts on larger screens.

plurals
  1. PO
    primarykey
    data
    text
    <p>A vector type should be compiled into registers <em>if</em> there is available registers to do so. Turning your snippet into something that will survive dead code removal:</p> <pre><code>__global__ void func(ushort4 *out) { ushort4 result = make_ushort4(__float2half_rn(0.5), __float2half_rn(0.5), __float2half_rn(0.5), __float2half_rn(1.0)); out[threadIdx.x+blockDim.x*blockIdx.x] = result; } </code></pre> <p>and compiling it:</p> <pre><code>&gt;nvcc -cubin -arch=sm_20 -Xptxas="-v" ushort4.cu ushort4.cu ushort4.cu tmpxft_000010b8_00000000-3_ushort4.cudafe1.gpu tmpxft_000010b8_00000000-10_ushort4.cudafe2.gpu ptxas info : Compiling entry function '_Z4funcP7ushort4' for 'sm_20' ptxas info : Function properties for _Z4funcP7ushort4 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads ptxas info : Used 8 registers, 36 bytes cmem[0] </code></pre> <p>shows no spills (ie. local memory). Further, disassembling the resulting cubin file shows:</p> <pre><code>&gt;cuobjdump --dump-sass ushort4.cubin code for sm_20 Function : _Z4funcP7ushort4 /*0000*/ /*0x00005de428004404*/ MOV R1, c [0x1] [0x100]; /*0008*/ /*0x01101c041000cfc0*/ F2F.F16.F32 R0, 0x3f000; /*0010*/ /*0x94009c042c000000*/ S2R R2, SR_CTAid_X; /*0018*/ /*0x8400dc042c000000*/ S2R R3, SR_Tid_X; /*0020*/ /*0x01111c041000cfe0*/ F2F.F16.F32 R4, 0x3f800; /*0028*/ /*0x00915c041c000000*/ I2I.U16.U16 R5, R0; /*0030*/ /*0x20209c0320064000*/ IMAD.U32.U32 R2, R2, c [0x0] [0x8], R3; /*0038*/ /*0x40019c03280ac040*/ BFI R6, R0, 0x1010, R5; /*0040*/ /*0x4041dc03280ac040*/ BFI R7, R4, 0x1010, R5; /*0048*/ /*0x80201c6340004000*/ ISCADD R0, R2, c [0x0] [0x20], 0x3; /*0050*/ /*0x00019ca590000000*/ ST.64 [R0], R6; /*0058*/ /*0x00001de780000000*/ EXIT; ................................. </code></pre> <p>ie. the <code>ushort4</code> is stuffed into register and then a 64 bit store is used to write the packed vector out to global memory. No local memory access to be seen. </p> <p>So if you have convinced yourself that you have a vector value compiling into local memory, it is either because you have a kernel with a lot of register pressure, or you are asking the compiler to (the <code>volatile</code> keyword will do that), or you have misinterpreted what the compiler/assembler are telling you at compile time.</p> <hr> <p>EDIT: Using the CUDA 4.0 release tookit with Visual Studio Express 2008 and compiling on 32bit Windows 7 for a compute 1.1 device gives:</p> <pre><code>&gt;nvcc --version nvcc: NVIDIA (R) Cuda compiler driver Copyright (c) 2005-2011 NVIDIA Corporation Built on Fri_May_13_02:42:40_PDT_2011 Cuda compilation tools, release 4.0, V0.2.1221 &gt;cl.exe Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 15.00.30729.01 for 80x86 Copyright (C) Microsoft Corporation. All rights reserved. usage: cl [ option... ] filename... [ /link linkoption... ] &gt;nvcc -cubin -arch=sm_11 -Xptxas=-v ushort4.cu ushort4.cu ushort4.cu tmpxft_00001788_00000000-3_ushort4.cudafe1.gpu tmpxft_00001788_00000000-10_ushort4.cudafe2.gpu ptxas info : Compiling entry function '_Z4funcP7ushort4' for 'sm_11' ptxas info : Used 4 registers, 4+16 bytes smem </code></pre> <p>which is the exact same result as for the original build for a compute 2.0 target.</p>
    singulars
    1. This table or related slice is empty.
    plurals
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. VO
      singulars
      1. This table or related slice is empty.
    2. VO
      singulars
      1. This table or related slice is empty.
    3. VO
      singulars
      1. This table or related slice is empty.
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload