  1. How to use the DSP to speed up code on OMAP?
    <p>I'm working on a video codec for the OMAP3430. I already have code written in C++, and I am trying to port certain parts of it to take advantage of the DSP (the SDK I have, the OMAP ZOOM3430 SDK, has an additional DSP).</p> <p>I tried to port a small for loop that runs over a very small amount of data (~250 bytes), but about 2M times, each time on different data. However, the overhead of the communication between the CPU and the DSP is much larger than the gain (if there is any).</p> <p>I assume this task is much like optimizing code for the GPUs in ordinary computers. My question is: what kind of code is worth porting? How do GPU programmers handle such tasks?</p> <h2>Edit:</h2> <ol> <li>GPP application allocates a buffer of size 0x1000 bytes.</li> <li>GPP application invokes DSPProcessor_ReserveMemory to reserve a DSP virtual address space for each allocated buffer, using a size that is 4K greater than the allocated buffer to account for automatic page alignment. The total reservation size must also be aligned along a 4K page boundary.</li> <li>GPP application invokes DSPProcessor_Map to map each allocated buffer to the DSP virtual address spaces reserved in the previous step.</li> <li>GPP application prepares a message to notify the DSP execute phase of the base address of the virtual address space that has been mapped to a buffer allocated on the GPP. GPP application uses DSPNode_PutMessage to send the message to the DSP.</li> <li>GPP invokes memcpy to copy the data to be processed into the shared memory.</li> <li>GPP application invokes DSPProcessor_FlushMemory to ensure that the data cache has been flushed.</li> <li>GPP application prepares a message to notify the DSP execute phase that it has finished writing to the buffer and that the DSP may now access the buffer. The message also contains the amount of data written to the buffer, so that the DSP knows how much data to copy. The GPP uses DSPNode_PutMessage to send the message to the DSP and then invokes DSPNode_GetMessage to wait for a message back from the DSP.</li> </ol>
    <p>After these steps, execution of the DSP program starts, and the DSP notifies the GPP with a message when it finishes processing. As a test, I put no processing at all in the DSP program; it just sends a "processing finished" message back to the GPP. This still consumes a lot of time. Could that be because of internal/external memory usage, or is it purely communication overhead?</p>
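One way to reason about whether a port like this can pay off is a simple cost model: each GPP/DSP round trip has a fixed cost, so many work items need to share one round trip before any per-item gain shows up. The sketch below is only illustrative. The struct, the function names, and every number fed into it are hypothetical and are not part of the DSP Bridge API; the real figures would have to be measured on the board.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical cost figures; none of these come from the SDK.
struct OffloadCosts {
    double round_trip_us;    // fixed cost of one GPP <-> DSP message round trip
    double gpp_us_per_item;  // time to process one ~250-byte item on the GPP
    double dsp_us_per_item;  // time to process the same item on the DSP
};

// Total time when all items stay on the GPP.
double gpp_time_us(const OffloadCosts& c, std::uint64_t items) {
    return static_cast<double>(items) * c.gpp_us_per_item;
}

// Total time when items are sent to the DSP in batches of `batch` items,
// paying one round trip per batch.
double offload_time_us(const OffloadCosts& c, std::uint64_t items, std::uint64_t batch) {
    assert(batch > 0);
    std::uint64_t calls = (items + batch - 1) / batch;  // ceiling division
    return static_cast<double>(calls) * c.round_trip_us +
           static_cast<double>(items) * c.dsp_us_per_item;
}

// Smallest batch size at which one batch is faster on the DSP than on the GPP,
// or 0 if the DSP is not faster per item at all.
std::uint64_t break_even_batch(const OffloadCosts& c) {
    double gain = c.gpp_us_per_item - c.dsp_us_per_item;
    if (gain <= 0.0) return 0;
    // Need round_trip_us < batch * gain, so batch > round_trip_us / gain.
    return static_cast<std::uint64_t>(c.round_trip_us / gain) + 1;
}
```

With made-up figures such as `OffloadCosts{100.0, 1.0, 0.5}`, sending the 2M items one per message is far slower than staying on the GPP, while batches of 1000 items come out ahead. That matches the observation that an empty DSP program is already expensive when it is invoked once per item.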
    1. I've generally used the TI Code Composer tool chain when dealing with the C6xxx series, so I can't tell you which specific Linux calls to use, but the C64xx hardware is roughly the same regardless. When possible, avoid "handholding" the data copy. I would expect the memcpy to tie up the processor in a copy loop; if possible, find a DMA operation on the ARM side to move the data. Try to do whatever processing you can while the DMA operations run in parallel. Worst case, you should be able to configure the C64xx to move the data by programming the DMA registers directly.
    2. Look at the documentation for the board. I can't find much information on it because they are pushing the next generation of the Zoom kit. DMA operations can transfer data at up to bus width * clock speed on synchronous buses. If the processing you are doing does not require a nested loop, it may take longer to move the data over an external bus than to do the processing locally. External buses typically run at 1/2 or 1/4 of the core speed for parts in the 400 MHz range. If it's an async bus, it's even worse.
    3. If the bus speed is 100 MHz (to be conservative) and the word size is 32 bits (the C64xx can have a 64-bit bus on the EMIF for standard parts; I don't know about OMAP), then (4096 bytes / 4 bytes) / 100 MHz = 10.24 µs transfer time should be achievable for that amount of data using DMA.
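The transfer-time estimate in the last comment can be written as a small helper, which makes it easy to try other bus widths or clocks. This is a rough sketch of the ideal case only: it assumes one bus word moves per clock on a synchronous bus and ignores DMA setup, arbitration, and wait states. The function name is made up for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Ideal DMA transfer time in microseconds, assuming one bus word per clock.
// bus_clock_mhz is cycles per microsecond, so words / MHz gives microseconds.
double dma_transfer_us(std::size_t bytes, std::size_t bus_width_bytes, double bus_clock_mhz) {
    assert(bus_width_bytes > 0 && bus_clock_mhz > 0.0);
    std::size_t words = (bytes + bus_width_bytes - 1) / bus_width_bytes;  // round up
    return static_cast<double>(words) / bus_clock_mhz;
}
```

For example, `dma_transfer_us(4096, 4, 100.0)` reproduces the 10.24 µs figure for a 32-bit bus at 100 MHz, and passing 8 as the bus width halves it for a 64-bit bus.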