
# Efficient Way to Create Numpy Arrays from Binary Files
I have very large datasets that are stored in binary files on the hard disk. Here is an example of the file structure:

*File Header*

```
149 Byte ASCII Header
```

*Record Start*

```
4 Byte Int - Record Timestamp
```

*Sample Start*

```
2 Byte Int - Data Stream 1 Sample
2 Byte Int - Data Stream 2 Sample
2 Byte Int - Data Stream 3 Sample
2 Byte Int - Data Stream 4 Sample
```

*Sample End*

There are 122,880 samples per record and 713 records per file, which yields a total size of 700,910,521 bytes. The sample rate and number of records do vary sometimes, so I have to detect the number of each per file.

Currently, the code I use to import this data into arrays works like this:

```python
from time import clock
from numpy import zeros, int16, int32, hstack, array, savez
from struct import unpack
from os.path import getsize

start_time = clock()
file_size = getsize(input_file)

# read the entire file into memory
with open(input_file, 'rb') as openfile:
    input_data = openfile.read()

# the record size is stored as ASCII digits in the header
header = input_data[:149]
record_size = int(header[23:31])
number_of_records = (file_size - 149) / record_size
sample_rate = ((record_size - 4) / 4) / 2

time_series = zeros(0, dtype=int32)
t_series = zeros(0, dtype=int16)
x_series = zeros(0, dtype=int16)
y_series = zeros(0, dtype=int16)
z_series = zeros(0, dtype=int16)

for record in xrange(number_of_records):
    record_start = 149 + record * record_size

    # unpack the 4-byte timestamp, then the interleaved samples
    time_stamp = array(unpack('<l', input_data[record_start:record_start + 4]),
                       dtype=int32)
    unpacked_record = unpack('<' + str(sample_rate * 4) + 'h',
                             input_data[record_start + 4:record_start + record_size])

    # de-interleave the four data streams one sample at a time
    record_t = zeros(sample_rate, dtype=int16)
    record_x = zeros(sample_rate, dtype=int16)
    record_y = zeros(sample_rate, dtype=int16)
    record_z = zeros(sample_rate, dtype=int16)
    for sample in xrange(sample_rate):
        record_t[sample] = unpacked_record[(sample * 4) + 0]
        record_x[sample] = unpacked_record[(sample * 4) + 1]
        record_y[sample] = unpacked_record[(sample * 4) + 2]
        record_z[sample] = unpacked_record[(sample * 4) + 3]

    # hstack reallocates and copies the arrays on every iteration
    time_series = hstack((time_series, time_stamp))
    t_series = hstack((t_series, record_t))
    x_series = hstack((x_series, record_x))
    y_series = hstack((y_series, record_y))
    z_series = hstack((z_series, record_z))

savez(output_file, t=t_series, x=x_series, y=y_series, z=z_series, time=time_series)

end_time = clock()
print 'Total Time', end_time - start_time, 'seconds'
```

This currently takes about 250 seconds per 700 MB file, which seems very high to me. Is there a more efficient way I could do this?
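To compare approaches without the real data, a small synthetic file with the same layout can be generated. This is a minimal sketch, not part of the original post: the header contents are an assumption (space padding, plus the record size as right-aligned ASCII digits at offsets 23:31, which is all the parsing code reads), and the sample values are arbitrary.

```python
from struct import pack

def write_test_file(path, number_of_records=3, sample_rate=8):
    # record = 4-byte timestamp + sample_rate interleaved samples
    # from each of the four 2-byte data streams
    record_size = 4 + sample_rate * 4 * 2

    # 149-byte ASCII header; only bytes 23:31 (the record size as
    # right-aligned ASCII digits) are read by the parsing code --
    # everything else here is assumed padding
    header = bytearray(b' ' * 149)
    header[23:31] = ('%8d' % record_size).encode('ascii')

    with open(path, 'wb') as f:
        f.write(bytes(header))
        for record in range(number_of_records):
            f.write(pack('<l', record))  # 4-byte record timestamp
            for sample in range(sample_rate):
                # one sample from each stream, interleaved
                f.write(pack('<4h', sample, -sample, 2 * sample, record))

write_test_file('test.bin')
```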
# Final Solution

Using the numpy `fromfile` method with a custom dtype cut the runtime to 9 seconds, roughly 27x faster than the original code above. The final code is below.

```python
from numpy import savez, dtype, fromfile
from os.path import getsize
from time import clock

start_time = clock()
file_size = getsize(input_file)

openfile = open(input_file, 'rb')
header = openfile.read(149)
record_size = int(header[23:31])
number_of_records = (file_size - 149) / record_size
sample_rate = ((record_size - 4) / 4) / 2

# one structured dtype describes an entire record: a little-endian
# int32 timestamp followed by a (sample_rate, 4) block of int16 samples
record_dtype = dtype([('timestamp', '<i4'),
                      ('samples', '<i2', (sample_rate, 4))])

# read every record in a single call
data = fromfile(openfile, dtype=record_dtype, count=number_of_records)

time_series = data['timestamp']
t_series = data['samples'][:, :, 0].ravel()
x_series = data['samples'][:, :, 1].ravel()
y_series = data['samples'][:, :, 2].ravel()
z_series = data['samples'][:, :, 3].ravel()

savez(output_file, t=t_series, x=x_series, y=y_series, z=z_series, fid=time_series)

end_time = clock()
print 'It took', end_time - start_time, 'seconds'
```
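If the whole file does not need to be held in memory at once, `numpy.memmap` accepts the same structured dtype and maps the records lazily, so slicing only touches the pages actually read. A sketch under the same header assumptions as above (`test.bin` is the hypothetical synthetic file from the earlier sketch):

```python
# A variant of the final solution using numpy.memmap: the file is
# mapped rather than read, so nothing is loaded until it is sliced.
import numpy as np
from os.path import getsize

input_file = 'test.bin'  # hypothetical synthetic file from the sketch above
file_size = getsize(input_file)

with open(input_file, 'rb') as f:
    header = f.read(149)
record_size = int(header[23:31])
number_of_records = (file_size - 149) // record_size
sample_rate = ((record_size - 4) // 4) // 2

record_dtype = np.dtype([('timestamp', '<i4'),
                         ('samples', '<i2', (sample_rate, 4))])

# offset=149 skips the header; shape gives the number of records
data = np.memmap(input_file, dtype=record_dtype, mode='r',
                 offset=149, shape=(number_of_records,))

time_series = data['timestamp']
t_series = data['samples'][:, :, 0].ravel()  # copies only stream 1
```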