
# Efficient Way to Create Numpy Arrays from Binary Files
I have very large datasets that are stored in binary files on the hard disk. Here is an example of the file structure:

*File Header*

```
149 Byte ASCII Header
```

*Record Start*

```
4 Byte Int - Record Timestamp
```

*Sample Start*

```
2 Byte Int - Data Stream 1 Sample
2 Byte Int - Data Stream 2 Sample
2 Byte Int - Data Stream 3 Sample
2 Byte Int - Data Stream 4 Sample
```

*Sample End*

There are 122,880 samples per record and 713 records per file, which yields a total size of 700,910,521 bytes. The sample rate and number of records do vary sometimes, so I have to detect the number of each per file.

Currently, the code I use to import this data into arrays works like this:

```python
from time import clock
from numpy import zeros, int16, int32, hstack, array, savez
from struct import unpack
from os.path import getsize

start_time = clock()
file_size = getsize(input_file)

# read the entire file into memory
with open(input_file, 'rb') as openfile:
    input_data = openfile.read()

# the record size is stored as ASCII digits in the header
header = input_data[:149]
record_size = int(header[23:31])
number_of_records = (file_size - 149) / record_size
sample_rate = ((record_size - 4) / 4) / 2

time_series = zeros(0, dtype=int32)
t_series = zeros(0, dtype=int16)
x_series = zeros(0, dtype=int16)
y_series = zeros(0, dtype=int16)
z_series = zeros(0, dtype=int16)

for record in xrange(number_of_records):
    record_start = 149 + record * record_size

    # unpack the 4-byte timestamp, then the interleaved samples
    time_stamp = array(unpack('<l', input_data[record_start:record_start + 4]),
                       dtype=int32)
    unpacked_record = unpack('<' + str(sample_rate * 4) + 'h',
                             input_data[record_start + 4:record_start + record_size])

    # de-interleave the four data streams one sample at a time
    record_t = zeros(sample_rate, dtype=int16)
    record_x = zeros(sample_rate, dtype=int16)
    record_y = zeros(sample_rate, dtype=int16)
    record_z = zeros(sample_rate, dtype=int16)
    for sample in xrange(sample_rate):
        record_t[sample] = unpacked_record[(sample * 4) + 0]
        record_x[sample] = unpacked_record[(sample * 4) + 1]
        record_y[sample] = unpacked_record[(sample * 4) + 2]
        record_z[sample] = unpacked_record[(sample * 4) + 3]

    # hstack reallocates and copies the arrays on every iteration
    time_series = hstack((time_series, time_stamp))
    t_series = hstack((t_series, record_t))
    x_series = hstack((x_series, record_x))
    y_series = hstack((y_series, record_y))
    z_series = hstack((z_series, record_z))

savez(output_file, t=t_series, x=x_series, y=y_series, z=z_series, time=time_series)

end_time = clock()
print 'Total Time', end_time - start_time, 'seconds'
```

This currently takes about 250 seconds per 700 MB file, which seems very high to me. Is there a more efficient way I could do this?
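To compare approaches without the real data, a small synthetic file with the same layout can be generated. This is a minimal sketch, not part of the original post: the header contents are an assumption (space padding, plus the record size as right-aligned ASCII digits at offsets 23:31, which is all the parsing code reads), and the sample values are arbitrary.

```python
from struct import pack

def write_test_file(path, number_of_records=3, sample_rate=8):
    # record = 4-byte timestamp + sample_rate interleaved samples
    # from each of the four 2-byte data streams
    record_size = 4 + sample_rate * 4 * 2

    # 149-byte ASCII header; only bytes 23:31 (the record size as
    # right-aligned ASCII digits) are read by the parsing code --
    # everything else here is assumed padding
    header = bytearray(b' ' * 149)
    header[23:31] = ('%8d' % record_size).encode('ascii')

    with open(path, 'wb') as f:
        f.write(bytes(header))
        for record in range(number_of_records):
            f.write(pack('<l', record))  # 4-byte record timestamp
            for sample in range(sample_rate):
                # one sample from each stream, interleaved
                f.write(pack('<4h', sample, -sample, 2 * sample, record))

write_test_file('test.bin')
```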
# Final Solution

Using the numpy `fromfile` method with a custom dtype cut the runtime to 9 seconds, roughly 27x faster than the original code above. The final code is below.

```python
from numpy import savez, dtype, fromfile
from os.path import getsize
from time import clock

start_time = clock()
file_size = getsize(input_file)

openfile = open(input_file, 'rb')
header = openfile.read(149)
record_size = int(header[23:31])
number_of_records = (file_size - 149) / record_size
sample_rate = ((record_size - 4) / 4) / 2

# one structured dtype describes an entire record: a little-endian
# int32 timestamp followed by a (sample_rate, 4) block of int16 samples
record_dtype = dtype([('timestamp', '<i4'),
                      ('samples', '<i2', (sample_rate, 4))])

# read every record in a single call
data = fromfile(openfile, dtype=record_dtype, count=number_of_records)

time_series = data['timestamp']
t_series = data['samples'][:, :, 0].ravel()
x_series = data['samples'][:, :, 1].ravel()
y_series = data['samples'][:, :, 2].ravel()
z_series = data['samples'][:, :, 3].ravel()

savez(output_file, t=t_series, x=x_series, y=y_series, z=z_series, fid=time_series)

end_time = clock()
print 'It took', end_time - start_time, 'seconds'
```
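If the whole file does not need to be held in memory at once, `numpy.memmap` accepts the same structured dtype and maps the records lazily, so slicing only touches the pages actually read. A sketch under the same header assumptions as above (`test.bin` is the hypothetical synthetic file from the earlier sketch):

```python
# A variant of the final solution using numpy.memmap: the file is
# mapped rather than read, so nothing is loaded until it is sliced.
import numpy as np
from os.path import getsize

input_file = 'test.bin'  # hypothetical synthetic file from the sketch above
file_size = getsize(input_file)

with open(input_file, 'rb') as f:
    header = f.read(149)
record_size = int(header[23:31])
number_of_records = (file_size - 149) // record_size
sample_rate = ((record_size - 4) // 4) // 2

record_dtype = np.dtype([('timestamp', '<i4'),
                         ('samples', '<i2', (sample_rate, 4))])

# offset=149 skips the header; shape gives the number of records
data = np.memmap(input_file, dtype=record_dtype, mode='r',
                 offset=149, shape=(number_of_records,))

time_series = data['timestamp']
t_series = data['samples'][:, :, 0].ravel()  # copies only stream 1
```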