
MATLAB: Saving several variables to "-v7.3" (HDF5) .mat-files seems to be faster when using the "-append" flag. How come?
**NOTE:** *This question deals with an issue observed back in 2011 with an old MATLAB version (R2009a). As per the update from July 2016 below, the issue/bug in MATLAB seems to no longer exist (tested with R2016a; scroll down to the end of the question to see the update).*

I am using MATLAB R2009b and I need to write a larger script that converts the contents of a larger set of .zip files to v7.3 mat-files (with an underlying HDF5 data model). Reading is OK. The issue is with saving. And there is actually no problem: my files save nicely using the **save** command.

My question is more in the sense: Why am I observing the following surprising (for me) behavior in MATLAB?

Let's look at my issue in general. In this current test scenario I will be generating one output: a -v7.3 mat-file. This .mat-file will contain 40 *blocks* as individual variables. Each variable will be named "block_NNN" from 1 to 40 and will contain a struct with fields *frames* and *blockNo*. Field *frames* contains a 480x240x65 sequence of uint8 image data (here just random data generated using **randi**). Field *blockNo* contains the block number.

Remark: *In the real script (that I have yet to finish) I will be doing the above a total of 370 times, converting a total of 108 GB of raw data. Which is why I am concerned with the following.*

Anyway, first I define some general variables:

```matlab
% some sizes for dummy data and loops:
num_blockCount  = 40;
num_blockLength = 65;
num_frameHeight = 480;
num_frameWidth  = 240;
```

I then generate some dummy data that has shape and size identical to the actual raw data:

```matlab
% generate empty struct:
stu_data2disk = struct();

% loop over blocks:
for num_k = 1:num_blockCount

    % generate block-name:
    temp_str_blockName = sprintf('block_%03u', num_k);

    % generate temp struct for current block:
    temp_stu_value = struct();
    temp_stu_value.frames = randi( ...
        [0 255], ...
        [num_frameHeight num_frameWidth num_blockLength], ...
        'uint8' ...
    );
    temp_stu_value.blockNo = num_k;

    % using dynamic field names:
    stu_data2disk.(temp_str_blockName) = temp_stu_value;

end
```

I now have all my random test data in a struct **stu_data2disk**. Now I would like to save the data using one of two possible methods.

Let's try the simple one first:

```matlab
% save data (simple):
disp('Save data the simple way:')
tic;
save converted.mat -struct stu_data2disk -v7.3;
toc;
```

The file is written without problems (286 MB). The output is:

```
Save data the simple way:
Elapsed time is 14.004449 seconds.
```

OK, then I remembered that I would like to follow the save procedure over the 40 blocks. Thus instead of the above I loop over the blocks and append them in sequence:

```matlab
% save to file, using append:
disp('Save data using -append:')
tic;
for num_k = 1:num_blockCount

    % generate block-name:
    temp_str_blockName = sprintf('block_%03u', num_k);

    temp_str_appendToggle = '';
    if (num_k > 1)
        temp_str_appendToggle = '-append';
    end

    % generate save command:
    temp_str_saveCommand = [ ...
        'save ', ...
        'converted_append.mat ', ...
        '-struct stu_data2disk ', temp_str_blockName, ' ', ...
        temp_str_appendToggle, ' ', ...
        '-v7.3', ...
        ';' ...
    ];

    % evaluate save command:
    eval(temp_str_saveCommand);

end
toc;
```
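(As an aside, the same per-block loop could also be written without **eval** by using the functional form of **save**. This is just a minimal sketch of that idea, not the code that was timed here:)

```matlab
% sketch: per-block append using the functional form of save (no eval);
% not the code used for the timings in this question
for num_k = 1:num_blockCount
    temp_str_blockName = sprintf('block_%03u', num_k);
    if num_k == 1
        % first block creates the file
        save('converted_append.mat', '-struct', 'stu_data2disk', ...
            temp_str_blockName, '-v7.3');
    else
        % subsequent blocks are appended
        save('converted_append.mat', '-struct', 'stu_data2disk', ...
            temp_str_blockName, '-append', '-v7.3');
    end
end
```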
And again the file saves nicely (286 MB). The output is:

```
Save data using -append:
Elapsed time is 0.956968 seconds.
```

Interestingly, the append method is much faster. **My question is why?**

Output from **`dir converted*.mat`**:

```
09-02-2011  20:38       300,236,392 converted.mat
09-02-2011  20:37       300,264,316 converted_append.mat
               2 File(s)    600,500,708 bytes
```

The files are not identical in size. And a test with **fc** in Windows 7 revealed ... well, many binary differences. Perhaps the data was shifted a bit, so this tells us nothing.

Does someone have an idea what is going on here? Is the appended file using a much more optimized data structure perhaps? Or maybe Windows has cached the file and makes access to it much faster?

I made the effort of test-reading from the two files as well. Without presenting the numbers here, the appended version was a little bit faster (could mean something in the long run though).

**[EDIT]**: I just tried using no format flag (defaults to -v7 on my system) and there is not much difference anymore:

```
Save data the simple way (-v7):
Elapsed time is 13.092084 seconds.
Save data using -append (-v7):
Elapsed time is 14.345314 seconds.
```

**[EDIT]**: I corrected the above mistake. Previously I mentioned that the stats were for -v6, but I was mistaken. I had just removed the format flag and assumed the default was -v6, but actually it is -v7.

I have created new test stats for all formats on my system using Andrew's fine framework (all formats are for the same random test data, now read from file):

```
15:15:51.422: Testing speed, format=-v6, R2009b on PCWIN, arch=x86, os=Microsoft Windows 7 Professional 6.1.7600 N/A Build 7600
15:16:00.829: Save the simple way:           0.358 sec
15:16:01.188: Save using multiple append:    7.432 sec
15:16:08.614: Save using one big append:     1.161 sec

15:16:24.659: Testing speed, format=-v7, R2009b on PCWIN, arch=x86, os=Microsoft Windows 7 Professional 6.1.7600 N/A Build 7600
15:16:33.442: Save the simple way:          12.884 sec
15:16:46.329: Save using multiple append:   14.442 sec
15:17:00.775: Save using one big append:    13.390 sec

15:17:31.579: Testing speed, format=-v7.3, R2009b on PCWIN, arch=x86, os=Microsoft Windows 7 Professional 6.1.7600 N/A Build 7600
15:17:40.690: Save the simple way:          13.751 sec
15:17:54.434: Save using multiple append:    3.970 sec
15:17:58.412: Save using one big append:     6.138 sec
```

And the sizes of the files:

```
10-02-2011  15:16       299,528,768 converted_format-v6.mat
10-02-2011  15:16       299,528,768 converted_append_format-v6.mat
10-02-2011  15:16       299,528,832 converted_append_batch_format-v6.mat
10-02-2011  15:16       299,894,027 converted_format-v7.mat
10-02-2011  15:17       299,894,027 converted_append_format-v7.mat
10-02-2011  15:17       299,894,075 converted_append_batch_format-v7.mat
10-02-2011  15:17       300,236,392 converted_format-v7.3.mat
10-02-2011  15:17       300,264,316 converted_append_format-v7.3.mat
10-02-2011  15:18       300,101,800 converted_append_batch_format-v7.3.mat
               9 File(s)  2,698,871,005 bytes
```

Thus -v6 seems to be the fastest for writing. There are also no large differences in file sizes. HDF5 does have some basic inflate method built in, as far as I know.

*Hmm, probably some optimization in the underlying HDF5 write functions?*

Currently I still think that some underlying fundamental HDF5 write function is optimized for adding *datasets* to an HDF5 file (which is what happens when adding new variables to a -v7.3 file). I believe I have read somewhere that HDF5 is supposed to be optimized in exactly this way, though I cannot be sure.
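(One way to check this speculation would be to compare the actual HDF5 layout of the two files. A minimal sketch, assuming a newer MATLAB release than my R2009b, since **h5info**/**h5disp** were added later; on R2009b one would presumably have to fall back to **hdf5info** or the low-level H5* functions:)

```matlab
% sketch: inspect and compare the HDF5 layout of the two -v7.3 files
% (h5info/h5disp require a newer MATLAB release than the R2009b used above)
info_simple = h5info('converted.mat');
info_append = h5info('converted_append.mat');

% each saved struct variable should appear as a top-level group:
disp({info_simple.Groups.Name});
disp({info_append.Groups.Name});

% full dump, including per-dataset chunking and filter (compression) settings,
% which is where a layout difference between the two files would show up:
h5disp('converted.mat');
h5disp('converted_append.mat');
```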
Other details to note:

The behavior is very systematic, as we see in Andrew's answer below. It also seems to be quite important whether you run these things in the local scope of a function or in the "global" scope of an m-script. My first results were from an m-script where files were written to the current directory. I can still only reproduce the 1-second write for -v7.3 in the m-script. The function calls apparently add some overhead.

**Update July 2016**:

I found this again and thought I might test it with the newest MATLAB available to me at the moment. With MATLAB R2016a on Windows 7 x64 the problem seems to have been fixed:

```
14:04:06.277: Testing speed, imax=255, R2016a on PCWIN64, arch=AMD64, 16 GB, os=Microsoft Windows 7 Enterprise  Version 6.1 (Build 7601: Service Pack 1)
14:04:10.600: basic -v7.3:        7.599 sec      5.261 GB used
14:04:18.229: basic -v7.3:        7.894 sec      5.383 GB used
14:04:26.154: basic -v7.3:        7.909 sec      5.457 GB used
14:04:34.096: basic -v7.3:        7.919 sec      5.498 GB used
14:04:42.048: basic -v7.3:        7.886 sec      5.516 GB used
286 MB file   7.841 sec mean
14:04:50.581: multiappend -v7.3:  7.928 sec      5.819 GB used
14:04:58.544: multiappend -v7.3:  7.905 sec      5.834 GB used
14:05:06.485: multiappend -v7.3:  8.013 sec      5.844 GB used
14:05:14.542: multiappend -v7.3:  8.591 sec      5.860 GB used
14:05:23.168: multiappend -v7.3:  8.059 sec      5.868 GB used
286 MB file   8.099 sec mean
14:05:31.913: bigappend -v7.3:    7.727 sec      5.837 GB used
14:05:39.676: bigappend -v7.3:    7.740 sec      5.879 GB used
14:05:47.453: bigappend -v7.3:    7.645 sec      5.884 GB used
14:05:55.133: bigappend -v7.3:    7.656 sec      5.877 GB used
14:06:02.824: bigappend -v7.3:    7.963 sec      5.871 GB used
286 MB file   7.746 sec mean
```

This was tested with Andrew Janke's `reproMatfileAppendSpeedup` function in the accepted answer below (5 passes with format 7.3). Now `-append` is equally slow, or slower, than a single save - as it should be. Perhaps it was a problem with an early build of the HDF5 driver used in R2009a.
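(For readers who want to reproduce a comparison like this themselves, here is a rough sketch of what the three variants labelled "simple"/"basic", "multiple append"/"multiappend" and "one big append"/"bigappend" boil down to, wrapped in a function so it runs in local scope. This is my own minimal sketch under those assumptions, not Andrew Janke's `reproMatfileAppendSpeedup` from the accepted answer, and the test_*.mat file names are made up:)

```matlab
function timeSaveVariants()
% Rough sketch of the three save variants compared above; NOT the
% reproMatfileAppendSpeedup framework, just a minimal illustration.

% build the same kind of dummy data as in the question
stu = struct();
for k = 1:40
    stu.(sprintf('block_%03u', k)) = struct( ...
        'frames', randi([0 255], [480 240 65], 'uint8'), ...
        'blockNo', k);
end
blockNames = fieldnames(stu);

% variant 1: "basic" - one save call for the whole struct
tic;
save('test_basic.mat', '-struct', 'stu', '-v7.3');
fprintf('basic:       %.3f sec\n', toc);

% variant 2: "multiappend" - one save call per block, appending
tic;
save('test_multi.mat', '-struct', 'stu', blockNames{1}, '-v7.3');
for k = 2:numel(blockNames)
    save('test_multi.mat', '-struct', 'stu', blockNames{k}, '-append', '-v7.3');
end
fprintf('multiappend: %.3f sec\n', toc);

% variant 3: "bigappend" - create the file with one tiny variable,
% then append all blocks in a single save call (the dummy stays in the file)
tic;
dummy = 0; %#ok<NASGU>
save('test_big.mat', 'dummy', '-v7.3');
save('test_big.mat', '-struct', 'stu', blockNames{:}, '-append', '-v7.3');
fprintf('bigappend:   %.3f sec\n', toc);

end
```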
Comments:

1. Surprising. I'd think -append would be slower (see http://stackoverflow.com/questions/4268044/deleting-variables-from-a-mat-file/4270936). Local or network filesystem? Try watching both save operations in Process Explorer or Process Monitor to see how much IO (bytes and operations in and out) each is doing.
2. Could well be caching. Or maybe -append causes Matlab to open the file for RW in a different fashion (e.g. mem-mapped) which is faster in your case. Or maybe writing the 300MB file in one large chunk interacts with buffers or something and makes it slower. As another test, what happens if you save a trivially small variable to a file, and then append all of your chunks in a single call to "save -append"? And how fast is saving each chunk to a separate file? That might help differentiate between the -append mode and chunk sizes. (A sketch of the separate-files test follows after these comments.)
3. @Andrew: Thanks for your comments. It is surprising :) But then again I am happy I took the time writing the loop that did it faster. I will see if I get the time tomorrow to make those tests. Bedtime for me though :)
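(For completeness, a minimal sketch of the "save each chunk to a separate file" test suggested in comment 2. The converted_NNN.mat file names are made up for illustration, and this was not part of the timings shown above:)

```matlab
% sketch of the "one file per chunk" test from comment 2:
% each block goes to its own -v7.3 file instead of being appended
tic;
for num_k = 1:num_blockCount
    temp_str_blockName = sprintf('block_%03u', num_k);
    save(sprintf('converted_%03u.mat', num_k), ...
        '-struct', 'stu_data2disk', temp_str_blockName, '-v7.3');
end
toc;
```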