What's the most efficient way to process massive amounts of data from a disk using Python?

I was writing a simple Python script to read from and reconstruct data from a failed RAID5 array that I've been unable to rebuild in any other way. My script is running, but slowly. My original script ran at about 80MB/min; I've since improved it and it's running at 550MB/min, but that still seems a bit low. The Python script sits at 100% CPU, so it appears to be CPU-limited rather than disk-limited, which means there's room for optimization. Because the script isn't very long at all, I haven't been able to profile it effectively, so I don't know what's eating up all the time. Here's my script as it stands right now (or at least, the important bits):

```python
import struct

disk0chunk = disk0.read(chunkSize)  # disk1 is missing, bad firmware
disk2chunk = disk2.read(chunkSize)
disk3chunk = disk3.read(chunkSize)

if (parityDisk % 4 == 1):  # the parity stripe is on the missing drive
    output.write(disk0chunk + disk2chunk + disk3chunk)
else:  # we need to rebuild the data in disk1
    # disk0num = map(ord, disk0chunk)  # inefficient, old code
    # disk2num = map(ord, disk2chunk)  # inefficient, old code
    # disk3num = map(ord, disk3chunk)  # inefficient, old code
    disk0num = struct.unpack("16384l", disk0chunk)  # more efficient new code
    disk2num = struct.unpack("16384l", disk2chunk)  # more efficient new code
    disk3num = struct.unpack("16384l", disk3chunk)  # more efficient new code
    magicpotato = zip(disk0num, disk2num, disk3num)
    disk1num = map(takexor, magicpotato)
    # disk1bytes = map(chr, disk1num)   # inefficient, old code
    # disk1chunk = ''.join(disk1bytes)  # inefficient, old code
    disk1chunk = struct.pack("16384l", *disk1num)  # more efficient new code
    # output non-parity data based on parityDisk

def takexor(magicpotato):
    return magicpotato[0] ^ magicpotato[1] ^ magicpotato[2]
```

Bolding to denote the actual questions inside this giant block of text:

**Is there anything I can do to make this faster/better? If nothing comes to mind, is there anything I can do to better research what is making it slow? (Is there even a way to profile Python at a per-line level?) Am I even handling this the right way, or is there a better way to handle massive amounts of binary data?**

The reason I ask is that I have a 3TB drive rebuilding, and even though it's working correctly (I can mount the image ro,loop and browse files fine), it's taking a long time. With the old code it was on track to finish in mid-January; now it should finish around Christmas (so it's *way* better, but still slower than I expected).

Before you ask: this is an mdadm RAID5 (64kb blocksize, left-symmetric), but the mdadm metadata is missing somehow, and mdadm does not allow you to reconfigure a RAID5 without rewriting the metadata to the disks, which I am trying to avoid at all costs. I don't want to risk screwing something up and losing data, however remote the possibility may be.
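
On the per-line profiling question: the standard library's cProfile reports time per function rather than per line, but wrapping each step of the loop (read, unpack, XOR, pack) in its own small function is usually enough to see where the time goes. A minimal sketch, assuming the loop above is wrapped in a hypothetical rebuild() function (the name and the output filename are mine, for illustration only):

```python
import cProfile
import pstats

def rebuild():
    # hypothetical wrapper around the read / unpack / XOR / pack loop above
    pass

# Run the rebuild under the profiler and dump the timing data to a file.
cProfile.run("rebuild()", "rebuild.prof")

# Print the ten functions with the highest cumulative time.
stats = pstats.Stats("rebuild.prof")
stats.sort_stats("cumulative").print_stats(10)
```

Splitting the body into one function per step makes the function-level numbers effectively per-step numbers, which is often enough to tell whether the time is going into the reads, the unpack/pack calls, or the zip/map XOR step.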