What's the most efficient way to process massive amounts of data from a disk using Python?

I was writing a simple Python script to read from and reconstruct data from a failed RAID5 array that I've been unable to rebuild in any other way. My script is running, but slowly. My original script ran at about 80MB/min; I've since improved it and it's running at 550MB/min, but that still seems a bit low. The Python script sits at 100% CPU, so it appears to be CPU-limited rather than disk-limited, which means there's room for optimization. Because the script isn't very long at all, I haven't been able to profile it effectively, so I don't know what's eating up all the time. Here's my script as it stands right now (or at least, the important bits):

```python
import struct

disk0chunk = disk0.read(chunkSize)  # disk1 is missing, bad firmware
disk2chunk = disk2.read(chunkSize)
disk3chunk = disk3.read(chunkSize)

if (parityDisk % 4 == 1):  # the parity stripe is on the missing drive
    output.write(disk0chunk + disk2chunk + disk3chunk)
else:  # we need to rebuild the data in disk1
    # disk0num = map(ord, disk0chunk)  # inefficient, old code
    # disk2num = map(ord, disk2chunk)  # inefficient, old code
    # disk3num = map(ord, disk3chunk)  # inefficient, old code
    disk0num = struct.unpack("16384l", disk0chunk)  # more efficient new code
    disk2num = struct.unpack("16384l", disk2chunk)  # more efficient new code
    disk3num = struct.unpack("16384l", disk3chunk)  # more efficient new code
    magicpotato = zip(disk0num, disk2num, disk3num)
    disk1num = map(takexor, magicpotato)
    # disk1bytes = map(chr, disk1num)   # inefficient, old code
    # disk1chunk = ''.join(disk1bytes)  # inefficient, old code
    disk1chunk = struct.pack("16384l", *disk1num)  # more efficient new code
    # output non-parity data based on parityDisk

def takexor(magicpotato):
    return magicpotato[0] ^ magicpotato[1] ^ magicpotato[2]
```

Bolding to denote the actual questions inside this giant block of text:

**Is there anything I can do to make this faster/better? If nothing comes to mind, is there anything I can do to better research what is making it slow? (Is there even a way to profile Python at a per-line level?) Am I even handling this the right way, or is there a better way to handle massive amounts of binary data?**

The reason I ask is that I have a 3TB drive rebuilding, and even though it's working correctly (I can mount the image ro,loop and browse files fine), it's taking a long time. With the old code it was on track to finish in mid-January; now it should finish around Christmas (so it's *way* better, but still slower than I expected).

Before you ask: this is an mdadm RAID5 (64kb blocksize, left-symmetric), but the mdadm metadata is missing somehow, and mdadm does not allow you to reconfigure a RAID5 without rewriting the metadata to the disks, which I am trying to avoid at all costs. I don't want to risk screwing something up and losing data, however remote the possibility may be.
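
On the per-line profiling question: the standard library's cProfile reports time per function rather than per line, but wrapping each step of the loop (read, unpack, XOR, pack) in its own small function is usually enough to see where the time goes. A minimal sketch, assuming the loop above is wrapped in a hypothetical rebuild() function (the name and the output filename are mine, for illustration only):

```python
import cProfile
import pstats

def rebuild():
    # hypothetical wrapper around the read / unpack / XOR / pack loop above
    pass

# Run the rebuild under the profiler and dump the timing data to a file.
cProfile.run("rebuild()", "rebuild.prof")

# Print the ten functions with the highest cumulative time.
stats = pstats.Stats("rebuild.prof")
stats.sort_stats("cumulative").print_stats(10)
```

Splitting the body into one function per step makes the function-level numbers effectively per-step numbers, which is often enough to tell whether the time is going into the reads, the unpack/pack calls, or the zip/map XOR step.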