Optimizing removing duplicates in large files in Python
I have one very large text file (27 GB) that I am attempting to make smaller by removing lines that are duplicated in a second database made up of several files of a more reasonable size (500 MB-2 GB). I have some functional code; what I am wondering is whether there is any way to make it run faster in wall-clock time. At the moment, on a small test run with a 1.5 GB input and a 500 MB filter, this takes about 75 seconds to complete.

I've gone through many iterations of this idea, and this one is currently the best for time. If anyone has ideas for a better logical structure for the filter, I'd love to hear them. Past attempts, all of which were worse than this one:

- Loading the filter into a set and cycling through the input searching for duplicates (about half as fast as this; a rough sketch of that version is included after my current code below).
- Loading the input into a set and running the filter through difference_update (almost as fast as this, but doing the reverse of what I wanted).
- Loading both input and filter into sets in chunks and doing set differences (a horrible idea that might have worked if my filters were smaller, so that I didn't have to split them).

So those are all the things I've tried. All of these processes max out the CPU, and my final version runs at about 25-50% disk I/O; the filter and output are on one physical disk, and the input is on another. I am running a dual core and have no idea whether this particular script can be threaded; I've never done any multithreading before, so if that's a possibility I'd love to be pointed in the right direction.

Information about the data: as said above, the input is many times larger than the filter. I am expecting a very small percentage of duplication. The data is in lines, all of which are under 20 ASCII characters long. The files are all sorted.

I've already changed the order of the three comparisons, based on the expectation that unique input lines will be the majority of the lines, then unique filter lines, then duplicates; in the 'best' case of having no duplicates at all, that saved me about 10% of the time.

Any suggestions?

```python
def sortedfilter(input, filter, output):
    file_input = open(input, 'r')
    file_filter = open(filter, 'r')
    file_output = open(output, 'w')
    inline = file_input.next()
    filterline = file_filter.next()
    try:
        # Merge-style walk over the two sorted files.
        while inline and filterline:
            if inline < filterline:
                # Input line is not in the filter: keep it.
                file_output.write(inline)
                inline = file_input.next()
                continue
            if inline > filterline:
                # Filter is behind: advance it.
                filterline = file_filter.next()
                continue
            if inline == filterline:
                # Duplicate: skip the input line.
                filterline = file_filter.next()
                inline = file_input.next()
    except StopIteration:
        # One of the files ran out; copy whatever is left of the input.
        file_output.writelines(file_input.readlines())
    finally:
        file_filter.close()
        file_input.close()
        file_output.close()
```
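For reference, here is a minimal sketch of the first of those past attempts (loading the filter into a set and scanning the input line by line). It is illustrative rather than the exact script I ran, and the function and variable names are made up for this example:

```python
def setfilter(input_path, filter_path, output_path):
    # Illustrative sketch, not the exact script I ran.
    # Load every filter line into a set; the filter files (500 MB-2 GB) fit in memory.
    with open(filter_path, 'r') as file_filter:
        filter_lines = set(file_filter)
    # Stream the large input and keep only lines that are not in the filter set.
    with open(input_path, 'r') as file_input:
        with open(output_path, 'w') as file_output:
            for line in file_input:
                if line not in filter_lines:
                    file_output.write(line)
```

That version came out about half as fast as the sorted-merge version above for me, which is why I moved away from it.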