Remove duplicate rows from a large file in Python
<p>I have a CSV file that I want to remove duplicate rows from, but it's too large to fit into memory. I found a way to get it done, but my guess is that it's not the best way.</p>
<p>Each row contains 15 fields and several hundred characters, and all fields are needed to determine uniqueness. Instead of comparing the entire row to find a duplicate, I'm comparing <code>hash(row-as-a-string)</code> in an attempt to save memory. I set a filter that partitions the data into a roughly equal number of rows (e.g. days of the week), and each partition is small enough that a lookup table of hash values for that partition will fit in memory. I pass through the file once for each partition, checking for unique rows and writing them out to a second file (pseudo code):</p>
<pre><code>import csv

headers = ['DayOfWeek', 'a', 'b']
days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

# Raw strings keep '\b' and '\d' from being read as escape sequences.
outs = csv.DictWriter(open(r'c:\dedupedFile.csv', 'wb'), headers)
outs.writerow(dict(zip(headers, headers)))  # header row

for day in days:
    htable = {}  # hashes of the rows already written for this day
    ins = csv.DictReader(open(r'c:\bigfile.csv', 'rb'), headers)
    for line in ins:
        hvalue = hash(reduce(lambda x, y: x + y, line.itervalues()))
        if line['DayOfWeek'] == day:
            if hvalue in htable:
                pass
            else:
                htable[hvalue] = None
                outs.writerow(line)
</code></pre>
<p>One way I was thinking to speed this up is by finding a better filter to reduce the number of passes necessary. Assuming the length of the rows is uniformly distributed, maybe instead of</p>
<pre><code>for day in days:
</code></pre>
<p>and</p>
<pre><code>if line['DayOfWeek'] == day:
</code></pre>
<p>we have</p>
<pre><code>for i in range(n):
</code></pre>
<p>and</p>
<pre><code>if len(reduce(lambda x, y: x + y, line.itervalues())) % n == i:
</code></pre>
<p>where 'n' is as small as memory will allow. But this is still using the same method.</p>
<p><a href="https://stackoverflow.com/users/344286/wayne-werner" title="Wayne Werner">Wayne Werner</a> provided a good practical solution below; I was curious if there was a better/faster/simpler way to do this from an algorithm perspective.</p>
<p>P.S. I'm limited to Python 2.5.</p>
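The modulo-based filter described above could also partition on a hash of the row itself rather than on its length, which does not rely on row lengths being evenly distributed. The following is only a minimal sketch of that variation, not the asker's method: it is written for current Python 3 rather than 2.5, it swaps the built-in `hash()` for an MD5 digest so that values are stable between runs, and the names `row_digest`, `dedupe_in_passes`, and `open_input` are hypothetical. It also treats a header row like any other row.

```python
import csv
import hashlib
import io


def row_digest(row):
    """Return a stable 16-byte digest covering every field of a row."""
    # Join with a separator that is unlikely to occur in the data, so that
    # ('ab', 'c') and ('a', 'bc') do not produce the same string.
    joined = "\x1f".join(row).encode("utf-8")
    return hashlib.md5(joined).digest()


def dedupe_in_passes(open_input, out_file, n_passes):
    """Write the unique rows of the input to out_file in n_passes scans.

    open_input is a hypothetical zero-argument callable returning a fresh
    text-mode file object, so the input can be re-read on every pass.
    """
    writer = csv.writer(out_file)
    for i in range(n_passes):
        seen = set()  # digests belonging to partition i only
        with open_input() as f:
            for row in csv.reader(f):
                digest = row_digest(row)
                # Identical rows share a digest, so they always land in
                # the same partition and are compared in the same pass.
                if int.from_bytes(digest[:4], "big") % n_passes != i:
                    continue
                if digest not in seen:
                    seen.add(digest)
                    writer.writerow(row)
```

As with the day-of-week version, the output is grouped by partition rather than kept in the original order, and two different rows with the same digest would be treated as duplicates, although that is far less likely with a 128-bit digest than with the built-in `hash()`. Memory use per pass falls roughly in proportion to `n_passes`, at the cost of one full read of the input per pass.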