Optimizing removing duplicates in large files in Python
I have one very large text file (27 GB) that I am attempting to make smaller by removing lines that are duplicated in a second database made up of several files of a more reasonable size (500 MB-2 GB). I have some functional code; what I am wondering is whether there is any way to make it run faster in wall-clock time. At the moment, on a small test run with a 1.5 GB input and a 500 MB filter, this takes about 75 seconds to complete.

I've gone through many iterations of this idea, and this one is currently the best for time. If anyone has ideas for a better logical structure for the filter, I'd love to hear them. Past attempts, all of which were worse than this one:

- Loading the filter into a set and cycling through the input searching for duplicates (about half as fast as this; a rough sketch of that version is included after my current code below).
- Loading the input into a set and running the filter through difference_update (almost as fast as this, but doing the reverse of what I wanted).
- Loading both input and filter into sets in chunks and doing set differences (a horrible idea that might have worked if my filters were smaller, so that I didn't have to split them).

So those are all the things I've tried. All of these processes max out the CPU, and my final version runs at about 25-50% disk I/O; the filter and output are on one physical disk, and the input is on another. I am running a dual core and have no idea whether this particular script can be threaded; I've never done any multithreading before, so if that's a possibility I'd love to be pointed in the right direction.

Information about the data: as said above, the input is many times larger than the filter. I am expecting a very small percentage of duplication. The data is in lines, all of which are under 20 ASCII characters long. The files are all sorted.

I've already changed the order of the three comparisons, based on the expectation that unique input lines will be the majority of the lines, then unique filter lines, then duplicates; in the 'best' case of having no duplicates at all, that saved me about 10% of the time.

Any suggestions?

```python
def sortedfilter(input, filter, output):
    file_input = open(input, 'r')
    file_filter = open(filter, 'r')
    file_output = open(output, 'w')
    inline = file_input.next()
    filterline = file_filter.next()
    try:
        # Merge-style walk over the two sorted files.
        while inline and filterline:
            if inline < filterline:
                # Input line is not in the filter: keep it.
                file_output.write(inline)
                inline = file_input.next()
                continue
            if inline > filterline:
                # Filter is behind: advance it.
                filterline = file_filter.next()
                continue
            if inline == filterline:
                # Duplicate: skip the input line.
                filterline = file_filter.next()
                inline = file_input.next()
    except StopIteration:
        # One of the files ran out; copy whatever is left of the input.
        file_output.writelines(file_input.readlines())
    finally:
        file_filter.close()
        file_input.close()
        file_output.close()
```
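For reference, here is a minimal sketch of the first of those past attempts (loading the filter into a set and scanning the input line by line). It is illustrative rather than the exact script I ran, and the function and variable names are made up for this example:

```python
def setfilter(input_path, filter_path, output_path):
    # Illustrative sketch, not the exact script I ran.
    # Load every filter line into a set; the filter files (500 MB-2 GB) fit in memory.
    with open(filter_path, 'r') as file_filter:
        filter_lines = set(file_filter)
    # Stream the large input and keep only lines that are not in the filter set.
    with open(input_path, 'r') as file_input:
        with open(output_path, 'w') as file_output:
            for line in file_input:
                if line not in filter_lines:
                    file_output.write(line)
```

That version came out about half as fast as the sorted-merge version above for me, which is why I moved away from it.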