
How to merge only the unique lines from file_a to file_b?
<p>This question has been asked here in one form or another, but not quite the thing I'm looking for. So, this is the situation I shall be having: I already have one file, named <code>file_a</code>, and I'm creating another file, <code>file_b</code>. file_a is always bigger than file_b in size. There will be a number of duplicate lines in file_b (hence, in file_a as well), but both files will have some unique lines. What I want to do is copy/merge only the unique lines from <code>file_a</code> to <code>file_b</code> and then sort the line order, so that file_b becomes the most up-to-date one with all the unique entries. Neither of the original files should be more than 10MB in size. What's the most efficient (and fastest) way I can do that?</p>

<p>I was thinking of something like this, which does the merging all right.</p>

<pre><code>#!/usr/bin/env python
import os, time, sys

# Convert date/time to epoch
def toEpoch(dt):
    dt_ptrn = '%d/%m/%y %H:%M:%S'
    return int(time.mktime(time.strptime(dt, dt_ptrn)))

# Input files
o_file = "file_a"
c_file = "file_b"
n_file = [o_file, c_file]
m_file = "merged.file"

for x in range(len(n_file)):
    P = open(n_file[x], "r")
    output = P.readlines()
    P.close()

# Sort the output, order by 2nd last field
#sp_lines = [ line.split('\t') for line in output ]
#sp_lines.sort( lambda a, b: cmp(toEpoch(a[-2]), toEpoch(b[-2])) )

F = open(m_file, 'w')
#for line in sp_lines:
for line in output:
    if "group_" in line:
        F.write(line)
F.close()
</code></pre>

<p>But, it:</p>
<ul>
<li>does not keep only the unique lines</li>
<li>is not sorted (by the next-to-last field)</li>
<li>introduces a 3rd file, i.e. <code>m_file</code></li>
</ul>

<p>Just a side note (long story short): I can't use sorted() here as I'm using v2.3, unfortunately.
The input files look like this:</p>

<pre><code>On 23/03/11 00:40:03
JobID   Group.User            Ctime  Wtime  Status  QDate              CDate
===================================================================================
430792  group_atlas.pltatl16      0     32       4  02/03/11 21:52:38  02/03/11 22:02:15
430793  group_atlas.atlas084     30    472       4  02/03/11 21:57:43  02/03/11 22:09:35
430794  group_atlas.atlas084     12    181       4  02/03/11 22:02:37  02/03/11 22:05:42
430796  group_atlas.atlas084      8    185       4  02/03/11 22:02:38  02/03/11 22:05:46
</code></pre>

<p>I tried to use cmp() to sort by the 2nd-to-last field but, I think, it doesn't work because of the first 3 lines of the input files.</p>

<p>Can anyone please help? Cheers!!!</p>

<hr>

<p><strong>Update 1:</strong></p>

<p>For future reference, as suggested by Jakob, here is the complete script. It worked just fine.</p>

<pre><code>#!/usr/bin/env python
import os, time, sys
from sets import Set as set

def toEpoch(dt):
    dt_ptrn = '%d/%m/%y %H:%M:%S'
    return int(time.mktime(time.strptime(dt, dt_ptrn)))

def yield_lines(fileobj):
    # I want to discard the headers
    for i in xrange(3):
        fileobj.readline()
    for line in fileobj:
        yield line

def app(path1, path2):
    file1 = set(yield_lines(open(path1)))
    file2 = set(yield_lines(open(path2)))
    return file1.union(file2)

# Input files
o_file = "testScript/03"
c_file = "03.bak"
m_file = "finished.file"

print time.strftime('%H:%M:%S', time.localtime())

# Sorting the output, order by 2nd last field
sp_lines = [ line.split('\t') for line in app(o_file, c_file) ]
sp_lines.sort( lambda a, b: cmp(toEpoch(a[-2]), toEpoch(b[-2])) )

F = open(m_file, 'w')
print "No. of lines: ", len(sp_lines)
for line in sp_lines:
    MF = '\t'.join(line)
    F.write(MF)
F.close()
</code></pre>

<p>It took about 2m:47s to finish for 145244 lines.</p>

<pre><code>[testac1@serv07 ~]$ ./uniq-merge.py
17:19:21
No. of lines:  145244
17:22:08
</code></pre>

<p>Thanks!!</p>

<hr>

<p><strong>Update 2:</strong></p>

<p>Hi eyquem, this is the error message I get when I run your script(s).
</p>

<p><strong><em>From the first script:</em></strong></p>

<pre><code>[testac1@serv07 ~]$ ./uniq-merge_2.py
  File "./uniq-merge_2.py", line 44
    fm.writelines( '\n'.join(v)+'\n' for k,v in output )
                                     ^
SyntaxError: invalid syntax
</code></pre>

<p><strong><em>From the second script:</em></strong></p>

<pre><code>[testac1@serv07 ~]$ ./uniq-merge_3.py
  File "./uniq-merge_3.py", line 24
    output = sett(line.rstrip() for line in fa)
                               ^
SyntaxError: invalid syntax
</code></pre>

<p>Cheers!!</p>

<hr>

<p><strong>Update 3:</strong></p>

<p>The previous one wasn't sorting the list at all. Thanks to eyquem for pointing that out. Well, it does now. This is a further modification of Jakob's version - I converted the set returned by <code>app(path1, path2)</code> to a list, <code>myList</code>, and then applied <code>sort( lambda ... )</code> to <code>myList</code> to sort the merged file by the next-to-last field. This is the final script.</p>

<pre><code>#!/usr/bin/env python
import os, time, sys
from sets import Set as set

def toEpoch(dt):
    # Convert date/time to epoch
    dt_ptrn = '%d/%m/%y %H:%M:%S'
    return int(time.mktime(time.strptime(dt, dt_ptrn)))

def yield_lines(fileobj):
    # Discard the headers (1st 3 lines)
    for i in xrange(3):
        fileobj.readline()
    for line in fileobj:
        yield line

def app(path1, path2):
    # Remove duplicate lines
    file1 = set(yield_lines(open(path1)))
    file2 = set(yield_lines(open(path2)))
    return file1.union(file2)

print time.strftime('%H:%M:%S', time.localtime())

# I/O files
o_file = "testScript/03"
c_file = "03.bak"
m_file = "finished.file"

# Convert the set into a list
myList = list(app(o_file, c_file))

# Sort the list by the date
sp_lines = [ line.split('\t') for line in myList ]
sp_lines.sort( lambda a, b: cmp(toEpoch(a[-2]), toEpoch(b[-2])) )

F = open(m_file, 'w')
print "No. of lines: ", len(sp_lines)

# Finally write to the outFile
for line in sp_lines:
    MF = '\t'.join(line)
    F.write(MF)
F.close()
</code></pre>

<p>There is no speed boost at all; it took 2m:50s to process the same 145244 lines.
If anyone sees any scope for improvement, please let me know. Thanks to Jakob and eyquem for their time. Cheers!!</p>

<hr>

<p><strong>Update 4:</strong></p>

<p>Just for future reference, this is a modified version of <strong><em>eyquem</em></strong>'s, which works much better and faster than the previous ones.</p>

<pre><code>#!/usr/bin/env python
import os, sys, re
from sets import Set as sett
from time import mktime, strptime, strftime

def sorting_merge(o_file, c_file, m_file):
    # RegEx for the date/time field
    pat = re.compile('[0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d')

    def kl(lines, pat=pat):
        # Match only the next-to-last field
        line = lines.split('\t')
        line = line[-2]
        return mktime(strptime((pat.search(line).group()), '%d/%m/%y %H:%M:%S'))

    output = sett()
    head = []

    # Separate the header &amp; remove the duplicates
    def rmHead(f_n):
        f_n.readline()
        for line1 in f_n:
            if pat.search(line1):
                break
            else:
                head.append(line1)  # line of the header
        for line in f_n:
            output.add(line.rstrip())
        output.add(line1.rstrip())
        f_n.close()

    fa = open(o_file, 'r')
    rmHead(fa)
    fb = open(c_file, 'r')
    rmHead(fb)

    # Sorting date-wise
    output = [ (kl(line), line.rstrip()) for line in output if line.rstrip() ]
    output.sort()

    fm = open(m_file, 'w')
    # Write to the file &amp; add the header
    fm.write(strftime('On %d/%m/%y %H:%M:%S\n') + (''.join(head[0]+head[1])))
    for t, line in output:
        fm.write(line + '\n')
    fm.close()

c_f = "03_a"
o_f = "03_b"

sorting_merge(o_f, c_f, 'outfile.txt')
</code></pre>

<p>This version is much faster - 6.99 sec. for 145244 lines, compared to 2m:47s for the previous one using <code>lambda a, b: cmp()</code>. Thanks to eyquem for all his support. Cheers!!</p>
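<hr>

<p><strong>Side note for later readers:</strong> on Python 2.4 and newer, the whole merge-and-sort above collapses into a few lines, because <code>sorted()</code> accepts a <code>key</code> function and the built-in <code>set</code> replaces <code>sets.Set</code>. This is only a minimal sketch of the same idea, not a drop-in replacement for the scripts: it assumes, as the examples above do, tab-separated records, a 3-line header, and the timestamp in the next-to-last field.</p>

```python
import time

def to_epoch(field):
    # Parse a 'dd/mm/yy HH:MM:SS' timestamp into seconds since the epoch
    return int(time.mktime(time.strptime(field, '%d/%m/%y %H:%M:%S')))

def data_lines(path, skip=3):
    # Yield tab-separated data lines, discarding the header block
    with open(path) as f:
        for _ in range(skip):
            f.readline()
        for line in f:
            if line.strip():
                yield line.rstrip('\n')

def merge_unique_sorted(path_a, path_b):
    # Union of both files' data lines, ordered by the next-to-last field
    merged = set(data_lines(path_a)) | set(data_lines(path_b))
    return sorted(merged, key=lambda ln: to_epoch(ln.split('\t')[-2]))
```

<p>The set union does the de-duplication, and the <code>key=</code> argument replaces the <code>cmp()</code>-based sort, which is also one reason the later versions run faster: each timestamp is parsed once instead of once per comparison.</p>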
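<hr>

<p><strong>Side note on the SyntaxErrors in Update 2:</strong> both failing lines use a generator expression, e.g. <code>sett(line.rstrip() for line in fa)</code>. Generator expressions were only added in Python 2.4 (PEP 289), so on the 2.3 interpreter used here they are invalid syntax. Wrapping the expression in square brackets turns it into a list comprehension, which 2.3 does accept. A small sketch of the two equivalent forms (written with the modern built-in <code>set</code>; on 2.3 you would use <code>sets.Set</code>):</p>

```python
# Two equivalent ways to build a set of de-duplicated, stripped lines.
lines = ['430792\tgroup_atlas\n', '430793\tgroup_atlas\n', '430792\tgroup_atlas\n']

# Generator-expression form (valid only on Python 2.4+; this is the syntax
# that raises SyntaxError on a 2.3 interpreter):
unique_gen = set(line.rstrip() for line in lines)

# List-comprehension form (also valid on 2.3): the extra brackets build a
# list first, which the set constructor then consumes.
unique_list = set([line.rstrip() for line in lines])

assert unique_gen == unique_list
```

<p>The list-comprehension form holds the whole intermediate list in memory, but for files capped at 10MB that difference is negligible.</p>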
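<hr>

<p><strong>Side note on the header handling in Update 4:</strong> that script tells header lines apart from data lines with a regular expression matching the <code>dd/mm/yy HH:MM:SS</code> timestamps, after discarding the very first <code>On dd/mm/yy ...</code> line separately (that line also contains a timestamp and would otherwise match). A minimal sketch of that split, using the same pattern, with the leading line assumed to be already consumed:</p>

```python
import re

# Timestamp pattern from Update 4: dd/mm/yy HH:MM:SS
pat = re.compile(r'[0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d')

def split_header(lines):
    # Header lines are everything before the first line containing a
    # timestamp; from that line on, everything counts as data.
    head, data = [], []
    for line in lines:
        if data or pat.search(line):
            data.append(line)
        else:
            head.append(line)
    return head, data
```

<p>This is why the earlier <code>cmp()</code> attempt failed on the raw files: the first 3 lines carry no parseable record, so any sort keyed on a timestamp field has to skip or separate them first.</p>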
 
