
How to merge only the unique lines from file_a to file_b?
<p>This question has been asked here in one form or another, but not quite the thing I'm looking for. So, this is the situation I shall be having: I already have one file, named <code>file_a</code>, and I'm creating another file, <code>file_b</code>. file_a is always bigger than file_b in size. There will be a number of duplicate lines in file_b (hence, in file_a as well), but both files will have some unique lines. What I want to do is copy/merge only the unique lines from <code>file_a</code> to <code>file_b</code> and then sort the line order, so that file_b becomes the most up-to-date one with all the unique entries. Neither of the original files should be more than 10MB in size. What's the most efficient (and fastest) way I can do that?</p>

<p>I was thinking of something like this, which does the merging all right.</p>

<pre><code>#!/usr/bin/env python
import os, time, sys

# Convert date/time to epoch
def toEpoch(dt):
    dt_ptrn = '%d/%m/%y %H:%M:%S'
    return int(time.mktime(time.strptime(dt, dt_ptrn)))

# Input files
o_file = "file_a"
c_file = "file_b"
n_file = [o_file, c_file]
m_file = "merged.file"

for x in range(len(n_file)):
    P = open(n_file[x], "r")
    output = P.readlines()
    P.close()

# Sort the output, order by 2nd last field
#sp_lines = [ line.split('\t') for line in output ]
#sp_lines.sort( lambda a, b: cmp(toEpoch(a[-2]), toEpoch(b[-2])) )

F = open(m_file, 'w')
#for line in sp_lines:
for line in output:
    if "group_" in line:
        F.write(line)
F.close()
</code></pre>

<p>But, it:</p>
<ul>
<li>does not keep only the unique lines</li>
<li>is not sorted (by the next-to-last field)</li>
<li>introduces a 3rd file, i.e. <code>m_file</code></li>
</ul>

<p>Just a side note (long story short): I can't use sorted() here as I'm using v2.3, unfortunately.
The input files look like this:</p>

<pre><code>On 23/03/11 00:40:03
JobID   Group.User            Ctime  Wtime  Status  QDate              CDate
===================================================================================
430792  group_atlas.pltatl16      0     32       4  02/03/11 21:52:38  02/03/11 22:02:15
430793  group_atlas.atlas084     30    472       4  02/03/11 21:57:43  02/03/11 22:09:35
430794  group_atlas.atlas084     12    181       4  02/03/11 22:02:37  02/03/11 22:05:42
430796  group_atlas.atlas084      8    185       4  02/03/11 22:02:38  02/03/11 22:05:46
</code></pre>

<p>I tried to use cmp() to sort by the 2nd-to-last field but, I think, it doesn't work because of the first 3 lines of the input files.</p>

<p>Can anyone please help? Cheers!!!</p>

<hr>

<p><strong>Update 1:</strong></p>

<p>For future reference, as suggested by Jakob, here is the complete script. It worked just fine.</p>

<pre><code>#!/usr/bin/env python
import os, time, sys
from sets import Set as set

def toEpoch(dt):
    dt_ptrn = '%d/%m/%y %H:%M:%S'
    return int(time.mktime(time.strptime(dt, dt_ptrn)))

def yield_lines(fileobj):
    # I want to discard the headers
    for i in xrange(3):
        fileobj.readline()
    for line in fileobj:
        yield line

def app(path1, path2):
    file1 = set(yield_lines(open(path1)))
    file2 = set(yield_lines(open(path2)))
    return file1.union(file2)

# Input files
o_file = "testScript/03"
c_file = "03.bak"
m_file = "finished.file"

print time.strftime('%H:%M:%S', time.localtime())

# Sorting the output, order by 2nd last field
sp_lines = [ line.split('\t') for line in app(o_file, c_file) ]
sp_lines.sort( lambda a, b: cmp(toEpoch(a[-2]), toEpoch(b[-2])) )

F = open(m_file, 'w')
print "No. of lines: ", len(sp_lines)
for line in sp_lines:
    MF = '\t'.join(line)
    F.write(MF)
F.close()
</code></pre>

<p>It took about 2m:47s to finish for 145244 lines.</p>

<pre><code>[testac1@serv07 ~]$ ./uniq-merge.py
17:19:21
No. of lines:  145244
17:22:08
</code></pre>

<p>Thanks!!</p>

<hr>

<p><strong>Update 2:</strong></p>

<p>Hi eyquem, this is the error message I get when I run your script(s).
</p>

<p><strong><em>From the first script:</em></strong></p>

<pre><code>[testac1@serv07 ~]$ ./uniq-merge_2.py
  File "./uniq-merge_2.py", line 44
    fm.writelines( '\n'.join(v)+'\n' for k,v in output )
                                     ^
SyntaxError: invalid syntax
</code></pre>

<p><strong><em>From the second script:</em></strong></p>

<pre><code>[testac1@serv07 ~]$ ./uniq-merge_3.py
  File "./uniq-merge_3.py", line 24
    output = sett(line.rstrip() for line in fa)
                               ^
SyntaxError: invalid syntax
</code></pre>

<p>Cheers!!</p>

<hr>

<p><strong>Update 3:</strong></p>

<p>The previous one wasn't sorting the list at all. Thanks to eyquem for pointing that out. Well, it does now. This is a further modification of Jakob's version - I converted the set returned by <code>app(path1, path2)</code> to a list, <code>myList</code>, and then applied <code>sort( lambda ... )</code> to <code>myList</code> to sort the merged file by the next-to-last field. This is the final script.</p>

<pre><code>#!/usr/bin/env python
import os, time, sys
from sets import Set as set

def toEpoch(dt):
    # Convert date/time to epoch
    dt_ptrn = '%d/%m/%y %H:%M:%S'
    return int(time.mktime(time.strptime(dt, dt_ptrn)))

def yield_lines(fileobj):
    # Discard the headers (1st 3 lines)
    for i in xrange(3):
        fileobj.readline()
    for line in fileobj:
        yield line

def app(path1, path2):
    # Remove duplicate lines
    file1 = set(yield_lines(open(path1)))
    file2 = set(yield_lines(open(path2)))
    return file1.union(file2)

print time.strftime('%H:%M:%S', time.localtime())

# I/O files
o_file = "testScript/03"
c_file = "03.bak"
m_file = "finished.file"

# Convert the set into a list
myList = list(app(o_file, c_file))

# Sort the list by the date
sp_lines = [ line.split('\t') for line in myList ]
sp_lines.sort( lambda a, b: cmp(toEpoch(a[-2]), toEpoch(b[-2])) )

F = open(m_file, 'w')
print "No. of lines: ", len(sp_lines)

# Finally write to the outFile
for line in sp_lines:
    MF = '\t'.join(line)
    F.write(MF)
F.close()
</code></pre>

<p>There is no speed boost at all; it took 2m:50s to process the same 145244 lines.
If anyone sees any scope for improvement, please let me know. Thanks to Jakob and eyquem for their time. Cheers!!</p>

<hr>

<p><strong>Update 4:</strong></p>

<p>Just for future reference, this is a modified version of <strong><em>eyquem</em></strong>'s, which works much better and faster than the previous ones.</p>

<pre><code>#!/usr/bin/env python
import os, sys, re
from sets import Set as sett
from time import mktime, strptime, strftime

def sorting_merge(o_file, c_file, m_file):
    # RegEx for the date/time field
    pat = re.compile('[0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d')

    def kl(lines, pat=pat):
        # Match only the next-to-last field
        line = lines.split('\t')
        line = line[-2]
        return mktime(strptime((pat.search(line).group()), '%d/%m/%y %H:%M:%S'))

    output = sett()
    head = []

    # Separate the header &amp; remove the duplicates
    def rmHead(f_n):
        f_n.readline()
        for line1 in f_n:
            if pat.search(line1):
                break
            else:
                head.append(line1)  # line of the header
        for line in f_n:
            output.add(line.rstrip())
        output.add(line1.rstrip())
        f_n.close()

    fa = open(o_file, 'r')
    rmHead(fa)
    fb = open(c_file, 'r')
    rmHead(fb)

    # Sorting date-wise
    output = [ (kl(line), line.rstrip()) for line in output if line.rstrip() ]
    output.sort()

    fm = open(m_file, 'w')
    # Write to the file &amp; add the header
    fm.write(strftime('On %d/%m/%y %H:%M:%S\n') + (''.join(head[0]+head[1])))
    for t, line in output:
        fm.write(line + '\n')
    fm.close()

c_f = "03_a"
o_f = "03_b"

sorting_merge(o_f, c_f, 'outfile.txt')
</code></pre>

<p>This version is much faster - 6.99 sec. for 145244 lines, compared to 2m:47s for the previous one using <code>lambda a, b: cmp()</code>. Thanks to eyquem for all his support. Cheers!!</p>
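<hr>

<p><strong>Side note for later readers:</strong> on Python 2.4 and newer, the whole merge-and-sort above collapses into a few lines, because <code>sorted()</code> accepts a <code>key</code> function and the built-in <code>set</code> replaces <code>sets.Set</code>. This is only a minimal sketch of the same idea, not a drop-in replacement for the scripts: it assumes, as the examples above do, tab-separated records, a 3-line header, and the timestamp in the next-to-last field.</p>

```python
import time

def to_epoch(field):
    # Parse a 'dd/mm/yy HH:MM:SS' timestamp into seconds since the epoch
    return int(time.mktime(time.strptime(field, '%d/%m/%y %H:%M:%S')))

def data_lines(path, skip=3):
    # Yield tab-separated data lines, discarding the header block
    with open(path) as f:
        for _ in range(skip):
            f.readline()
        for line in f:
            if line.strip():
                yield line.rstrip('\n')

def merge_unique_sorted(path_a, path_b):
    # Union of both files' data lines, ordered by the next-to-last field
    merged = set(data_lines(path_a)) | set(data_lines(path_b))
    return sorted(merged, key=lambda ln: to_epoch(ln.split('\t')[-2]))
```

<p>The set union does the de-duplication, and the <code>key=</code> argument replaces the <code>cmp()</code>-based sort, which is also one reason the later versions run faster: each timestamp is parsed once instead of once per comparison.</p>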
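<hr>

<p><strong>Side note on the SyntaxErrors in Update 2:</strong> both failing lines use a generator expression, e.g. <code>sett(line.rstrip() for line in fa)</code>. Generator expressions were only added in Python 2.4 (PEP 289), so on the 2.3 interpreter used here they are invalid syntax. Wrapping the expression in square brackets turns it into a list comprehension, which 2.3 does accept. A small sketch of the two equivalent forms (written with the modern built-in <code>set</code>; on 2.3 you would use <code>sets.Set</code>):</p>

```python
# Two equivalent ways to build a set of de-duplicated, stripped lines.
lines = ['430792\tgroup_atlas\n', '430793\tgroup_atlas\n', '430792\tgroup_atlas\n']

# Generator-expression form (valid only on Python 2.4+; this is the syntax
# that raises SyntaxError on a 2.3 interpreter):
unique_gen = set(line.rstrip() for line in lines)

# List-comprehension form (also valid on 2.3): the extra brackets build a
# list first, which the set constructor then consumes.
unique_list = set([line.rstrip() for line in lines])

assert unique_gen == unique_list
```

<p>The list-comprehension form holds the whole intermediate list in memory, but for files capped at 10MB that difference is negligible.</p>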
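<hr>

<p><strong>Side note on the header handling in Update 4:</strong> that script tells header lines apart from data lines with a regular expression matching the <code>dd/mm/yy HH:MM:SS</code> timestamps, after discarding the very first <code>On dd/mm/yy ...</code> line separately (that line also contains a timestamp and would otherwise match). A minimal sketch of that split, using the same pattern, with the leading line assumed to be already consumed:</p>

```python
import re

# Timestamp pattern from Update 4: dd/mm/yy HH:MM:SS
pat = re.compile(r'[0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d')

def split_header(lines):
    # Header lines are everything before the first line containing a
    # timestamp; from that line on, everything counts as data.
    head, data = [], []
    for line in lines:
        if data or pat.search(line):
            data.append(line)
        else:
            head.append(line)
    return head, data
```

<p>This is why the earlier <code>cmp()</code> attempt failed on the raw files: the first 3 lines carry no parseable record, so any sort keyed on a timestamp field has to skip or separate them first.</p>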
 
