StackOverflow2013

<p>When text has to be processed just to extract data from it, I always think of regexes first, because:</p>

<ul>
<li><p>as far as I know, regexes were invented for exactly that</p></li>
<li><p>iterating over lines seems clumsy to me: it essentially consists of searching for the newlines and then searching each line for the data to extract; that makes two searches instead of a single direct one with a regex</p></li>
<li><p>bringing regexes into play is easy; only writing the pattern string to be compiled into a regex object is sometimes hard, but in such cases the line-by-line treatment will be complicated too</p></li>
</ul>

<p>For the problem discussed here, a regex solution is fast and easy to write:</p>

<pre><code>import re
names = re.findall('\S+',open(filename).read())
</code></pre>

<p>I compared the speeds of several solutions:</p>

<pre><code># Python 2.7 (uses xrange, print statements and time.clock)
import re
from time import clock

# one list of timings per solution
A,AA,B1,B2,BS,reg = [],[],[],[],[],[]
D,Dsh,C1,C2 = [],[],[],[]
F1,F2,F3 = [],[],[]

def nonblank_lines(f):
    for l in f:
        line = l.rstrip()
        if line:
            yield line

def short_nonblank_lines(f):
    for l in f:
        line = l[0:-1]
        if line:
            yield line

for essays in xrange(50):

    te = clock()
    with open('raa.txt') as f:
        names_listA = [line.strip() for line in f if line.strip()] # Felix Kling
    A.append(clock()-te)

    te = clock()
    with open('raa.txt') as f:
        names_listAA = [line[0:-1] for line in f if line[0:-1]] # Felix Kling with line[0:-1]
    AA.append(clock()-te)

    #-------------------------------------------------------
    te = clock()
    with open('raa.txt') as f_in:
        namesB1 = [ name for name in (l.strip() for l in f_in) if name ] # aaronasterling without list()
    B1.append(clock()-te)

    te = clock()
    with open('raa.txt') as f_in:
        namesB2 = [ name for name in (l[0:-1] for l in f_in) if name ] # aaronasterling without list() and with line[0:-1]
    B2.append(clock()-te)

    te = clock()
    with open('raa.txt') as f_in:
        namesBS = [ name for name in f_in.read().splitlines() if name ] # a list comprehension with read().splitlines()
    BS.append(clock()-te)

    #-------------------------------------------------------
    te = clock()
    with open('raa.txt') as f:
        xreg = re.findall('\S+',f.read()) # eyquem
    reg.append(clock()-te)

    #-------------------------------------------------------
    te = clock()
    with open('raa.txt') as f_in:
        linesC1 = list(line for line in (l.strip() for l in f_in) if line) # aaronasterling
    C1.append(clock()-te)

    te = clock()
    with open('raa.txt') as f_in:
        linesC2 = list(line for line in (l[0:-1] for l in f_in) if line) # aaronasterling with line[0:-1]
    C2.append(clock()-te)

    #-------------------------------------------------------
    te = clock()
    with open('raa.txt') as f_in:
        yD = [ line for line in nonblank_lines(f_in) ] # aaronasterling update
    D.append(clock()-te)

    te = clock()
    with open('raa.txt') as f_in:
        yDsh = [ name for name in short_nonblank_lines(f_in) ] # nonblank_lines with line[0:-1]
    Dsh.append(clock()-te)

    #-------------------------------------------------------
    te = clock()
    with open('raa.txt') as f_in:
        linesF1 = filter(None, (line.rstrip() for line in f_in)) # aaronasterling update 2
    F1.append(clock()-te)

    te = clock()
    with open('raa.txt') as f_in:
        linesF2 = filter(None, (line[0:-1] for line in f_in)) # aaronasterling update 2 with line[0:-1]
    F2.append(clock()-te)

    te = clock()
    with open('raa.txt') as f_in:
        linesF3 = filter(None, f_in.read().splitlines()) # aaronasterling update 2 with read().splitlines()
    F3.append(clock()-te)


# check that all solutions produce the same list
print 'names_listA == names_listAA==namesB1==namesB2==namesBS==xreg\n is ',\
       names_listA == names_listAA==namesB1==namesB2==namesBS==xreg
print 'names_listA == yD==yDsh==linesC1==linesC2==linesF1==linesF2==linesF3\n is ',\
       names_listA == yD==yDsh==linesC1==linesC2==linesF1==linesF2==linesF3,'\n\n\n'


# display the best (minimum) time of each solution
def displ((fr,it,what)):
    print fr + str( min(it) )[0:7] + ' ' + what

map(displ,(('* ', A, '[line.strip() for line in f if line.strip()] * Felix Kling\n'),
           ('  ', B1, ' [name for name in (l.strip() for l in f_in) if name ] aaronasterling without list()'),
           ('* ', C1, 'list(line for line in (l.strip() for l in f_in) if line) * aaronasterling\n'),
           ('* ', reg, 're.findall("\S+",f.read()) * eyquem\n'),
           ('* ', D, '[ line for line in nonblank_lines(f_in) ] * aaronasterling update'),
           ('  ', Dsh, '[ line for line in short_nonblank_lines(f_in) ] nonblank_lines with line[0:-1]\n'),
           ('* ', F1 , 'filter(None, (line.rstrip() for line in f_in)) * aaronasterling update 2\n'),
           ('  ', B2, ' [name for name in (l[0:-1] for l in f_in) if name ] aaronasterling without list() and with line[0:-1]'),
           ('  ', C2, 'list(line for line in (l[0:-1] for l in f_in) if line) aaronasterling with line[0:-1]\n'),
           ('  ', AA, '[line[0:-1] for line in f if line[0:-1] ] Felix Kling with line[0:-1]\n'),
           ('  ', BS, '[name for name in f_in.read().splitlines() if name ] a list comprehension with read().splitlines()\n'),
           ('  ', F2 , 'filter(None, (line[0:-1] for line in f_in)) aaronasterling update 2 with line[0:-1]'),
           ('  ', F3 , 'filter(None, f_in.read().splitlines() aaronasterling update 2 with read().splitlines()'))
    )
</code></pre>

<p>The regex solution is straightforward and neat, though it isn't among the fastest. aaronasterling's solution with filter() is surprisingly fast to me (I wasn't aware that filter() was this fast), and the optimized solutions get down to 27 % of the longest time. I wonder what makes the filter/splitlines combination so miraculous:</p>

<pre><code>names_listA == names_listAA==namesB1==namesB2==namesBS==xreg
 is  True
names_listA == yD==yDsh==linesC1==linesC2==linesF1==linesF2==linesF3
 is  True



* 0.08266 [line.strip() for line in f if line.strip()] * Felix Kling

  0.07535  [name for name in (l.strip() for l in f_in) if name ] aaronasterling without list()
* 0.06912 list(line for line in (l.strip() for l in f_in) if line) * aaronasterling

* 0.06612 re.findall("\S+",f.read()) * eyquem

* 0.06486 [ line for line in nonblank_lines(f_in) ] * aaronasterling update
  0.05264 [ line for line in short_nonblank_lines(f_in) ] nonblank_lines with line[0:-1]

* 0.05451 filter(None, (line.rstrip() for line in f_in)) * aaronasterling update 2

  0.04689  [name for name in (l[0:-1] for l in f_in) if name ] aaronasterling without list() and with line[0:-1]
  0.04582 list(line for line in (l[0:-1] for l in f_in) if line) aaronasterling with line[0:-1]

  0.04171 [line[0:-1] for line in f if line[0:-1] ] Felix Kling with line[0:-1]

  0.03265 [name for name in f_in.read().splitlines() if name ] a list comprehension with read().splitlines()

  0.03638 filter(None, (line[0:-1] for line in f_in)) aaronasterling update 2 with line[0:-1]
  0.02198 filter(None, f_in.read().splitlines() aaronasterling update 2 with read().splitlines()
</code></pre>

<p>But this problem is a special case, the simplest of all: only one name on each line. So the solutions are just games with lines, splits and [0:-1] slices.</p>

<p>A regex, by contrast, doesn't care about lines; it goes straight to the desired data. I consider it a more natural approach that applies from the simplest to the most complex cases, and hence it is often the one to prefer when processing text.</p>

<p>EDIT</p>

<p>I forgot to say that I use Python 2.7 and that I measured the above times with a file containing the following sequence repeated 500 times:</p>

<pre><code>SMITH
JONES
WILLIAMS
TAYLOR
BROWN
DAVIES
EVANS
WILSON
THOMAS
JOHNSON
ROBERTS
ROBINSON
THOMPSON
WRIGHT
WALKER
WHITE
EDWARDS
HUGHES
GREEN
HALL
LEWIS
HARRIS
CLARKE
PATEL
JACKSON
WOOD
TURNER
MARTIN
COOPER
HILL
WARD
MORRIS
MOORE
CLARK
LEE
KING
BAKER
HARRISON
MORGAN
ALLEN
JAMES
SCOTT
PHILLIPS
WATSON
DAVIS
PARKER
PRICE
BENNETT
YOUNG
GRIFFITHS
MITCHELL
KELLY
COOK
CARTER
RICHARDSON
BAILEY
COLLINS
BELL
SHAW
MURPHY
MILLER
COX
RICHARDS
KHAN
MARSHALL
ANDERSON
SIMPSON
ELLIS
ADAMS
SINGH
BEGUM
WILKINSON
FOSTER
CHAPMAN
POWELL
WEBB
ROGERS
GRAY
MASON
ALI
HUNT
HUSSAIN
CAMPBELL
MATTHEWS
OWEN
PALMER
HOLMES
MILLS
BARNES
KNIGHT
LLOYD
BUTLER
RUSSELL
BARKER
FISHER
STEVENS
JENKINS
MURRAY
DIXON
HARVEY
</code></pre>
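<p>To illustrate the point that a regex does not depend on the line structure, here is a minimal sketch. The sample strings and the two helper names (names_by_regex, names_by_lines) are made up for this illustration and are not part of the benchmark; the snippet is written to run on Python 3, whereas the timings were taken on Python 2.7.</p>

```python
import re

# Hypothetical sample data (not the benchmark file): the second text puts
# several names on one line and contains a blank line and stray whitespace.
one_per_line = "SMITH\nJONES\n\nWILLIAMS\n"
several_per_line = "SMITH JONES\n\n  WILLIAMS\tTAYLOR \nBROWN\n"

def names_by_regex(text):
    """Return every run of non-whitespace characters, ignoring line structure."""
    return re.findall(r"\S+", text)

def names_by_lines(text):
    """Line-based variant: strip each line and keep the non-empty ones."""
    return [line.strip() for line in text.splitlines() if line.strip()]

# With one name per line, both approaches return the same list.
print(names_by_regex(one_per_line))
print(names_by_lines(one_per_line))

# With several names per line, the regex still returns single names, while
# the line-based variant returns whole lines and would need an extra split.
print(names_by_regex(several_per_line))
print(names_by_lines(several_per_line))
```

<p>On the second sample the line-based variant keeps "SMITH JONES" as one item, so it would need a further split per line, while the regex call stays unchanged.</p>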
