StackOverflow2013

Matching and unmatched lines from two large data sets
<p>I am using Python 2.4 and am pretty new to Python, programming in general, and regular expressions. I have a large module that currently outputs two separate streams (or datasets/files) of lines, stream A and stream B. I am trying to compare stream A to stream B to see whether any strings in stream B can be matched within any lines of stream A. I want to return all matching contents and all unmatched contents as two separate objects. Please <strong>see my issue, in bold,</strong> below. Does anyone know how I can overcome this problem, or have a best-practices recommendation?</p>
<p>So far, I have turned stream B ("realtimes") into a list ("regexes") and converted that list into a group of regular expressions ("combined"), using this code</p>
<p><em>please note I am not including all of the code in my module, just the part that I am stuck on</em>:</p>
<pre><code>regex = re.compile(r'.*\[(\d{2}:\d{2}:\d{2}\.\d{6})\].*')
optsymbx = re.compile(r'\[(\d{2}:\d{2}:\d{2}\.\d{6})\][\s]+(trade),(S|B),(\d{1,}),(\w+)[\s]+([0-9A-Z]+),(\d+\.\d+)')

regexes = []

def realtimes():
    for x in realtrades():
        x = str(x)
        m = re.match(regex,x)
        if m:
            #regexes.append(str(m.groups()))
            yield str(m.groups())

#make contents of realtimes into group of regular expressions
f = open(logfile,'r')
for x in realtimes():
    regexes.append(x)
combined = "(" + ")|(".join(regexes) + ")"
</code></pre>
<p>Then I look into stream A (lines in f), and check each line against "combined" and one additional regex criterion ("optsymbx"), to see if there is a match or not, as follows:</p>
<pre><code># checking if any lines in the logfile match "optsymbx" and any regular expressions within "combined"
f = open(logfile,'r')
for line in f:
    m = re.match(combined,line)
    mopt = re.match(optsymbx,line)
    if not m:
        if mopt:
            print line
</code></pre>
<p><strong>The issue is that streams A and B are very large. Stream A contains over 100,000 lines and stream B has several thousand.
So, when I turn the contents of stream B into a group of regular expressions ("combined"), it exceeds the limit of 100 named groups and I get the error shown in the traceback below.</strong> I have tested this and know it works when I reduce the contents of stream B to fewer than 100 named groups.</p>
<pre><code>Traceback (most recent call last):
  File "badtrades.py", line 121, in ?
    m = re.match(combined,line)
  File "/usr/lib64/python2.4/sre.py", line 129, in match
    return _compile(pattern, flags).match(string)
  File "/usr/lib64/python2.4/sre.py", line 225, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/usr/lib64/python2.4/sre_compile.py", line 506, in compile
    raise AssertionError(
AssertionError: sorry, but this version only supports 100 named groups
</code></pre>
<p>sample data from combined (derived from stream B):</p>
<pre><code> ["('09:50:31.458370',)", **"('09:50:31.458370',)"**, "('09:50:48.343785',)", "('09:50:48.449219',)", "('09:50:48.449219',)", "('09:50:48.449219',)", "('09:50:48.449219',)", "('09:51:01.986971',)", "('09:51:01.986971',)", "('09:51:01.986971',)", "('09:51:34.543147',)", "('09:52:14.688349',)", "('09:52:14.688349',)", "('09:52:14.688349',)", "('09:52:14.688349',)", "('09:52:19.700134',)", "('09:53:06.696156',)", "('09:53:06.696156',)", "('09:53:06.696156',)", "('09:53:06.696156',)", "('09:53:06.696156',)", "('09:53:06.696156',)", "('09:53:06.696156',)", "('09:53:06.696156',)", "('09:54:39.295261',)", "('09:54:39.295261',)", "('09:54:44.883143',)", "('09:54:44.883143',)", "('09:54:44.883143',)", "('09:54:44.883143',)", "('09:55:17.750226',)", "('09:55:17.750226',)", "('09:55:17.750226',)", "('09:55:17.750226',)", "('09:55:17.750226',)", "('09:55:17.750226',)", "('09:55:17.750226',)", "('09:55:17.750226',)", "('09:55:17.750226',)", "('09:55:19.767099',)", "('09:55:26.750094',)", "('09:55:26.750094',)", "('09:55:29.195194',)", "('09:55:29.195194',)", "('09:55:29.195194',)", "('09:55:29.195194',)", "('09:55:29.195194',)", "('09:55:29.722747',)", 
"('09:56:38.809658',)", "('09:56:38.809658',)", "('09:57:38.444653',)", "('09:57:38.444653',)", "('09:57:38.444653',)", "('09:57:38.444653',)", "('09:57:38.444653',)", "('09:57:38.444653',)", "('09:57:38.444653',)", "('09:57:38.444653',)", "('09:57:38.444653',)", "('09:57:38.444653',)", "('09:57:38.444653',)", "('09:57:38.444653',)", "('09:57:38.444653',)", "('09:57:38.444653',)", "('09:57:38.444653',)", "('09:58:37.573746',)", "('09:58:37.573746',)", "('09:58:37.573746',)", "('09:59:02.185210',)", "('09:59:09.245981',)", "('09:59:33.619633',)", "('09:59:33.619633',)", "('09:59:33.619633',)", "('09:59:33.619633',)"] </code></pre> <p>sample data from logfile (stream A):</p> <pre><code>[09:49:52.515951] T,AAPL 130518C00450000,1,32.05 [09:49:53.568816] T,AAPL 130328P00455000,30,1.09 [09:49:53.811441] trade,S,2,AAPL 130328C00470000,4.75 [09:49:53.811447] trade,B,95,AAPL,468.69 -- [09:50:31.241441] T,AAPL 130328P00430000,3,0.08 [09:50:31.385327] T,AAPL 130328P00455000,5,1.10 [09:50:31.385911] T,AAPL 130328P00455000,5,1.10 [09:50:31.458370] trade,B,2,AAPL 130328C00475000,2.80 [09:50:31.458373] trade,S,68,AAPL,468.46 -- [09:50:48.339322] T,AAPL 130328C00485000,8,0.92 [09:50:48.339341] T,AAPL 130328C00485000,1,0.92 [09:50:48.339357] T,AAPL 130328C00485000,9,0.92 [09:50:48.343785] trade,B,2,AAPL 130328C00465000,7.05 [09:50:48.343789] trade,S,118,AAPL,468.19 </code></pre> <p>a match would be:</p> <pre><code>data A: [09:50:31.458370] trade,B,2,AAPL 130328C00475000,2.80 data B: [09:50:31.458370] </code></pre> <p>no match would be:</p> <pre><code>data A: [09:49:53.811441] trade,S,2,AAPL 130328C00470000,4.75 data B: #there is no timestamp from B which matches A </code></pre>
