If only the order of the fields may vary, it is possible to check each line and automatically adapt the extraction of the information to the detected order. I think it would be easy to do so with the help of a regex.

If not only the order but also the number and nature of the fields may vary, I think it would still be possible to do the same, on the condition that the possible fields are known in advance.

The common condition is that the fields must have "personalities" strong enough to be easily distinguishable.

Without more precise information, nobody can go further, IMO.

### Monday, 15 August 9:39 GMT+0:00

It seems there is an error in *spilp.py*: it must be

`with codecs.open(file_path, 'r', encoding='utf-8', errors='ignore') as log_lines:`

not

`with open(file_path, 'r', encoding='utf-8', errors='ignore') as log_lines:`

The latter uses the builtin **open()**, which doesn't accept the keyword arguments in question.

### Monday, 15 August 16:10 GMT+0:00

Presently, in the sample file, the fields are in this order:

> date
> time
> s-sitename
> s-ip
> cs-method
> cs-uri-stem
> cs-uri-query
> s-port
> cs-username
> c-ip
> cs(User-Agent)
> sc-status
> sc-substatus
> sc-win32-status

**Suppose** you want to extract the values of each line in the following order:

> s-port
> time
> date
> s-sitename
> s-ip
> cs(User-Agent)
> sc-status
> sc-substatus
> sc-win32-status
> c-ip
> cs-username
> cs-method
> cs-uri-stem
> cs-uri-query

to assign them to the following identifiers, in the same order:

> s_port
> time
> date
> s_sitename
> s_ip
> cs_user_agent
> sc_status
> sc_substatus
> sc_win32_status
> c_ip
> cs_username
> cs_method
> cs_uri_stem
> cs_uri_query

doing

```python
s_port, time, date, s_sitename, s_ip, cs_user_agent, sc_status, sc_substatus, sc_win32_status, c_ip, cs_username, cs_method, cs_uri_stem, cs_uri_query = line_spliter(line)
```

with a function **line_spliter()**.

I know, I know: what you want is the contrary, to restore the values read from a file to the order they presently have in the sample file, in case a given file uses a different order than the generic present one.

But I take this only as an example, so as to leave the sample file as is. Otherwise I would have to create another file with a different order of values just to present an example.

Anyway, the algorithm doesn't depend on the example. It depends on the desired order in which the succession of values must be obtained to make a correct assignment.
In my code, this desired order is set with the object **ref_fields**.

I think my code and its execution speak for themselves and make the principle clear.

```python
import re

ref_fields = ['s-port', 'time', 'date', 's-sitename', 's-ip', 'cs(User-Agent)',
              'sc-status', 'sc-substatus', 'sc-win32-status', 'c-ip',
              'cs-username', 'cs-method', 'cs-uri-stem', 'cs-uri-query']

print 'REF_FIELDS :\n------------\n%s\n' % '\n'.join(ref_fields)

############################################
file_path = 'I:\\sample[1].log'  # Path to put here
############################################

with open(file_path, 'r') as log_lines:

    line = ''
    while line[0:8] != '#Fields:':
        line = next(log_lines)
    # At this point, line is the line containing the fields keywords
    print 'line of the fields keywords:\n----------------------------\n%r\n' % line

    found_fields = line.split()[1:]
    len_found_fields = len(found_fields)
    regex_extractor = re.compile('[ \t]+'.join(len_found_fields * ['([^ \t]+)']))

    print 'list found_fields of keywords in the file:\n------------------------------------------\n%s\n' % found_fields
    print '\nfound_fields == ref_fields is ', found_fields == ref_fields

    if found_fields == ref_fields:
        print '\nNORMAL ORDER\n------------'
        def line_spliter(line):
            return line.split()
    else:
        the_order = [found_fields.index(field) + 1 for field in ref_fields]
        # the_order is the list of indexes localizing the elements of ref_fields
        # in the order in which they succeed in the actual line of found fields keywords
        print '\nSPECIAL ORDER\n-------------\nthe_order == %s\n\n\n======================' % the_order
        def line_spliter(line):
            return regex_extractor.match(line).group(*the_order)

    for i in xrange(1):
        line = next(log_lines)

        (s_port, time, date, s_sitename, s_ip, cs_user_agent,
         sc_status, sc_substatus, sc_win32_status,
         c_ip, cs_username, cs_method, cs_uri_stem, cs_uri_query) = line_spliter(line)

        print ('LINE :\n------\n'
               '%s\n'
               'SPLIT LINE :\n--------------\n'
               '%s\n\n'
               'REORDERED SPLIT LINE :\n-------------------------\n'
               '%s\n\n'
               'EXAMPLE OF SOME CORRECT BINDINGS OBTAINED :\n-------------------------------------------\n'
               'date == %s\n'
               'time == %s\n'
               's_port == %s\n'
               'c_ip == %s\n\n'
               '======================') % (line, '\n'.join(line.split()), line_spliter(line),
                                            date, time, s_port, c_ip)


# ---- split each logline into multiple variables, populate dictionaries and db ----
# def splitLogline(log_line): # needs to be dynamic (for different logging setups)
s_port, time, date, s_sitename, s_ip, cs_user_agent, sc_status, sc_substatus, sc_win32_status, c_ip, cs_username, cs_method, cs_uri_stem, cs_uri_query = line_spliter(line)
```

result

```
REF_FIELDS :
------------
s-port
time
date
s-sitename
s-ip
cs(User-Agent)
sc-status
sc-substatus
sc-win32-status
c-ip
cs-username
cs-method
cs-uri-stem
cs-uri-query

line of the fields keywords:
----------------------------
'#Fields: date time s-sitename s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) sc-status sc-substatus sc-win32-status \n'

list found_fields of keywords in the file:
------------------------------------------
['date', 'time', 's-sitename', 's-ip', 'cs-method', 'cs-uri-stem', 'cs-uri-query', 's-port', 'cs-username', 'c-ip', 'cs(User-Agent)', 'sc-status', 'sc-substatus', 'sc-win32-status']


found_fields == ref_fields is  False

SPECIAL ORDER
-------------
the_order == [8, 2, 1, 3, 4, 11, 12, 13, 14, 10, 9, 5, 6, 7]


======================
LINE :
------
2010-01-01 00:00:03 SITENAME 192.168.1.1 GET /news-views.aspx - 80 - 66.249.72.135 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) 200 0 0

SPLIT LINE :
--------------
2010-01-01
00:00:03
SITENAME
192.168.1.1
GET
/news-views.aspx
-
80
-
66.249.72.135
Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html)
200
0
0

REORDERED SPLIT LINE :
-------------------------
('80', '00:00:03', '2010-01-01', 'SITENAME', '192.168.1.1', 'Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html)', '200', '0', '0\n', '66.249.72.135', '-', 'GET', '/news-views.aspx', '-')

EXAMPLE OF SOME CORRECT BINDINGS OBTAINED :
-------------------------------------------
date == 2010-01-01
time == 00:00:03
s_port == 80
c_ip == 66.249.72.135

======================
```

This code applies only to the case where the fields in a file are shuffled but present in the same number as in the normal, known list of fields.

Other cases may happen, for example fewer values in a file than there are known and expected fields. If you need more help for these other cases, explain which cases may happen and I'll try to adapt the code (a first sketch of the fewer-fields case is given below).

I think I will have many remarks on the code I rapidly read in *spilp.py*. I'll write them when I have time.
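To illustrate the kind of adaptation I have in mind for the fewer-fields case, here is a minimal sketch, assuming the file simply declares a subset of **ref_fields** in its `#Fields:` line; the absent fields are bound to `None`. The `found_fields` list and the example line below are hypothetical, only for illustration:

```python
import re

ref_fields = ['s-port', 'time', 'date', 's-sitename', 's-ip', 'cs(User-Agent)',
              'sc-status', 'sc-substatus', 'sc-win32-status', 'c-ip',
              'cs-username', 'cs-method', 'cs-uri-stem', 'cs-uri-query']

# Hypothetical case: the file declares only a subset of the reference fields.
# In reality this list would be read from the '#Fields:' line, as above.
found_fields = ['date', 'time', 's-ip', 'cs-method', 'cs-uri-stem',
                'sc-status', 'c-ip']

regex_extractor = re.compile('[ \t]+'.join(len(found_fields) * ['([^ \t]+)']))

def line_spliter(line):
    # Map each found field to its captured value, then return the values
    # in the ref_fields order, with None for fields absent from this file
    values = dict(zip(found_fields, regex_extractor.match(line).groups()))
    return tuple(values.get(field) for field in ref_fields)

# Example line matching the hypothetical reduced field list
line = '2010-01-01 00:00:03 192.168.1.1 GET /news-views.aspx 200 66.249.72.135'

(s_port, time, date, s_sitename, s_ip, cs_user_agent,
 sc_status, sc_substatus, sc_win32_status,
 c_ip, cs_username, cs_method, cs_uri_stem, cs_uri_query) = line_spliter(line)

print 'date == %s' % date        # 2010-01-01
print 's_port == %s' % s_port    # None (not present in this file)
print 'c_ip == %s' % c_ip        # 66.249.72.135
```

The principle stays the same as above: the regex is built from however many fields were actually found, and only the reassignment step changes.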
 
