Reading records spread across multiple input lines in Python
<p>I have a highly unstructured text file whose records usually span multiple input lines.</p>
<ul>
<li>The <strong>fields of a record are separated by spaces</strong>, as in ordinary text, so each field must be recognized from context rather than by a CSV field separator.</li>
<li><strong>Many records begin with the same two fields</strong>:
<ul>
<li>the day of the month (1 to 31);</li>
<li>the first three letters of the month name.</li>
</ul></li>
<li>A record that carries the day and month fields is <strong>followed by records</strong> belonging to the same "timestamp" (<strong>day/month</strong>) that <strong>do not repeat that information</strong>.</li>
<li>The <strong>third field is an unstructured</strong> sentence of many words, such as "operation performed with this tool on that place for this reason".</li>
<li>Every record ends with <strong>one or two numeric fields</strong>.</li>
<li><strong>Every new record starts on a new line</strong> (both the first record of a day/month and the following records of the same day/month).</li>
</ul>
<p>To summarize, every record should be transformed into a CSV record with this structure: DD,MM,Unstructured text bla bla bla,number1,number2</p>
<p>An example of the data is the following:</p>
<pre><code>&gt; 20 Sep This is the first record, bla bla bla 10.45
&gt; Text unstructured
&gt; of the second record bla bla
&gt; 406.25 10001
&gt; 6 Oct Text of the third record that spans many
&gt; lines bla bla bla 60
&gt; 28 Nov Fourth
&gt; record
&gt; 27.43
&gt; Second record of the
&gt; day/month BUT the fifth record of the file 500 90.25
</code></pre>
<p>I developed the following parser in Python, but I cannot figure out how to read multiple lines of the input file and treat them logically as a single piece of information. I think I should use two nested loops, but I cannot work out how to handle the loop indexes.</p>
<p>Thanks a lot for the help!</p>
<pre><code># I need to deal with is_int() and is_float() functions to handle records with 2 numbers
# that must be separated by a csv_separator in the output record...

import sys

days_in_month = range(1, 32)  # 1..31 inclusive
months_in_year = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                  'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
csv_separator = '|'

def is_month(s):
    return s in months_in_year

def is_day_in_month(n_int):
    try:
        return int(n_int) in days_in_month
    except ValueError:
        return False

# Usage: python parser.py input_file output_file
# Open the output with "a" instead of "w" to append to the file.
with open(sys.argv[1], 'r') as file_in, open(sys.argv[2], 'w') as file_out:
    counter = 0
    for line in file_in:
        counter = counter + 1
        line_arr = line.split()
        if not line_arr:
            # Skip blank lines (line_arr[0] would raise IndexError)
            continue
        date_str = ''
        if is_day_in_month(line_arr[0]):
            if len(line_arr) &gt; 1 and is_month(line_arr[1]):
                # Date!
                num_month = months_in_year.index(line_arr[1]) + 1
                date_str = '%02d' % int(line_arr[0]) + '/' + '%02d' % num_month + '/' + '2011' + csv_separator
            elif len(line_arr) &gt; 1:
                # No date, but the first number is at most 31 (number of days in a month)
                date_str = ' '.join(line_arr) + csv_separator
            else:
                # No date, and there is only a number that is at most 31
                date_str = line_arr[0] + csv_separator
        else:
            # There is no date (a generic string, or a number higher than 31)
            date_str = ' '.join(line_arr) + csv_separator
        print(date_str + csv_separator + 'line_number_' + str(counter), file=file_out)
</code></pre>
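<p>The record rules listed above can be sketched with a single loop and an accumulator instead of nested loops. This is only a minimal sketch under stated assumptions, not a definitive parser: it assumes a record ends on the first line whose last token is numeric, that the unstructured text never ends a line with a number before the record is complete, that the file starts with a dated record, and that the leading &gt; markers in the sample are quoting artifacts. The names <code>parse_records</code>, <code>to_row</code>, and <code>SAMPLE</code> are hypothetical.</p>

```python
import csv
import re
import sys

MONTHS = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
# Unsigned integer or decimal number, e.g. "60" or "10.45"
NUMBER = re.compile(r'\d+(?:\.\d+)?')

def is_number(token):
    return NUMBER.fullmatch(token) is not None

def parse_records(lines):
    """Yield (day, month, text, numbers) for each logical record."""
    day = month = None
    tokens = []  # tokens collected so far for the record being built
    for raw in lines:
        fields = raw.lstrip('> ').split()  # drop assumed quoting markers
        if not fields:
            continue
        # A date is only looked for on the first line of a record.
        if (not tokens and len(fields) > 1 and fields[0].isdigit()
                and 1 <= int(fields[0]) <= 31 and fields[1] in MONTHS):
            day, month = int(fields[0]), MONTHS.index(fields[1]) + 1
            fields = fields[2:]
        tokens.extend(fields)
        # Assumption: a trailing numeric token closes the record.
        if tokens and is_number(tokens[-1]):
            count = 2 if len(tokens) > 1 and is_number(tokens[-2]) else 1
            yield day, month, ' '.join(tokens[:-count]), tokens[-count:]
            tokens = []

def to_row(record):
    """Format one record as DD, MM, text, number1[, number2]."""
    day, month, text, numbers = record
    return ['%02d' % day, '%02d' % month, text] + numbers

SAMPLE = [
    '> 20 Sep This is the first record, bla bla bla 10.45',
    '> Text unstructured',
    '> of the second record bla bla',
    '> 406.25 10001',
    '> 6 Oct Text of the third record that spans many',
    '> lines bla bla bla 60',
    '> 28 Nov Fourth',
    '> record',
    '> 27.43',
    '> Second record of the',
    '> day/month BUT the fifth record of the file 500 90.25',
]

writer = csv.writer(sys.stdout)
for record in parse_records(SAMPLE):
    writer.writerow(to_row(record))
```

<p>The <code>csv</code> module is used for output because the unstructured text may itself contain commas, which need quoting in a comma-separated file.</p>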