StackOverflow2013

plurals

POReading records spread across multiple input lines in Python
primarykey
Id
9673703
data
AcceptedAnswerId
9675149
AnswerCount
2
ClosedDate
CommentCount
5
CommunityOwnedDate
CreationDate
2012-03-12T19:42:12.050
FavoriteCount
0
LastActivityDate
2012-03-15T23:52:12.887
LastEditDate
2012-03-13T21:25:29.483
LastEditorUserId
95852
OwnerUserId
1264920
ParentId
0
PostTypeId
1
Score
2
ViewCount
2494
LastEditorDisplayName
text
Body
I have a highly unstructured file of text data with records that usually span multiple input lines.
<ul>
<li>Every record has its fields separated by spaces, as in normal text, so every field must be recognized by additional information rather than by a "csv field separator".</li>
<li>Many different records also share the first two fields, which are:
<ul>
<li>the day of the month (1 to 31);</li>
<li>the first three letters of the month.</li>
</ul></li>
<li>But I know that this "special" record with the day-of-month field and month-prefix field is followed by records related to the same "timestamp" (day/month) that do not contain that info.</li>
<li>I know for sure that the third field consists of unstructured sentences of many words, like "operation performed with this tool on that place for this reason".</li>
<li>I know that every record can have one or two numeric fields as its last fields.</li>
<li>I also know that every new record starts on a new line (both the first record of the day/month and the following records of the same day/month).</li>
</ul>
So, to summarize, every record should be transformed into a CSV record with a structure similar to this: DD,MM,Unstructured text bla bla bla,number1,number2
An example of the data is the following:
<pre><code>> 20 Sep This is the first record, bla bla bla 10.45
> Text unstructured
> of the second record bla bla
> 406.25 10001
> 6 Oct Text of the third record thatspans on many
> lines bla bla bla 60
> 28 Nov Fourth
> record
> 27.43
> Second record of the
> day/month BUT the fifth record of the file 500 90.25
</code></pre>
I developed the following parser in Python, but I cannot figure out how to read multiple lines of the input file and logically treat them as a single piece of information. I think I should use two nested loops, but I cannot work out how to handle the loop indexes. Thanks a lot for the help!
<pre><code># I need to deal with is_int() and is_float() functions to handle records with 2 numbers
# that must be separated by a csv_separator in the output record...

import sys

days_in_month = range(1, 32)  # 1 to 31 inclusive (range excludes its upper bound)
months_in_year = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
csv_separator = '|'

def is_month(s):
    return s in months_in_year

def is_day_in_month(n_int):
    try:
        return int(n_int) in days_in_month
    except ValueError:
        return False

#file_in = open('test1.txt','r')
file_in = open(sys.argv[1],'r')
#file_out = open("out_test1.txt", "w") # Use "a" instead of "w" to append to file
file_out = open(sys.argv[2], "w") # Use "a" instead of "w" to append to file

counter = 0
for line in file_in:
    counter = counter + 1
    line_arr = line.split()
    date_str = ''
    if not line_arr:
        # Blank line: nothing to parse
        continue
    if is_day_in_month(line_arr[0]):
        if len(line_arr) > 1 and is_month(line_arr[1]):
            # Date!
            num_month = months_in_year.index(line_arr[1]) + 1
            date_str = '%02d' % int(line_arr[0]) + '/' + '%02d' % num_month + '/' + '2011' + csv_separator
        elif len(line_arr) > 1:
            # No date, but the first field is a number from 1 to 31 (a valid day of the month)
            date_str = ' '.join(line_arr) + csv_separator
        else:
            # No date, and there is only a number from 1 to 31 (a valid day of the month)
            date_str = line_arr[0] + csv_separator
    else:
        # There is not a date (a generic string, or a number higher than 31)
        date_str = ' '.join(line_arr) + csv_separator
    # write() works the same way in Python 2 and Python 3
    file_out.write(date_str + csv_separator + 'line_number_' + str(counter) + '\n')

file_in.close()
file_out.close()
</code></pre>
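One possible way to group the lines, shown only as a minimal sketch and not as the accepted answer's method: accumulate tokens in a buffer until a line ends with a number, then emit one record. This relies on assumptions that go beyond what the question states: a record is complete as soon as a line ends with a numeric token (so free text must never end a line with a number), the input lines carry no leading ">" marker, and numbers are unsigned integers or decimals. The names parse_records, split_trailing_numbers, is_number, starts_with_date and convert are hypothetical helpers introduced for illustration.

```python
import csv
import re

MONTHS = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
NUMBER = re.compile(r'\d+(\.\d+)?')  # assumption: unsigned integer or decimal


def is_number(token):
    """True if the whole token is an unsigned integer or decimal."""
    return NUMBER.fullmatch(token) is not None


def starts_with_date(tokens):
    """True if the first two tokens look like '20 Sep'."""
    return (len(tokens) >= 2 and tokens[0].isdecimal()
            and 1 <= int(tokens[0]) <= 31 and tokens[1] in MONTHS)


def split_trailing_numbers(tokens, max_numbers=2):
    """Split tokens into (text tokens, up to max_numbers trailing numbers)."""
    cut = len(tokens)
    while (cut > 0 and len(tokens) - cut < max_numbers
           and is_number(tokens[cut - 1])):
        cut -= 1
    return tokens[:cut], tokens[cut:]


def parse_records(lines):
    """Yield [day, month, text, number1, number2] for each logical record."""
    day = month = ''   # last date seen; inherited by records without one
    pending = []       # tokens of the record currently being accumulated
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue
        # A date is only looked for at the start of a new record.
        if not pending and starts_with_date(tokens):
            day = '{:02d}'.format(int(tokens[0]))
            month = '{:02d}'.format(MONTHS.index(tokens[1]) + 1)
            tokens = tokens[2:]
        pending.extend(tokens)
        # Assumption: a line ending with a number closes the record.
        if pending and is_number(pending[-1]):
            text, numbers = split_trailing_numbers(pending)
            numbers += [''] * (2 - len(numbers))  # pad a missing number2
            yield [day, month, ' '.join(text)] + numbers
            pending = []


def convert(in_path, out_path):
    """Read the raw file and write one CSV row per logical record."""
    with open(in_path) as file_in, open(out_path, 'w', newline='') as file_out:
        csv.writer(file_out).writerows(parse_records(file_in))
```

A single loop with a buffer replaces the two nested loops, so no loop indexes are needed. The csv module quotes the text field when it contains the delimiter, which matters here because the first sample record contains a comma. Text left in the buffer at the end of the file without a closing number is silently dropped in this sketch.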
Tags
<python><parsing><csv><readline><formatter>
Title
Reading records spread across multiple input lines in Python
singulars
PostAcceptedAnswerId
1. PO
 singulars
 PostTypePostTypeId
 PTAnswer
PostParentId
1. This table or related slice is empty.
PostTypePostTypeId
1. PTQuestion
UserLastEditorUserId
1. USJohn Y
UserOwnerUserId
1. USTPPZ
plurals
PostLinksPostIdRelatedPostId
1. PL
 singulars
 LinkTypeLinkTypeId
 LTLinked
2. PL
 singulars
 LinkTypeLinkTypeId
 LTLinked
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. This table or related slice is empty.
PostsParentIdCreationDate
1. PO
 singulars
 PostTypePostTypeId
 PTAnswer
2. PO
 singulars
 PostTypePostTypeId
 PTAnswer
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 POReading records spread across multiple input lines in Python
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
2. VO
 singulars
 PostPostId
 POReading records spread across multiple input lines in Python
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
CommentsPostId
