
# Trying to parse text files in Python for data analysis
I do a lot of data analysis in Perl and I am trying to replicate this work in Python using pandas, numpy, matplotlib, etc.

The general workflow goes as follows:

1) Glob all the files in a directory.
2) Parse the files, because they have metadata.
3) Use regex to isolate relevant lines in a given file (they usually begin with a tag such as 'LOOPS').
4) Split the lines that match the tag and load the data into hashes.
5) Do some data analysis.
6) Make some plots.

Here is a sample of what I typically do in Perl:

```perl
print"Reading File:\n"; # gets data
foreach my $vol ($SmallV, $LargeV) {
  my $base_name = "${NF}flav_${vol}/BlockedWflow_low_${vol}_[0-9].[0-9]_-0.25_$Mass{$vol}.";
  my @files = <$base_name*>;                  # globs for file names
  foreach my $f (@files) {                    # loops through matching files
    print"... $f\n";
    my @split = split(/_/, $f);
    my $beta = $split[4];
    if (!grep{$_ eq $beta} @{$Beta{$vol}}) {  # constructs Beta hash
      push(@{$Beta{$vol}}, $split[4]);
    }
    open(IN, "<", "$f") or die "cannot open < $f: $!";  # reads in the file
    chomp(my @in = <IN>);
    close IN;
    my @lines = grep{$_=~/^LOOPS/} @in;       # greps for lines with the header LOOPS
    foreach my $l (@lines) {                  # loops through matched lines
      my @split = split(/\s+/, $l);           # splits matched lines
      push(@{$val{$vol}{$beta}{$split[1]}{$split[2]}{$split[4]}}, $split[6]); # reads data into hash
      if (!grep{$_ eq $split[1]} @smearingt) {   # fills the smearing time array
        push(@smearingt, $split[1]);
      }
      if (!grep{$_ eq $split[4]} @{$block{$vol}}) { # fills the number of blockings
        push(@{$block{$vol}}, $split[4]);
      }
    }
  }
  foreach my $beta (@{$Beta{$vol}}) {
    foreach my $loop (0,1,2,3,4) {            # loops over observables
      foreach my $b (@{$block{$vol}}) {       # beta values
        foreach my $t (@smearingt) {          # and smearing times
          $avg{$vol}{$beta}{$t}{$loop}{$b} = stat_mod::avg(@{$val{$vol}{$beta}{$t}{$loop}{$b}}); # to find statistics
          $err{$vol}{$beta}{$t}{$loop}{$b} = stat_mod::stdev(@{$val{$vol}{$beta}{$t}{$loop}{$b}});
        }
      }
    }
  }
}
print"File Read in Complete!\n";
```

My hope is to load this data into a hierarchically indexed data structure, with the indices of the Perl hash becoming indices of my Python data structure. Every example of pandas data structures I have come across so far has been highly contrived: the whole structure (indices and values) was assigned manually in one command and then manipulated to demonstrate all the features of the data structure. Unfortunately I cannot assign the data all at once, because I don't know in advance which masses, betas, sizes, etc. are in the data that is going to be analyzed. Am I doing this the wrong way? Does anyone know a better way of doing this? The data files are immutable; I will have to parse through them using regex, which I understand how to do. What I need help with is putting the data into an appropriate data structure so that I can take averages, standard deviations, perform mathematical operations, and plot the data.
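For concreteness, here is a minimal sketch of one way the same accumulation could look in pandas: collect one flat record per matched line and let `groupby` do the averaging at the end. The directory pattern, the flavour/mass values, and the column names are assumptions lifted from the Perl example above, not a tested layout.

```python
# Minimal sketch (assumptions: directory layout, filename fields and the
# LOOPS column positions are copied from the Perl example above).
import glob
import pandas as pd

records = []
for vol in (1224, 2448):                           # the two lattice volumes
    pattern = '4flav_%d/BlockedWflow_low_%d_*_-0.25_0.0.*' % (vol, vol)
    for filename in glob.glob(pattern):
        beta = float(filename.split('_')[4])       # same field as $split[4]
        with open(filename) as f:
            for line in f:
                if not line.startswith('LOOPS'):
                    continue
                fields = line.split()
                records.append({'volume':     vol,
                                'beta':       beta,
                                't':          int(fields[1]),     # $split[1]
                                'observable': int(fields[2]),     # $split[2]
                                'block':      int(fields[4]),     # $split[4]
                                'value':      float(fields[6])})  # $split[6]

df = pd.DataFrame(records)
# groupby over the key columns replaces the nested avg/stdev loops; the result
# is a hierarchically indexed frame with 'mean' and 'std' columns.
stats = df.groupby(['volume', 'beta', 't', 'observable', 'block'])['value'].agg(['mean', 'std'])
```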
Typical data has a header that is an unknown number of lines long, but the stuff I care about looks like this:

```
Alpha 0.5 0.5 0.4
Alpha 0.5 0.5 0.4
LOOPS 0 0 0 2 0.5 1.7800178
LOOPS 0 1 0 2 0.5 0.84488326
LOOPS 0 2 0 2 0.5 0.98365135
LOOPS 0 3 0 2 0.5 1.1638834
LOOPS 0 4 0 2 0.5 1.0438407
LOOPS 0 5 0 2 0.5 0.19081102
POLYA NHYP 0 2 0.5 -0.0200002 0.119196 -0.0788721 -0.170488
BLOCKING COMPLETED
Blocking time 1.474 seconds
WFLOW 0.01 1.57689 2.30146 0.000230146 0.000230146 0.00170773 -0.0336667
WFLOW 0.02 1.66552 2.28275 0.000913101 0.00136591 0.00640552 -0.0271222
WFLOW 0.03 1.75 2.25841 0.00203257 0.00335839 0.0135 -0.0205722
WFLOW 0.04 1.83017 2.22891 0.00356625 0.00613473 0.0224607 -0.0141664
WFLOW 0.05 1.90594 2.19478 0.00548695 0.00960351 0.0328218 -0.00803792
WFLOW 0.06 1.9773 2.15659 0.00776372 0.0136606 0.0441807 -0.00229793
WFLOW 0.07 2.0443 2.1149 0.010363 0.018195 0.0561953 0.00296648
```

What I think I want (I say "think" because I am new to Python and an expert may know a better data structure) is a hierarchically indexed Series that would look like this:

```
volume  mass  beta  observable  t  value
1224    0.0   5.6   0           0  1.234
                                1  1.490
                                2  1.222
                    1           0  1.234
                                1  1.234
2448    0.0   5.7   0           1  1.234
```

and so on, as described here: http://pandas.pydata.org/pandas-docs/dev/indexing.html#indexing-hierarchical

For those of you who don't understand the Perl, the meat and potatoes of what I need is this:

```perl
push(@{$val{$vol}{$beta}{$split[1]}{$split[2]}{$split[4]}}, $split[6]); # reads data into hash
```

What I have here is a hash called 'val'. This is a hash of arrays; I believe in Python speak this would be a dict of lists. Each thing that looks like '{$something}' is a key in the hash 'val', and I am appending the value stored in the variable $split[6] to the end of the array that is the hash element specified by all 5 keys. This is the fundamental issue with my data: there are a lot of keys for each quantity that I am interested in.
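In Python terms that push line is an append into a dict of lists keyed by a tuple; a self-contained toy version (the volume/beta values are made up, and `line` is one row of the sample data above) could look like this:

```python
# Toy dict-of-lists analogue of the Perl push(...) line; the volume and beta
# values are hypothetical, and 'line' is one LOOPS row from the sample above.
from collections import defaultdict

val = defaultdict(list)

vol, beta = 1224, 5.6                      # hypothetical keys for illustration
line = 'LOOPS 0 0 0 2 0.5 1.7800178'
fields = line.split()
# same five keys as $val{$vol}{$beta}{$split[1]}{$split[2]}{$split[4]}
val[(vol, beta, fields[1], fields[2], fields[4])].append(float(fields[6]))
```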
---

# UPDATE

I have come up with the following code, which results in this error:

```
Traceback (most recent call last):
  File "wflow_2lattice_matching.py", line 39, in <module>
    index = MultiIndex.from_tuples(zipped, names=['volume', 'beta', 'montecarlo_time, smearing_time'])
NameError: name 'MultiIndex' is not defined
```

Code:

```python
#!/usr/bin/python
from pandas import Series, DataFrame
import pandas as pd
import glob
import re
import numpy

flavor = 4
mass = 0.0

vol = []
b = []
m_t = []
w_t = []
val = []

#tup_vol = (1224, 1632, 2448)
tup_vol = 1224, 1632
for v in tup_vol:
    filelist = glob.glob(str(flavor)+'flav_'+str(v)+'/BlockedWflow_low_'+str(v)+'_*_0.0.*')
    for filename in filelist:
        print 'Reading filename: '+filename
        f = open(filename, 'r')
        junk, start, vv, beta, junk, mass, mont_t = re.split('_', filename)
        ftext = f.readlines()
        for line in ftext:
            if re.match('^WFLOW.*', line):
                line = line.strip()
                junk, smear_t, junk, junk, wilson_flow, junk, junk, junk = re.split('\s+', line)
                vol.append(v)
                b.append(beta)
                m_t.append(mont_t)
                w_t.append(smear_t)
                val.append(wilson_flow)

zipped = zip(vol, beta, m_t, w_t)
index = MultiIndex.from_tuples(zipped, names=['volume', 'beta', 'montecarlo_time, smearing_time'])
data = Series(val, index=index)
```
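The NameError presumably comes from the imports: only `Series` and `DataFrame` are imported from pandas, so `MultiIndex` has to be reached through the `pd` namespace (or imported by name). A minimal, self-contained sketch of that index construction, with made-up tuples:

```python
# Minimal sketch: qualify MultiIndex through the pandas namespace.  The tuples
# below are made up for illustration; note also that the four level names
# need to be four separate strings.
import pandas as pd
from pandas import Series

zipped = [(1224, 5.6, 100, 0.01),
          (1224, 5.6, 100, 0.02)]
index = pd.MultiIndex.from_tuples(zipped,
                                  names=['volume', 'beta',
                                         'montecarlo_time', 'smearing_time'])
data = Series([1.57689, 1.66552], index=index)
```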