
# Trying to parse text files in Python for data analysis
I do a lot of data analysis in Perl and I am trying to replicate this work in Python using pandas, numpy, matplotlib, etc.

The general workflow goes as follows:

1) Glob all the files in a directory.
2) Parse the files, because they have metadata.
3) Use regex to isolate relevant lines in a given file (they usually begin with a tag such as 'LOOPS').
4) Split the lines that match the tag and load the data into hashes.
5) Do some data analysis.
6) Make some plots.

Here is a sample of what I typically do in Perl:

```perl
print"Reading File:\n"; # gets data
foreach my $vol ($SmallV, $LargeV) {
  my $base_name = "${NF}flav_${vol}/BlockedWflow_low_${vol}_[0-9].[0-9]_-0.25_$Mass{$vol}.";
  my @files = <$base_name*>;                  # globs for file names
  foreach my $f (@files) {                    # loops through matching files
    print"... $f\n";
    my @split = split(/_/, $f);
    my $beta = $split[4];
    if (!grep{$_ eq $beta} @{$Beta{$vol}}) {  # constructs Beta hash
      push(@{$Beta{$vol}}, $split[4]);
    }
    open(IN, "<", "$f") or die "cannot open < $f: $!";  # reads in the file
    chomp(my @in = <IN>);
    close IN;
    my @lines = grep{$_=~/^LOOPS/} @in;       # greps for lines with the header LOOPS
    foreach my $l (@lines) {                  # loops through matched lines
      my @split = split(/\s+/, $l);           # splits matched lines
      push(@{$val{$vol}{$beta}{$split[1]}{$split[2]}{$split[4]}}, $split[6]); # reads data into hash
      if (!grep{$_ eq $split[1]} @smearingt) {   # fills the smearing time array
        push(@smearingt, $split[1]);
      }
      if (!grep{$_ eq $split[4]} @{$block{$vol}}) { # fills the number of blockings
        push(@{$block{$vol}}, $split[4]);
      }
    }
  }
  foreach my $beta (@{$Beta{$vol}}) {
    foreach my $loop (0,1,2,3,4) {            # loops over observables
      foreach my $b (@{$block{$vol}}) {       # beta values
        foreach my $t (@smearingt) {          # and smearing times
          $avg{$vol}{$beta}{$t}{$loop}{$b} = stat_mod::avg(@{$val{$vol}{$beta}{$t}{$loop}{$b}}); # to find statistics
          $err{$vol}{$beta}{$t}{$loop}{$b} = stat_mod::stdev(@{$val{$vol}{$beta}{$t}{$loop}{$b}});
        }
      }
    }
  }
}
print"File Read in Complete!\n";
```

My hope is to load this data into a hierarchically indexed data structure, with the indices of the Perl hash becoming indices of my Python data structure. Every example of pandas data structures I have come across so far has been highly contrived: the whole structure (indices and values) was assigned manually in one command and then manipulated to demonstrate all the features of the data structure. Unfortunately I cannot assign the data all at once, because I don't know in advance which masses, betas, sizes, etc. are in the data that is going to be analyzed. Am I doing this the wrong way? Does anyone know a better way of doing this? The data files are immutable; I will have to parse through them using regex, which I understand how to do. What I need help with is putting the data into an appropriate data structure so that I can take averages, standard deviations, perform mathematical operations, and plot the data.
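For concreteness, here is a minimal sketch of one way the same accumulation could look in pandas: collect one flat record per matched line and let `groupby` do the averaging at the end. The directory pattern, the flavour/mass values, and the column names are assumptions lifted from the Perl example above, not a tested layout.

```python
# Minimal sketch (assumptions: directory layout, filename fields and the
# LOOPS column positions are copied from the Perl example above).
import glob
import pandas as pd

records = []
for vol in (1224, 2448):                           # the two lattice volumes
    pattern = '4flav_%d/BlockedWflow_low_%d_*_-0.25_0.0.*' % (vol, vol)
    for filename in glob.glob(pattern):
        beta = float(filename.split('_')[4])       # same field as $split[4]
        with open(filename) as f:
            for line in f:
                if not line.startswith('LOOPS'):
                    continue
                fields = line.split()
                records.append({'volume':     vol,
                                'beta':       beta,
                                't':          int(fields[1]),     # $split[1]
                                'observable': int(fields[2]),     # $split[2]
                                'block':      int(fields[4]),     # $split[4]
                                'value':      float(fields[6])})  # $split[6]

df = pd.DataFrame(records)
# groupby over the key columns replaces the nested avg/stdev loops; the result
# is a hierarchically indexed frame with 'mean' and 'std' columns.
stats = df.groupby(['volume', 'beta', 't', 'observable', 'block'])['value'].agg(['mean', 'std'])
```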
Typical data has a header that is an unknown number of lines long, but the stuff I care about looks like this:

```
Alpha 0.5 0.5 0.4
Alpha 0.5 0.5 0.4
LOOPS 0 0 0 2 0.5 1.7800178
LOOPS 0 1 0 2 0.5 0.84488326
LOOPS 0 2 0 2 0.5 0.98365135
LOOPS 0 3 0 2 0.5 1.1638834
LOOPS 0 4 0 2 0.5 1.0438407
LOOPS 0 5 0 2 0.5 0.19081102
POLYA NHYP 0 2 0.5 -0.0200002 0.119196 -0.0788721 -0.170488
BLOCKING COMPLETED
Blocking time 1.474 seconds
WFLOW 0.01 1.57689 2.30146 0.000230146 0.000230146 0.00170773 -0.0336667
WFLOW 0.02 1.66552 2.28275 0.000913101 0.00136591 0.00640552 -0.0271222
WFLOW 0.03 1.75 2.25841 0.00203257 0.00335839 0.0135 -0.0205722
WFLOW 0.04 1.83017 2.22891 0.00356625 0.00613473 0.0224607 -0.0141664
WFLOW 0.05 1.90594 2.19478 0.00548695 0.00960351 0.0328218 -0.00803792
WFLOW 0.06 1.9773 2.15659 0.00776372 0.0136606 0.0441807 -0.00229793
WFLOW 0.07 2.0443 2.1149 0.010363 0.018195 0.0561953 0.00296648
```

What I think I want (I say "think" because I am new to Python and an expert may know a better data structure) is a hierarchically indexed Series that would look like this:

```
volume  mass  beta  observable  t  value
1224    0.0   5.6   0           0  1.234
                                1  1.490
                                2  1.222
                    1           0  1.234
                                1  1.234
2448    0.0   5.7   0           1  1.234
```

and so on, as described here: http://pandas.pydata.org/pandas-docs/dev/indexing.html#indexing-hierarchical

For those of you who don't understand the Perl, the meat and potatoes of what I need is this:

```perl
push(@{$val{$vol}{$beta}{$split[1]}{$split[2]}{$split[4]}}, $split[6]); # reads data into hash
```

What I have here is a hash called 'val'. This is a hash of arrays; I believe in Python speak this would be a dict of lists. Each thing that looks like '{$something}' is a key in the hash 'val', and I am appending the value stored in the variable $split[6] to the end of the array that is the hash element specified by all 5 keys. This is the fundamental issue with my data: there are a lot of keys for each quantity that I am interested in.
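In Python terms that push line is an append into a dict of lists keyed by a tuple; a self-contained toy version (the volume/beta values are made up, and `line` is one row of the sample data above) could look like this:

```python
# Toy dict-of-lists analogue of the Perl push(...) line; the volume and beta
# values are hypothetical, and 'line' is one LOOPS row from the sample above.
from collections import defaultdict

val = defaultdict(list)

vol, beta = 1224, 5.6                      # hypothetical keys for illustration
line = 'LOOPS 0 0 0 2 0.5 1.7800178'
fields = line.split()
# same five keys as $val{$vol}{$beta}{$split[1]}{$split[2]}{$split[4]}
val[(vol, beta, fields[1], fields[2], fields[4])].append(float(fields[6]))
```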
---

# UPDATE

I have come up with the following code, which results in this error:

```
Traceback (most recent call last):
  File "wflow_2lattice_matching.py", line 39, in <module>
    index = MultiIndex.from_tuples(zipped, names=['volume', 'beta', 'montecarlo_time, smearing_time'])
NameError: name 'MultiIndex' is not defined
```

Code:

```python
#!/usr/bin/python
from pandas import Series, DataFrame
import pandas as pd
import glob
import re
import numpy

flavor = 4
mass = 0.0

vol = []
b = []
m_t = []
w_t = []
val = []

#tup_vol = (1224, 1632, 2448)
tup_vol = 1224, 1632
for v in tup_vol:
    filelist = glob.glob(str(flavor)+'flav_'+str(v)+'/BlockedWflow_low_'+str(v)+'_*_0.0.*')
    for filename in filelist:
        print 'Reading filename: '+filename
        f = open(filename, 'r')
        junk, start, vv, beta, junk, mass, mont_t = re.split('_', filename)
        ftext = f.readlines()
        for line in ftext:
            if re.match('^WFLOW.*', line):
                line = line.strip()
                junk, smear_t, junk, junk, wilson_flow, junk, junk, junk = re.split('\s+', line)
                vol.append(v)
                b.append(beta)
                m_t.append(mont_t)
                w_t.append(smear_t)
                val.append(wilson_flow)

zipped = zip(vol, beta, m_t, w_t)
index = MultiIndex.from_tuples(zipped, names=['volume', 'beta', 'montecarlo_time, smearing_time'])
data = Series(val, index=index)
```
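The NameError presumably comes from the imports: only `Series` and `DataFrame` are imported from pandas, so `MultiIndex` has to be reached through the `pd` namespace (or imported by name). A minimal, self-contained sketch of that index construction, with made-up tuples:

```python
# Minimal sketch: qualify MultiIndex through the pandas namespace.  The tuples
# below are made up for illustration; note also that the four level names
# need to be four separate strings.
import pandas as pd
from pandas import Series

zipped = [(1224, 5.6, 100, 0.01),
          (1224, 5.6, 100, 0.02)]
index = pd.MultiIndex.from_tuples(zipped,
                                  names=['volume', 'beta',
                                         'montecarlo_time', 'smearing_time'])
data = Series([1.57689, 1.66552], index=index)
```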