
Search pandas series for value and split series at that value
Python 3.3.3, Pandas 0.12.0

I have a single-column .csv file with hundreds of float values separated by an arbitrary string (the string contains letters; edit: *and will vary run to run*). I'm a pandas beginner, hoping to find a way to load that .csv file and split the float values into two columns at the level of that string.

I'm so stuck at the first part (searching for the string) that I haven't yet been able to work on the second, which I thought should be much easier.

So far, I've been trying to use `raw = pandas.read_csv('myfile.csv', squeeze=True)`, then something like `raw.str.findall('[a-z]')`, but I'm not having much luck. I'd really appreciate it if someone could lend a hand. I'm planning to use this process on a number of similar .csv files, so I'd hope to find a fairly automated way of performing the task.

Example input.csv:

```
123.4932
239.348
912.098098989
49391.1093
....
This is a fake string that splits the data.
....
1323.4942
2445.34223
914432.4
495391.1093090
```

Desired eventual DataFrame:

```
Column A       Column B
123.4932       1323.4942
239.348        2445.34223
912.098098989  914432.4
49391.1093     495391.1093090
...            ...
```

Thanks again if you can point me in the right direction.

---

20131123 EDIT: Thank you for the responses thus far. Updated to reflect that the splitting string will not remain constant, hence my statement that I'd been trying to find a solution employing a regex `raw.str.findall('[a-z]')` instead of using `.contains`.

My solution at this point is to just read the .csv file line by line, accumulate runs of float-convertible values into lists, and load those into pandas.

```python
import pandas as pd

raw = open('myfile.csv', 'r').read().split('\n')

df = pd.DataFrame()
keeper = []
counter = 0

# Iterate through the rows. Consecutive rows that can be converted to
# float are accumulated; any row that fails conversion closes out the
# current column.
for row in raw:
    try:
        keeper.append(float(row))
    except ValueError:
        if keeper:
            df = pd.concat([df, pd.DataFrame(keeper, columns=[counter])], axis=1)
            counter += 1
        keeper = []

# Get the last column, assuming the file hasn't ended on a line
# that triggers the exception in the loop above.
if keeper:
    df = pd.concat([df, pd.DataFrame(keeper, columns=[counter])], axis=1)

df.describe()
```

Thank you for any further suggestions.

20180729 EDIT2: One other possible solution using [`itertools.groupby`](https://docs.python.org/3/library/itertools.html#itertools.groupby):

```python
import itertools
import re

import pandas as pd

txt = """123.4932
239.348
912.098098989
49391.1093
This is a fake string that splits the data.
1323.4942
2445.34223
914432.4
495391.1093090
fake again
31323.4942
42445.34223
2914432.4
5495391.1093090
23423432""".splitlines()

# Group consecutive lines by whether they fail to look like a plain
# number; the key is True for separator lines, False for data lines.
groups = itertools.groupby(
    txt,
    key=lambda x: not re.match(r'^[\d.]+$', x)
)

# Keep only the numeric groups and lay them out side by side.
df = pd.concat(
    (pd.Series(list(g)) for k, g in groups if not k),
    axis=1
)
print(df)
```
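
For completeness, here is a minimal sketch of a more vectorized alternative that stays inside pandas, not the approach described above. It assumes the input file is `myfile.csv` (the hypothetical name from the question) with one value or separator per line and no header row: `pd.to_numeric` with `errors='coerce'` turns every non-numeric separator row into NaN, and a cumulative sum over those NaN markers labels each block of floats so the blocks can be laid out side by side. Note that `.iloc[:, 0]` stands in for the `squeeze=True` used earlier, which newer pandas versions no longer accept.

```python
import pandas as pd

# A sketch under assumptions: 'myfile.csv' is a hypothetical filename,
# the file has no header row, and separator lines contain no commas.
raw = pd.read_csv('myfile.csv', header=None).iloc[:, 0]

# Separator rows become NaN; every NaN bumps the running block id,
# so each run of consecutive floats shares one id.
numeric = pd.to_numeric(raw, errors='coerce')
block_id = numeric.isna().cumsum()

# Drop the separator rows themselves and place one block per column.
# Empty blocks (e.g. two separator lines in a row) are skipped.
blocks = [
    g.dropna().reset_index(drop=True)
    for _, g in numeric.groupby(block_id)
]
df = pd.concat([b for b in blocks if not b.empty], axis=1, ignore_index=True)
print(df)
```

On the example input this should yield one column per run of floats, with shorter columns padded with NaN, which matches the shape the loop-based and `groupby` versions produce.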
 
