
Pandas DataFrame + object type + HDF + PyTables 'table'

(Editing to clarify my application, sorry for any confusion.)

I run an experiment broken up into trials. Each trial can produce invalid data or valid data. When the data are valid, they take the form of a list of numbers, which can be of zero length.

So an invalid trial produces `None`, and a valid trial can produce `[]` or `[1,2]`, etc.

Ideally, I'd like to be able to save this data as a `frame_table` (call it `data`). I have another table (call it `trials`) that is easily converted into a `frame_table` and which I use as a `selector` to extract rows (trials). I would then like to pull up my data using `select_as_multiple`.

Right now, I'm saving the `data` structure as a regular table, since I'm using an `object` array. I realize folks are saying this is inefficient, but I can't think of an efficient way to handle the variable-length nature of `data`.

I understand that I can use NaNs and make a (potentially very wide) table whose maximum width is the maximum length of my data array, but then I need a different mechanism to flag invalid trials. A row of all NaNs is ambiguous: does it mean I had a zero-length data trial, or an invalid trial?

I think there is no good solution to this using Pandas. The NaN solution leads to potentially extremely wide tables, plus an additional column marking valid/invalid trials.

If I used a database, I would make `data` a binary blob column. With Pandas, my current working solution is to save `data` as an `object` array in a regular frame, load it all in, and then pull out the relevant indexes based on my `trials` table.

This is slightly inefficient, since I'm reading my whole `data` table in one go, but it's the most workable/extendable scheme I have come up with.

But I welcome most enthusiastically a more canonical solution.

Thanks so much for all your time!

EDIT: Adding code (Jeff's suggestion)

```
import pandas as pd, numpy

mydata = [numpy.empty(n) for n in range(1, 11)]
df = pd.DataFrame(mydata)

In [4]: df
Out[4]:
                                                   0
0                               [1.28822975392e-231]
1           [1.28822975392e-231, -2.31584192385e+77]
2  [1.28822975392e-231, -1.49166823584e-154, 2.12...
3  [1.28822975392e-231, 1.2882298313e-231, 2.1259...
4  [1.28822975392e-231, 1.72723381477e-77, 2.1259...
5  [1.28822975392e-231, 1.49166823584e-154, 1.531...
6  [1.28822975392e-231, -2.68156174706e+154, 2.20...
7  [1.28822975392e-231, -2.68156174706e+154, 2.13...
8  [1.28822975392e-231, -1.3365130604e-315, 2.222...
9  [1.28822975392e-231, -1.33651054067e-315, 2.22...

In [5]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 0 to 9
Data columns (total 1 columns):
0    10 non-null values
dtypes: object(1)

df.to_hdf('test.h5', 'data')                 --> OK
df.to_hdf('test.h5', 'data1', table=True)    --> ...
TypeError: Cannot serialize the column [0]
because its data contents are [mixed] object dtype
```
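One table-friendly layout worth sketching (my own illustration, not necessarily the canonical answer): store the samples in *long* format, one row per number keyed by trial id, alongside a `trials` table carrying a `valid` flag. Both frames are then homogeneous, so they can be written as queryable HDF tables, and the ambiguity disappears: a trial absent from `data` but marked valid had zero-length data, while an invalid trial is marked invalid. The `results` list, the `valid` column name, and the reconstruction helper are all hypothetical; `format='table'` is the modern spelling of the old `table=True` keyword.

```python
import numpy as np
import pandas as pd

# Hypothetical ragged results: None = invalid trial, [] = valid but empty.
results = [None, [], [1.0, 2.0], [3.0], None, [4.0, 5.0, 6.0]]

# Long ("stacked") layout: one row per sample, keyed by trial number.
# Invalid (None) and empty trials simply contribute no rows.
rows = [(i, v) for i, r in enumerate(results) if r for v in r]
data = pd.DataFrame(rows, columns=['trial', 'value'])

# Companion table flags validity, so an absent trial is unambiguous.
trials = pd.DataFrame({'trial': range(len(results)),
                       'valid': [r is not None for r in results]})

# Both frames are homogeneous, so they store fine as HDF 'table' format
# and can be queried by trial id.
try:
    data.to_hdf('test.h5', 'data', format='table', data_columns=True)
    trials.to_hdf('test.h5', 'trials', format='table', data_columns=True)
except ImportError:
    pass  # PyTables not installed; the in-memory layout still works.

# Reconstruct the ragged structure for all trials.
grouped = {t: g['value'].to_numpy() for t, g in data.groupby('trial')}
restored = [grouped.get(t, np.array([])) if ok else None
            for t, ok in zip(trials['trial'], trials['valid'])]
```

With `data_columns=True`, individual trials can (if PyTables is available) be pulled back with a `where` query, e.g. `pd.read_hdf('test.h5', 'data', where='trial == 2')`, which avoids reading the whole `data` table in one go.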
 
