Reindexing pandas timeseries from object dtype to datetime dtype
    <p>I have a time series that is not recognized as a DatetimeIndex despite being indexed by standard YYYY-MM-DD strings with valid dates. Coercing them to a valid DatetimeIndex seems inelegant enough to make me think I'm doing something wrong.</p>
    <p>I read in (someone else's lazily formatted) data that contains invalid datetime values and remove these invalid observations.</p>
<pre><code>In [1]: df = pd.read_csv('data.csv',index_col=0)
In [2]: print df['2008-02-27':'2008-03-02']
Out[2]:
            count
2008-02-27     20
2008-02-28      0
2008-02-29     27
2008-02-30      0
2008-02-31      0
2008-03-01      0
2008-03-02     17

In [3]: def clean_timestamps(df):
    # remove invalid dates like '2008-02-30' and '2009-04-31'
    to_drop = list()
    for d in df.index:
        try:
            datetime.date(int(d[0:4]),int(d[5:7]),int(d[8:10]))
        except ValueError:
            to_drop.append(d)
    df2 = df.drop(to_drop,axis=0)
    return df2

In [4]: df2 = clean_timestamps(df)
In [5]: print df2['2008-02-27':'2008-03-02']
Out[5]:
            count
2008-02-27     20
2008-02-28      0
2008-02-29     27
2008-03-01      0
2008-03-02     17
</code></pre>
    <p>This new index is still only recognized as an 'object' dtype rather than a DatetimeIndex.</p>
<pre><code>In [6]: df2.index
Out[6]: Index([2008-01-01, 2008-01-02, 2008-01-03, ..., 2012-11-27, 2012-11-28, 2012-11-29], dtype=object)
</code></pre>
    <p>Reindexing produces NaNs because the string index and the new DatetimeIndex have different dtypes, so none of the labels match.</p>
<pre><code>In [7]: i = pd.date_range(start=min(df2.index),end=max(df2.index))
In [8]: df3 = df2.reindex(index=i,columns=['count'])
In [9]: df3['2008-02-27':'2008-03-02']
Out[9]:
            count
2008-02-27    NaN
2008-02-28    NaN
2008-02-29    NaN
2008-03-01    NaN
2008-03-02    NaN
</code></pre>
    <p>As a workaround, I create a fresh dataframe with the appropriate index, dump the data into a dictionary, then populate the new dataframe from the dictionary values (skipping missing values).</p>
<pre><code>In [10]: df3 = pd.DataFrame(columns=['count'],index=i)
In [11]: values = dict(df2['count'])
In [12]: for d in i:
    try:
        df3.set_value(index=d,col='count',value=values[d.isoformat()[0:10]])
    except KeyError:
        pass

In [13]: print df3['2008-02-27':'2008-03-02']
Out[13]:
            count
2008-02-27     20
2008-02-28      0
2008-02-29     27
2008-03-01      0
2008-03-02     17

In [14]: df3.index
Out[14]:
&lt;class 'pandas.tseries.index.DatetimeIndex'&gt;
[2008-01-01 00:00:00, ..., 2012-11-29 00:00:00]
Length: 1795, Freq: D, Timezone: None
</code></pre>
    <p>This last part, setting values based on lookups into a dictionary keyed by strings, seems especially hacky and makes me think I've missed something important.</p>
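For comparison, here is one possible shorter route to the same result. It is a minimal sketch, not necessarily what the asker's pandas version supported: it assumes a recent pandas in which `pd.to_datetime` accepts `errors="coerce"`. The inline CSV text and its `date` column name are hypothetical stand-ins for `data.csv`.

```python
import io

import pandas as pd

# Hypothetical stand-in for data.csv, reusing the rows shown in the question.
raw = io.StringIO(
    "date,count\n"
    "2008-02-27,20\n"
    "2008-02-28,0\n"
    "2008-02-29,27\n"
    "2008-02-30,0\n"
    "2008-02-31,0\n"
    "2008-03-01,0\n"
    "2008-03-02,17\n"
)
df = pd.read_csv(raw, index_col=0)

# Parse the string labels; impossible dates such as 2008-02-30 become NaT
# instead of raising an error.
parsed = pd.to_datetime(df.index, format="%Y-%m-%d", errors="coerce")
valid = parsed.notna()

# Keep only rows whose label parsed, and swap in the datetime labels.
df2 = df[valid].copy()
df2.index = parsed[valid]

# With matching dtypes, reindexing over a daily range aligns on the labels.
full_range = pd.date_range(start=df2.index.min(), end=df2.index.max(), freq="D")
df3 = df2.reindex(full_range)
```

Because the index labels are already datetimes before `reindex` is called, the labels line up with the daily range and the dictionary lookup is not needed; days absent from the source would still appear as NaN.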