Note that there are some explanatory texts on larger screens.

plurals
  1. POpandas time series join
    text
    copied!<p>I have two data frames one with events that may or may not trigger action. The other represents Actions with a field to show what triggered the action. My goal is to get as many trigger to action durations as I can. <BR> The keys are not under my control and not a perfect equal join where one and only one match is found. My though is to do the join then remove negative durations because I know triggers must precede actions. Then I need to remove all but the lowest duration for each key. See sample code. I want all the action fields ‘other_data’ in the result_df. (Only for the joins of minimum duration) <BR> I know that my result may include some artificially short durations but I suspect due to the nature of the join key reuse that this will happen seldom and do not pose a problem for the rest of the analysis. Previously I had been using arbitrarily selected max durations to toss out erroneous join data but tossing these outliers will be problematic because high outliers are what I am most interested in.</p> <pre><code>import numpy as np import pandas as pd scale = 10000 # can go low like 1000 or high like a million just for dummmy data trigger_df = pd.DataFrame({'seq_key' : np.random.random_integers(scale * .02,scale * .2,scale * 7), 'time_ns' : np.random.random_integers(scale * .04,scale * .4,scale * 7)}) action_df = pd.DataFrame({'seq_key' : np.random.random_integers(scale * .02,scale * .2,scale * .7), 'time_ns' : np.random.random_integers(scale * .04,scale * .4,scale * .7), 'other_data' : np.random.random_integers(scale * .03,scale * .3,scale * .7)}) action_df.drop_duplicates(cols='seq_key', take_last=True) join_df = pd.merge(action_df,trigger_df, on='seq_key', how='inner', suffixes=('_action','_trigger')) join_df['duration_ns'] = join_df.time_ns_action - join_df.time_ns_trigger join_df = join_df[join_df.duration_ns &gt; 0] grouped = join_df.groupby(by = ['seq_key'])['duration_ns'] result_df = pd.DataFrame({'duration_ns' : grouped.min()}).reset_index() print average(result_df.duration_ns) print average(join_df.duration_ns) print result_df ,'sad df wants other_data' </code></pre>
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload