Python Pandas -- Random sampling of time series
    <p>New to Pandas, looking for the most efficient way to do this.</p>
    <p>I have a Series of DataFrames. Each DataFrame has the same columns but a different index, and each is indexed by date. The Series is indexed by ticker symbol, so each item in the Series is a single time series of one stock's performance.</p>
    <p>I need to randomly generate a list of n DataFrames, where each DataFrame is a subset of some random assortment of the available stocks' histories. It's OK if there is overlap, so long as the start and end dates are different.</p>
    <p>The following code does it, but it's really slow, and I'm wondering if there's a better way to go about it:</p>
    <p><strong>Code</strong></p>
    <pre><code>def random_sample(data=None, timesteps=100, batch_size=100, subset='train'):
    if type(data) != pd.Series:
        return None

    if subset=='validate':
        offset = 0
    elif subset=='test':
        offset = 200
    elif subset=='train':
        offset = 400

    tickers = np.random.randint(0, len(data), size=len(data))

    ret_data = []
    while len(ret_data) != batch_size:
        for t in tickers:
            data_t = data[t]
            max_len = len(data_t)-timesteps-1
            if len(ret_data)==batch_size: break
            if max_len-offset &lt; 0: continue

            index = np.random.randint(offset, max_len)
            d = data_t[index:index+timesteps]
            if len(d)==timesteps: ret_data.append(d)

    return ret_data
</code></pre>
    <p><strong>Profile output:</strong></p>
    <pre><code>Timer unit: 1e-06 s

File: finance.py
Function: random_sample at line 137
Total time: 0.016142 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   137                                           @profile
   138                                           def random_sample(data=None, timesteps=100, batch_size=100, subset='train'):
   139         1            5      5.0      0.0      if type(data) != pd.Series:
   140                                                   return None
   141
   142         1            1      1.0      0.0      if subset=='validate':
   143                                                   offset = 0
   144         1            1      1.0      0.0      elif subset=='test':
   145                                                   offset = 200
   146         1            0      0.0      0.0      elif subset=='train':
   147         1            1      1.0      0.0          offset = 400
   148
   149         1         1835   1835.0     11.4      tickers = np.random.randint(0, len(data), size=len(data))
   150
   151         1            2      2.0      0.0      ret_data = []
   152         2            3      1.5      0.0      while len(ret_data) != batch_size:
   153       116          148      1.3      0.9          for t in tickers:
   154       116         2497     21.5     15.5              data_t = data[t]
   155       116          317      2.7      2.0              max_len = len(data_t)-timesteps-1
   156       116           80      0.7      0.5              if len(ret_data)==batch_size: break
   157       115           69      0.6      0.4              if max_len-offset &lt; 0: continue
   158
   159       100          101      1.0      0.6              index = np.random.randint(offset, max_len)
   160       100        10840    108.4     67.2              d = data_t[index:index+timesteps]
   161       100          241      2.4      1.5              if len(d)==timesteps: ret_data.append(d)
   162
   163         1            1      1.0      0.0      return ret_data
</code></pre>
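One possible direction, offered only as a sketch and not as a measured fix: the profile attributes most of the time to the slice `data_t[index:index+timesteps]` and to the lookup `data[t]`. The version below pulls the DataFrames out of the Series once, filters out histories that are too short up front so no retry loop is needed, and slices by position with `.iloc`. The names `random_windows`, `OFFSETS`, and the `rng` parameter are hypothetical and not from the question. It also allows every valid start position, which differs slightly from the original's `max_len = len(data_t)-timesteps-1` bound.

```python
import numpy as np
import pandas as pd

# Hypothetical lookup mirroring the question's 'validate'/'test'/'train' branches.
OFFSETS = {"validate": 0, "test": 200, "train": 400}


def random_windows(data, timesteps=100, batch_size=100, subset="train", rng=None):
    """Return batch_size random windows of `timesteps` rows each.

    `data` is assumed to be a Series whose values are DataFrames, as in the question.
    """
    rng = np.random.default_rng() if rng is None else rng
    offset = OFFSETS[subset]

    # Extract the DataFrames once instead of indexing the Series inside the loop.
    frames = list(data)

    # Keep only histories long enough to hold a full window starting at `offset`.
    eligible = [f for f in frames if len(f) - timesteps >= offset]
    if not eligible:
        return []

    # Choose which stock each window comes from (with replacement).
    picks = rng.integers(0, len(eligible), size=batch_size)

    windows = []
    for i in picks:
        frame = eligible[i]
        # The upper bound is exclusive, so the last valid start is len(frame) - timesteps.
        start = rng.integers(offset, len(frame) - timesteps + 1)
        # Positional slicing always yields exactly `timesteps` rows here.
        windows.append(frame.iloc[start:start + timesteps])
    return windows
```

Calling `random_windows(data, timesteps=100, batch_size=100, subset='train')` would return a list of 100 DataFrames. Whether this is actually faster on a given dataset is something to confirm by profiling it the same way as the original.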