StackOverflow2013


plurals

POPython Pandas -- Random sampling of time series
primarykey
Id
13239297
data
AcceptedAnswerId
0
AnswerCount
1
ClosedDate
CommentCount
2
CommunityOwnedDate
CreationDate
2012-11-05T19:51:07.997
FavoriteCount
0
LastActivityDate
2012-11-05T21:18:51.330
LastEditDate
LastEditorUserId
0
OwnerUserId
544178
ParentId
0
PostTypeId
1
Score
4
ViewCount
2324
LastEditorDisplayName
text
Body
New to Pandas, looking for the most efficient way to do this.

I have a Series of DataFrames. Each DataFrame has the same columns but a different index, and each is indexed by date. The Series is indexed by ticker symbol, so each item in the Series represents a single time series of one stock's performance.

I need to randomly generate a list of n DataFrames, where each DataFrame is a window of consecutive rows taken from the history of a randomly chosen stock. It's OK if there is overlap, so long as the start and end dates are different.

The following code does it, but it's really slow, and I'm wondering if there's a better way to go about it.

Code:

<pre><code>import numpy as np
import pandas as pd

def random_sample(data=None, timesteps=100, batch_size=100, subset='train'):
    if type(data) != pd.Series:
        return None

    if subset=='validate':
        offset = 0
    elif subset=='test':
        offset = 200
    elif subset=='train':
        offset = 400

    tickers = np.random.randint(0, len(data), size=len(data))

    ret_data = []
    while len(ret_data) != batch_size:
        for t in tickers:
            data_t = data[t]
            max_len = len(data_t)-timesteps-1
            if len(ret_data)==batch_size: break
            if max_len-offset < 0: continue

            index = np.random.randint(offset, max_len)
            d = data_t[index:index+timesteps]
            if len(d)==timesteps: ret_data.append(d)

    return ret_data
</code></pre>

Profile output:

<pre><code>Timer unit: 1e-06 s

File: finance.py
Function: random_sample at line 137
Total time: 0.016142 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   137                                           @profile
   138                                           def random_sample(data=None, timesteps=100, batch_size=100, subset='train'):
   139         1            5      5.0      0.0      if type(data) != pd.Series:
   140                                                   return None
   141
   142         1            1      1.0      0.0      if subset=='validate':
   143                                                   offset = 0
   144         1            1      1.0      0.0      elif subset=='test':
   145                                                   offset = 200
   146         1            0      0.0      0.0      elif subset=='train':
   147         1            1      1.0      0.0          offset = 400
   148
   149         1         1835   1835.0     11.4      tickers = np.random.randint(0, len(data), size=len(data))
   150
   151         1            2      2.0      0.0      ret_data = []
   152         2            3      1.5      0.0      while len(ret_data) != batch_size:
   153       116          148      1.3      0.9          for t in tickers:
   154       116         2497     21.5     15.5              data_t = data[t]
   155       116          317      2.7      2.0              max_len = len(data_t)-timesteps-1
   156       116           80      0.7      0.5              if len(ret_data)==batch_size: break
   157       115           69      0.6      0.4              if max_len-offset < 0: continue
   158
   159       100          101      1.0      0.6              index = np.random.randint(offset, max_len)
   160       100        10840    108.4     67.2              d = data_t[index:index+timesteps]
   161       100          241      2.4      1.5              if len(d)==timesteps: ret_data.append(d)
   162
   163         1            1      1.0      0.0      return ret_data
</code></pre>
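For illustration only, here is a minimal sketch of one possible restructuring of the sampling described above. It is not an accepted answer to the question. It filters out tickers that are too short before sampling, draws every ticker and start position in one vectorized call, and slices by position with `.iloc`. The names `random_sample_fast`, `SUBSET_OFFSETS`, and the `rng` parameter are hypothetical and do not appear in the question. One deliberate difference, which is an assumption about the intent: the last valid start row is taken to be `len(df) - timesteps`, whereas the original stops two rows earlier.

```python
import numpy as np
import pandas as pd

# Offsets copied from the question; storing them in a dict is an illustrative choice.
SUBSET_OFFSETS = {"validate": 0, "test": 200, "train": 400}


def random_sample_fast(data, timesteps=100, batch_size=100, subset="train", rng=None):
    """Hypothetical variant of random_sample: pre-filter tickers, then draw all indices at once."""
    if not isinstance(data, pd.Series):
        return None
    offset = SUBSET_OFFSETS[subset]
    rng = np.random.default_rng() if rng is None else rng

    frames = data.values  # object array of DataFrames; avoids Series lookups in the loop
    # Highest start row (inclusive) that still leaves `timesteps` rows in each frame.
    max_start = np.array([len(df) for df in frames]) - timesteps
    eligible = np.flatnonzero(max_start >= offset)
    if eligible.size == 0:
        raise ValueError("no ticker has enough rows for this subset")

    # Pick tickers with replacement, then one random start row per picked ticker.
    picks = rng.choice(eligible, size=batch_size)
    starts = rng.integers(offset, max_start[picks] + 1)

    # Positional slicing; every window has exactly `timesteps` rows by construction.
    return [frames[t].iloc[s:s + timesteps] for t, s in zip(picks, starts)]
```

Because ineligible tickers are dropped up front, the list always has exactly `batch_size` entries and the outer `while` loop is no longer needed. Whether this is faster in practice depends on the pandas version and the data, since the per-window slice is still performed once per sample, so it would need to be profiled the same way as the original.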
Tags
<python><pandas>
Title
Python Pandas -- Random sampling of time series
singulars
PostAcceptedAnswerId
1. This table or related slice is empty.
PostParentId
1. This table or related slice is empty.
PostTypePostTypeId
1. PTQuestion
UserLastEditorUserId
1. This table or related slice is empty.
UserOwnerUserId
1. USDave S
plurals
PostLinksPostIdRelatedPostId
1. This table or related slice is empty.
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. This table or related slice is empty.
PostsParentIdCreationDate
1. PO
 singulars
 PostTypePostTypeId
 PTAnswer
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 POPython Pandas -- Random sampling of time series
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
2. VO
 singulars
 PostPostId
 POPython Pandas -- Random sampling of time series
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
3. VO
 singulars
 PostPostId
 POPython Pandas -- Random sampling of time series
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
CommentsPostId
1. COClarification: The number of dataframes returned is random and the start date for each dataframe's daterange is random. But the number of timesteps is passed to the function so each dataframe will have the same number of rows. Correct?
 singulars
 PostPostId
 POPython Pandas -- Random sampling of time series
 UserUserId
 USAman
2. COCorrect. And the number of dataframes returned should always be batch_size, it just doesn't matter which stocks are used and how many times each stock is sampled--but it should all be random, so it can't just be a fixed window of stocks each time, for instance.
 singulars
 PostPostId
 POPython Pandas -- Random sampling of time series
 UserUserId
 USDave S
