StackOverflow2013

Python Pandas - returning results of groupby function back to parent table
primarykey
Id
17549596
data
AcceptedAnswerId
17551924
AnswerCount
1
ClosedDate
CommentCount
3
CommunityOwnedDate
CreationDate
2013-07-09T13:34:18.733
FavoriteCount
1
LastActivityDate
2013-07-09T15:15:13.197
LastEditDate
2013-07-09T13:45:27.930
LastEditorUserId
2445114
OwnerUserId
2445114
ParentId
0
PostTypeId
1
Score
1
ViewCount
1438
LastEditorDisplayName
text
Body
[Using Python3] I'm using pandas to read a csv file, group the dataframe, apply a function to the grouped data and add these results back to the original dataframe. My input looks like this:
<pre><code>email                cc  timebucket  total_value
john@john.com        us  1           110.50
example@example.com  uk  3           208.84
...                  ... ...         ...
</code></pre>
Basically I'm trying to group by <code>cc</code> and calculate the percentile rank for each value in <code>total_value</code> within that group. Secondly I want to apply a flow statement to these results. I need these results to be added back to the original/parent DataFrame. Such that it would look something like this:
<pre><code>email                cc  timebucket  total_value  percentrank  rankbucket
john@john.com        us  1           110.50       48.59        mid50
example@example.com  uk  3           208.84       99.24        top25
...                  ... ...         ...          ...          ...
</code></pre>
The code below gives me an <code>AssertionError</code> and I cannot figure out why. I'm very new to Python and pandas, so that might explain one and another. Code:
<pre><code>import pandas as pd
import numpy as np
from scipy.stats import rankdata

def percentilerank(frame, groupkey='cc', rankkey='total_value'):
    from pandas.compat.scipy import percentileofscore
    # Technically the below percentileofscore function should do the trick but I cannot
    # get that to work, hence the alternative below. It would be great if the answer would
    # include both so that I can understand why one works and the other doesnt.
    # func = lambda x, score: percentileofscore(x[rankkey], score, kind='mean')
    func = lambda x: (rankdata(x.total_value)-1)/(len(x.total_value)-1)*100
    frame['percentrank'] = frame.groupby(groupkey).transform(func)

def calc_and_write(filename):
    """
    Function reads the file (must be tab-separated) and stores in a pandas DataFrame.
    Next, the percentile rank score based is calculated based on total_value and is done so within a country.
    Secondly, based on the percentile rank score (prs) a row is assigned to one of three buckets:
        rankbucket = 'top25' if prs > 75
        rankbucket = 'mid50' if 25 > prs < 75
        rankbucket = 'bottom25' if prs < 25
    """
    # Define headers for pandas to read in DataFrame, stored in a list
    headers = [
        'email', # 0
        'cc', # 1
        'last_trans_date', # 3
        'timebucket', # 4
        'total_value', # 5
    ]
    # Reading csv file in chunks and creating an iterator (is supposed to be much faster than reading at once)
    tp = pd.read_csv(filename, delimiter='\t', names=headers, iterator=True, chunksize=50000)
    # Concatenating the chunks and sorting total DataFrame by booker_cc and total_nett_spend
    df = pd.concat(tp, ignore_index=True).sort(['cc', 'total_value'], ascending=False)
    percentilerank(df)
</code></pre>
Edit: As requested, this is the traceback log:
<pre><code>Traceback (most recent call last):
  File "C:\Users\m\Documents\Python\filter_n_split_3.py", line 85, in <module>
    print(calc_and_write('tsv/test.tsv'))
  File "C:\Users\m\Documents\Python\filter_n_split_3.py", line 74, in calc_and_write
    percentilerank(df)
  File "C:\Users\m\Documents\Python\filter_n_split_3.py", line 33, in percentilerank
    frame['percentrank'] = frame.groupby(groupkey).transform(func)
  File "C:\Python33\lib\site-packages\pandas\core\groupby.py", line 1844, in transform
    axis=self.axis, verify_integrity=False)
  File "C:\Python33\lib\site-packages\pandas\tools\merge.py", line 894, in concat
    verify_integrity=verify_integrity)
  File "C:\Python33\lib\site-packages\pandas\tools\merge.py", line 964, in __init__
    self.new_axes = self._get_new_axes()
  File "C:\Python33\lib\site-packages\pandas\tools\merge.py", line 1124, in _get_new_axes
    assert(len(self.join_axes) == ndim - 1)
AssertionError
</code></pre>
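For illustration only, here is a minimal sketch of the result the question describes: a per-country percentile rank written back to the original rows, followed by the three-way bucket. It is not taken from any answer to this post and assumes a current pandas release rather than the 2013 version shown in the traceback. The names <code>add_percent_rank</code> and <code>bucket</code> are hypothetical, and the sample rows are made up. The sketch selects the single <code>total_value</code> column before ranking, so each step yields one Series aligned with the original index. Rows with a percentile rank of exactly 25 or 75 go to <code>mid50</code>, which is an assumption, since the docstring leaves those boundaries open.

```python
import pandas as pd


def bucket(prs):
    """Map a percentile rank score to a bucket label (boundaries assumed)."""
    if pd.isna(prs):
        return None  # single-row groups have no defined percentile rank
    if prs > 75:
        return "top25"
    if prs < 25:
        return "bottom25"
    return "mid50"


def add_percent_rank(frame, groupkey="cc", rankkey="total_value"):
    """Return a copy of frame with 'percentrank' and 'rankbucket' columns."""
    out = frame.copy()
    grouped = out.groupby(groupkey)[rankkey]
    ranks = grouped.rank(method="average")  # 1-based rank within each group
    sizes = grouped.transform("count")      # group size repeated on every row
    denom = (sizes - 1).where(sizes > 1)    # NaN instead of dividing by zero
    # Same formula as the question's lambda: (rank - 1) / (n - 1) * 100
    out["percentrank"] = (ranks - 1) / denom * 100
    out["rankbucket"] = out["percentrank"].map(bucket)
    return out


# Made-up sample data in the shape of the question's input
df = pd.DataFrame({
    "email": ["a@x.com", "b@x.com", "c@x.com", "d@x.com", "e@x.com",
              "f@x.com", "g@x.com"],
    "cc": ["us", "us", "us", "us", "us", "uk", "uk"],
    "timebucket": [1, 1, 2, 2, 3, 3, 1],
    "total_value": [10.0, 20.0, 30.0, 40.0, 50.0, 5.0, 15.0],
})
result = add_percent_rank(df)
print(result)
```

Because both <code>rank</code> and <code>transform</code> return values indexed like the input, the new columns line up with the parent rows without a merge, and the original row order is kept.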
Tags
<python><csv><python-3.x><pandas>
Title
Python Pandas - returning results of groupby function back to parent table
singulars
PostAcceptedAnswerId
1. PO
 singulars
 PostTypePostTypeId
 PTAnswer
PostParentId
1. This table or related slice is empty.
PostTypePostTypeId
1. Question
UserLastEditorUserId
1. Matthijs
UserOwnerUserId
1. Matthijs
plurals
PostLinksPostIdRelatedPostId
1. This table or related slice is empty.
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. This table or related slice is empty.
PostsParentIdCreationDate
1. PO
 singulars
 PostTypePostTypeId
 PTAnswer
VotesPostIdCreationDate
1. This table or related slice is empty.
CommentsPostId