Python Pandas - returning results of groupby function back to parent table
<p>[Using Python3] I'm using pandas to read a csv file, group the dataframe, apply a function to the grouped data and add these results back to the original dataframe.</p>
<p>My input looks like this:</p>
<pre><code>email                cc   timebucket  total_value
john@john.com        us   1           110.50
example@example.com  uk   3           208.84
...                  ...  ...         ...
</code></pre>
<p>Basically I'm trying to group by <code>cc</code> and calculate the percentile rank for each value in <code>total_value</code> within that group. Secondly I want to apply a flow statement to these results. I need these results to be added back to the original/parent DataFrame. Such that it would look something like this:</p>
<pre><code>email                cc   timebucket  total_value  percentrank  rankbucket
john@john.com        us   1           110.50       48.59        mid50
example@example.com  uk   3           208.84       99.24        top25
...                  ...  ...         ...          ...          ...
</code></pre>
<p>The code below gives me an <code>AssertionError</code> and I cannot figure out why. I'm very new to Python and pandas, which may explain a thing or two.</p>
<p>Code:</p>
<pre><code>import pandas as pd
import numpy as np
from scipy.stats import rankdata

def percentilerank(frame, groupkey='cc', rankkey='total_value'):
    from pandas.compat.scipy import percentileofscore

    # Technically the below percentileofscore function should do the trick but I cannot
    # get that to work, hence the alternative below. It would be great if the answer would
    # include both so that I can understand why one works and the other doesn't.

    # func = lambda x, score: percentileofscore(x[rankkey], score, kind='mean')
    func = lambda x: (rankdata(x.total_value)-1)/(len(x.total_value)-1)*100
    frame['percentrank'] = frame.groupby(groupkey).transform(func)


def calc_and_write(filename):
    """
    Function reads the file (must be tab-separated) and stores in a pandas DataFrame.
    Next, the percentile rank score is calculated based on total_value and is done so within a country.
    Secondly, based on the percentile rank score (prs) a row is assigned to one of three buckets:
        rankbucket = 'top25' if prs &gt; 75
        rankbucket = 'mid50' if 25 &lt; prs &lt; 75
        rankbucket = 'bottom25' if prs &lt; 25
    """

    # Define headers for pandas to read in DataFrame, stored in a list
    headers = [
        'email', # 0
        'cc', # 1
        'last_trans_date', # 3
        'timebucket', # 4
        'total_value', # 5
    ]

    # Reading csv file in chunks and creating an iterator (is supposed to be much faster than reading at once)
    tp = pd.read_csv(filename, delimiter='\t', names=headers, iterator=True, chunksize=50000)

    # Concatenating the chunks and sorting total DataFrame by booker_cc and total_nett_spend
    df = pd.concat(tp, ignore_index=True).sort(['cc', 'total_value'], ascending=False)

    percentilerank(df)
</code></pre>
<p>Edit: As requested, this is the traceback log:</p>
<pre><code>Traceback (most recent call last):
  File "C:\Users\m\Documents\Python\filter_n_split_3.py", line 85, in &lt;module&gt;
    print(calc_and_write('tsv/test.tsv'))
  File "C:\Users\m\Documents\Python\filter_n_split_3.py", line 74, in calc_and_write
    percentilerank(df)
  File "C:\Users\m\Documents\Python\filter_n_split_3.py", line 33, in percentilerank
    frame['percentrank'] = frame.groupby(groupkey).transform(func)
  File "C:\Python33\lib\site-packages\pandas\core\groupby.py", line 1844, in transform
    axis=self.axis, verify_integrity=False)
  File "C:\Python33\lib\site-packages\pandas\tools\merge.py", line 894, in concat
    verify_integrity=verify_integrity)
  File "C:\Python33\lib\site-packages\pandas\tools\merge.py", line 964, in __init__
    self.new_axes = self._get_new_axes()
  File "C:\Python33\lib\site-packages\pandas\tools\merge.py", line 1124, in _get_new_axes
    assert(len(self.join_axes) == ndim - 1)
AssertionError
</code></pre>
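For reference, the computation described in the question (percentile rank of total_value within each cc group, followed by a three-way bucket) can be sketched as below. This is a minimal sketch written against a recent pandas release, not a diagnosis of the AssertionError. It selects the single total_value column before ranking rather than calling transform on the whole grouped frame. The function name add_percentile_rank and the sample rows are made up for illustration, and rows with a percentile rank of exactly 25 or 75 are assumed to belong to mid50.

```python
import numpy as np
import pandas as pd


def add_percentile_rank(frame, groupkey="cc", rankkey="total_value"):
    """Return a copy of frame with 'percentrank' and 'rankbucket' columns added."""
    out = frame.copy()
    grouped = out.groupby(groupkey)[rankkey]

    # Average rank within each group (ties share a rank), which is also
    # what scipy.stats.rankdata does by default.
    rank = grouped.rank(method="average")
    # Group size, repeated for every row of the parent frame.
    size = grouped.transform("count")

    # Same formula as in the question: (rank - 1) / (n - 1) * 100.
    # A group with a single row yields 0/0, which pandas reports as NaN.
    out["percentrank"] = (rank - 1) / (size - 1) * 100

    # Assumption: exactly 25 or 75 falls into 'mid50'; NaN also ends up there.
    out["rankbucket"] = np.select(
        [out["percentrank"] > 75, out["percentrank"] < 25],
        ["top25", "bottom25"],
        default="mid50",
    )
    return out


# Made-up sample rows, for illustration only.
sample = pd.DataFrame({
    "email": ["a@example.com", "b@example.com", "c@example.com",
              "d@example.com", "e@example.com"],
    "cc": ["us", "us", "us", "uk", "uk"],
    "timebucket": [1, 2, 3, 1, 3],
    "total_value": [110.50, 20.00, 300.00, 208.84, 50.00],
})
result = add_percentile_rank(sample)
print(result)
```

Because both rank and transform return results aligned to the index of the original frame, the new columns can be assigned directly without a merge or a prior sort.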