Questions about pandas: expanding multivalued column, inverting and grouping
<p>I was looking into pandas to do some simple calculations for NLP and text mining, but I couldn't quite grasp how to do them.</p>
<p>Suppose I have the following data frame, relating people's names to their gender:</p>
<pre><code>import pandas

people = {'name': ['John Doe', 'Mary Poppins', 'Jane Doe', 'John Cusack'],
          'gender': ['M', 'F', 'F', 'M']}
df = pandas.DataFrame(people)
</code></pre>
<p>For all rows I want to:</p>
<ol>
<li>determine the first name</li>
<li>determine a list of 3-shingles (sequences of 3 letters contained in a word) derived from the person's first name</li>
<li>determine, for each shingle, how many males and females have that shingle in their names.</li>
</ol>
<p>The goal is to use this as a data set to train a classifier which can determine whether a given name is more likely a male or a female name.</p>
<p>The first two operations are quite straightforward:</p>
<pre><code>def shingles(word, n=3):
    # all substrings of length n, e.g. 'john' -&gt; ['joh', 'ohn']
    return [word[i:i + n] for i in range(len(word) - n + 1)]

df['firstname'] = df.name.map(lambda x: x.split()[0])
# lowercase first so that 'John' and 'john' yield the same shingles
df['shingles'] = df.firstname.str.lower().map(shingles)
</code></pre>
<p>The result is:</p>
<pre><code>&gt; print(df)
  gender          name firstname        shingles
0      M      John Doe      John  ['joh', 'ohn']
1      F  Mary Poppins      Mary  ['mar', 'ary']
2      F      Jane Doe      Jane  ['jan', 'ane']
3      M   John Cusack      John  ['joh', 'ohn']
</code></pre>
<p>Now, the next step should be done by constructing a new data frame with two columns, gender and shingle, which should contain something like:</p>
<pre><code>  gender shingle
0      M     joh
1      M     ohn
2      F     mar
3      F     ary
(...)
</code></pre>
<p>And then I could group by shingle and gender. Ideally, the result would be:</p>
<pre><code>  shingle  num_males  num_females
0     joh          2            0
1     ohn          2            0
2     mar          0            1
3     ary          0            1
(...)
</code></pre>
<p>Is there an easy way to expand the multivalued column <code>shingles</code> so that each row produces multiple rows, one for each value found in the list of shingles?</p>
<p>Also, if I <code>groupby</code> the column <code>shingle</code>, how easy is it to produce different columns with the count for each possible value of the column <code>gender</code>?</p>
<hr>
<p>I managed to understand the second part. As an example, to calculate how many males and females there are for each <code>firstname</code>:</p>
<pre><code>def countMaleFemale(df):
    return pandas.Series({'males': df.gender[df.gender == 'M'].count(),
                          'females': df.gender[df.gender == 'F'].count()})

grouped = df.groupby('firstname')
</code></pre>
<p>And then:</p>
<blockquote> <p>print(grouped.apply(countMaleFemale))</p> </blockquote>
<pre><code>           females  males
firstname
Jane             1      0
John             0      2
Mary             1      0
</code></pre>
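<p>For reference, the expansion and per-gender counts asked about above could be sketched as follows. This is only one possible approach, not necessarily the best one: it assumes a pandas version that provides <code>DataFrame.explode</code> (0.25 or later), lowercases the first name inside <code>shingles</code> so the output matches the tables in the question, and uses <code>long_df</code> and <code>counts</code> as illustrative names.</p>

```python
import pandas as pd


def shingles(word, n=3):
    """Return all substrings of length n of the lowercased word."""
    word = word.lower()
    return [word[i:i + n] for i in range(len(word) - n + 1)]


people = {'name': ['John Doe', 'Mary Poppins', 'Jane Doe', 'John Cusack'],
          'gender': ['M', 'F', 'F', 'M']}
df = pd.DataFrame(people)
df['firstname'] = df['name'].map(lambda x: x.split()[0])
df['shingles'] = df['firstname'].map(shingles)

# Expand the list-valued column: one row per (gender, shingle) pair.
# Assumption: DataFrame.explode is available (pandas >= 0.25).
long_df = (df[['gender', 'shingles']]
           .explode('shingles')
           .rename(columns={'shingles': 'shingle'})
           .reset_index(drop=True))

# Cross-tabulate shingle against gender, giving one count column per gender.
counts = pd.crosstab(long_df['shingle'], long_df['gender'])
counts = counts.rename(columns={'M': 'num_males', 'F': 'num_females'})

print(long_df)
print(counts)
```

<p>Here <code>pd.crosstab</code> stands in for the group-by-and-count step; grouping on both columns with <code>long_df.groupby(['shingle', 'gender']).size().unstack(fill_value=0)</code> should give an equivalent table.</p>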