How can I structure and recode messy categorical data in R?
<p>I'm struggling with how best to structure messy categorical data from a <a href="http://docs.google.com/leaf?id=0BypDSvtB33v7MjQ0OTBmYjQtMjBlYy00MDJjLWJmZmQtMjkxM2VhMDNmMGFl&amp;hl=en" rel="nofollow noreferrer">dataset</a> that I'll need to clean.</p>
<h3>The Coding Scheme</h3>
<p>I'm analyzing data from a university science course exam. We're looking at patterns in student responses, and we developed a coding scheme to represent the kinds of things students do in their answers. A subset of the coding scheme is shown below.</p>
<p><a href="http://picasaweb.google.com/lh/photo/0tut3kR-JFoB0cP_0uFBZg?feat=embedwebsite" rel="nofollow noreferrer"><img src="http://lh5.ggpht.com/_TvbgXH78cQc/S-CknUrDwuI/AAAAAAAACGo/0umklpa2968/s400/StackOverflowQuestion20100504.001.png" /></a></p>
<p>Note that within each major code (1, 2, 3) are nested, non-unique sub-codes (a, b, ...).</p>
<h3>What the Raw Data Looks Like</h3>
<p>I've created an anonymized, raw subset of my actual data, which you can view <a href="http://docs.google.com/leaf?id=0BypDSvtB33v7MjQ0OTBmYjQtMjBlYy00MDJjLWJmZmQtMjkxM2VhMDNmMGFl&amp;hl=en" rel="nofollow noreferrer">here</a>. Part of my problem is that the people who coded the data noticed that some students displayed multiple patterns. The coders' solution was to create enough columns (<code>reason1</code>, <code>reason2</code>, ...) to hold students with multiple patterns. That matters because the order (<code>reason1</code>, <code>reason2</code>) is arbitrary: two students (like student 41 and student 42 in my <a href="http://docs.google.com/leaf?id=0BypDSvtB33v7MjQ0OTBmYjQtMjBlYy00MDJjLWJmZmQtMjkxM2VhMDNmMGFl&amp;hl=en" rel="nofollow noreferrer">dataset</a>) who correctly applied "dependency" should both register in an analysis, regardless of whether <code>3a</code> appears in the <code>reason</code> column or the <code>reason2</code> column.</p>
<h3>How Can I Best Structure Student Data?</h3>
<p>Another part of my problem is that in the <a href="http://docs.google.com/leaf?id=0BypDSvtB33v7MjQ0OTBmYjQtMjBlYy00MDJjLWJmZmQtMjkxM2VhMDNmMGFl&amp;hl=en" rel="nofollow noreferrer">raw data</a>, not all students display the same patterns, or the same number of them, in the same order. Some students may do just one thing; others may do several. So, an abstracted representation of example students might look like this:</p>
<p><a href="http://picasaweb.google.com/lh/photo/sQgGKgseA07Z_lKxRe4fkQ?feat=embedwebsite" rel="nofollow noreferrer"><img src="http://lh3.ggpht.com/_TvbgXH78cQc/S-CknlLcnqI/AAAAAAAACGs/M5oK9nMELvc/s400/StackOverflowQuestion20100504.002.png" /></a></p>
<p>Note in the example above that <code>student002</code> and <code>student003</code> are both coded as "1b", although I've deliberately shown the order as different to reflect the reality of <a href="http://docs.google.com/leaf?id=0BypDSvtB33v7MjQ0OTBmYjQtMjBlYy00MDJjLWJmZmQtMjkxM2VhMDNmMGFl&amp;hl=en" rel="nofollow noreferrer">my data</a>.</p>
<h3>My (Practical) Questions</h3>
<ol>
<li>Should I concatenate <code>reason1</code>, <code>reason2</code>, <code>...</code> into one column?</li>
<li>How can I (re)code the <code>reason</code>s in R to reflect the multiplicity for some students?</li>
</ol>
<h3>Thanks</h3>
<p>I realize this question is as much about good data conceptualization as it is about specific features of R, but I thought it would be appropriate to ask it here. If you feel it's inappropriate for me to ask the question, please let me know in the comments, and Stack Overflow will automatically flood my inbox with sadface emoticons. If I haven't been specific enough, please let me know and I'll do my best to be clearer.</p>
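<p>To make the two practical questions concrete, here is a minimal sketch in base R of one possible restructuring: stacking the <code>reason</code> columns into a long table with one row per student and code, so that column order stops mattering. The data frame below is a hypothetical toy example, not the real dataset; the student labels, the <code>reason1</code>/<code>reason2</code> column names, and the assumption that every code is one digit followed by one letter are all illustrative.</p>

```r
# Hypothetical toy data in the wide layout described above (NOT the real dataset).
# Students 41 and 42 both have "3a", but in different columns.
wide <- data.frame(
  student = c("student041", "student042", "student043"),
  reason1 = c("3a", "1b", "2a"),
  reason2 = c(NA, "3a", NA),
  stringsAsFactors = FALSE
)

# Stack every reason column into one row per (student, code) pair.
reason_cols <- grep("^reason", names(wide), value = TRUE)
long <- data.frame(
  student = rep(wide$student, times = length(reason_cols)),
  code    = unlist(wide[reason_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)
long <- long[!is.na(long$code), ]  # drop unused reason slots

# Assumption: each code is one digit (major code) plus one letter (sub-code).
long$major <- substr(long$code, 1, 1)
long$sub   <- substr(long$code, 2, 2)

# Student-by-code indicator matrix; TRUE means the student showed that pattern.
indicator <- table(long$student, long$code) > 0
```

<p>Under these assumptions, <code>indicator[, "3a"]</code> picks out both students who applied "dependency", regardless of which column the code was originally entered in.</p>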
 
