StackOverflow2013

<p>You have asked quite a broad question, but I will try to be as precise as I can. First, a note of caution: every statistical analysis method carries a set of implicit assumptions. If you rely on the results of a statistical model without understanding the limitations of the analysis, you could quite easily draw the wrong conclusion.</p>
<p>It is also not quite clear to me what you mean by classification. If somebody asked me to do a classification analysis, I would probably consider things like cluster analysis, factor analysis or latent class analysis. There are some variants of linear regression modelling that could also be applicable.</p>
<p>That said, here is how you could go about fitting a linear regression to your data.</p>
<p>First, replicate your sample data:</p>
<pre><code>dat <- structure(list(B = c(1L, 0L, 1L, 1L, 1L), T = c(1L, 0L, 0L, 1L, 0L),
    H = c(1L, 0L, 0L, 1L, 1L), G = c(0L, 1L, 1L, 1L, 0L),
    S = c(1L, 1L, 0L, 1L, 1L), Z = c(0L, 0L, 0L, 0L, 1L)),
    .Names = c("B", "T", "H", "G", "S", "Z"), class = "data.frame",
    row.names = c("Golf", "Football", "Hockey", "Golf2", "Snooker"))

dat
         B T H G S Z
Golf     1 1 1 0 1 0
Football 0 0 0 1 1 0
Hockey   1 0 0 1 0 0
Golf2    1 1 1 1 1 0
Snooker  1 0 1 0 1 1
</code></pre>
<p>Next, add the expected values:</p>
<pre><code>dat$expected <- c(1,2,3,1,4)
dat
         B T H G S Z expected
Golf     1 1 1 0 1 0        1
Football 0 0 0 1 1 0        2
Hockey   1 0 0 1 0 0        3
Golf2    1 1 1 1 1 0        1
Snooker  1 0 1 0 1 1        4
</code></pre>
<p>Finally, we can start the analysis. Fortunately, <code>lm</code> has a shortcut to tell it to use all of the other columns in your data frame as predictors. To do this, use the formula <code>expected~.</code>:</p>
<pre><code>fit <- lm(expected~., dat)
summary(fit)

Call:
lm(formula = expected ~ ., data = dat)

Residuals:
ALL 5 residuals are 0: no residual degrees of freedom!

Coefficients: (2 not defined because of singularities)
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.00e+00         NA      NA       NA
B            1.00e+00         NA      NA       NA
T           -3.00e+00         NA      NA       NA
H            1.00e+00         NA      NA       NA
G           -4.71e-16         NA      NA       NA
S                  NA         NA      NA       NA
Z                  NA         NA      NA       NA

Residual standard error: NaN on 0 degrees of freedom
Multiple R-squared:     1,	Adjusted R-squared:   NaN
F-statistic:   NaN on 4 and 0 DF,  p-value: NA
</code></pre>
<p>And a last word of caution. Your sample data has fewer rows (5) than coefficients to estimate (6 columns plus the intercept), so the linear regression model does not have enough data to estimate all of them. In this case <code>lm</code> left the coefficients for the last two columns (<code>S</code> and <code>Z</code>) undefined, and with zero residual degrees of freedom it cannot compute standard errors or p-values for the others. Your brief description of your data seems to indicate that you have far more rows than columns, so it ought not to be a problem for you.</p>
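<p>If you want to check for this situation on your real data before reading anything into the coefficients, one option is to compare the number of rows with the number of coefficients and see which coefficients come back as <code>NA</code>. The following is only a minimal sketch on the sample data; the names <code>n_obs</code>, <code>n_coef</code> and <code>dropped</code> are my own and purely illustrative.</p>
<pre><code># Rebuild the sample data and the expected values used above
dat <- data.frame(
  B = c(1L, 0L, 1L, 1L, 1L), T = c(1L, 0L, 0L, 1L, 0L),
  H = c(1L, 0L, 0L, 1L, 1L), G = c(0L, 1L, 1L, 1L, 0L),
  S = c(1L, 1L, 0L, 1L, 1L), Z = c(0L, 0L, 0L, 0L, 1L),
  row.names = c("Golf", "Football", "Hockey", "Golf2", "Snooker")
)
dat$expected <- c(1, 2, 3, 1, 4)

fit <- lm(expected ~ ., data = dat)

n_obs   <- nrow(dat)          # rows available to the model
n_coef  <- length(coef(fit))  # intercept plus one coefficient per predictor
dropped <- names(coef(fit))[is.na(coef(fit))]  # coefficients lm could not estimate

n_obs             # compare this ...
n_coef            # ... with this
df.residual(fit)  # residual degrees of freedom; 0 means nothing is left to estimate error
dropped           # predictors whose coefficients are undefined
</code></pre>
<p>If <code>n_obs</code> is not comfortably larger than <code>n_coef</code>, or <code>dropped</code> is not empty, treat the fit with suspicion. On the sample data this flags <code>S</code> and <code>Z</code>, matching the <code>NA</code> rows in the summary above.</p>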
