StackOverflow2013


Efficiency of transforming counts to percentages and index scores
<p>I currently have the following code, which produces the results I want (<code>Data_Index</code> and <code>Data_Percentages</code>):</p>
<pre><code>Input_Data <- read.csv("http://dl.dropbox.com/u/881843/RPubsData/gd/2010_pop_estimates.csv", row.names=1, stringsAsFactors = FALSE)
Input_Data <- data.frame(head(Input_Data))

Rows <- nrow(Input_Data)
Vars <- ncol(Input_Data) - 1

#Total population column
TotalCount <- Input_Data[1]
#Total population sum
TotalCountSum <- sum(TotalCount)

Input_Data[1] <- NULL
VarNames <- colnames(Input_Data)

Data_Per_Row <- c()
Data_Index_Row <- c()

for (i in 1:Rows) {
  #Proportion of all areas population found in this row
  OAPer <- TotalCount[i, ] / TotalCountSum * 100

  Data_Per_Col <- c()
  Data_Index_Col <- c()

  for(u in 1:Vars) {
    # For every column value in the selected row
    # the percentage of that value compared to the
    # total population (TotalCount) for that row is calculated
    VarPer <- Input_Data[i, u] / TotalCount[i, ] * 100

    # Once the percentage is calculated the index
    # score is calculated by dividing this percentage
    # by the proportion of the total population in that
    # area compared to all areas
    VarIndex <- VarPer / OAPer * 100

    # Binds results for all columns in the row
    Data_Per_Col <- cbind(Data_Per_Col, VarPer)
    Data_Index_Col <- cbind(Data_Index_Col, VarIndex)
  }

  # Binds results for completed row with previously completed rows
  Data_Per_Row <- rbind(Data_Per_Row, Data_Per_Col)
  Data_Index_Row <- rbind(Data_Index_Row, Data_Index_Col)
}

colnames(Data_Per_Row) <- VarNames
colnames(Data_Index_Row) <- VarNames

# Changes the index scores to range from -1 to 1
OldRange <- (max(Data_Index_Row) - min(Data_Index_Row))
NewRange <- (1 - -1)
Data_Index <- (((Data_Index_Row - min(Data_Index_Row)) * NewRange) / OldRange) + -1

Data_Percentages <- Data_Per_Row

# Final outputs
Data_Index
Data_Percentages
</code></pre>
<p>The problem is that the code is very slow. I want to use it on a dataset with 200,000 rows and 200 columns, which would take around 4 days with the code as it stands. I am sure there must be a way of speeding this up, but I am not sure exactly how.</p>
<p>In this example, the code takes a table of population counts, broken down by age band and by area, and turns it into percentages and index scores. There are currently two nested loops, so every value in every row and column is selected individually and has the calculations performed on it. I assume these loops are what makes it slow. Are there any alternatives that produce the same results more quickly? Thanks for any help you can offer.</p>
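<p>To make the arithmetic concrete, here is a minimal loop-free sketch of the same calculation on a made-up three-area table. The area names, band names and numbers are hypothetical and are not taken from the CSV above. It assumes every column is numeric, and it relies on R recycling a vector of row totals down the columns of a matrix. It is only one possible way to write the calculation, not a benchmarked solution.</p>

```r
# Hypothetical toy table: the first column is the total population per area,
# the remaining columns are counts per age band (made-up numbers).
Input_Data <- data.frame(
  Total    = c(100, 200, 700),
  Age0_15  = c(20, 50, 140),
  Age16_64 = c(60, 100, 420),
  Age65    = c(20, 50, 140),
  row.names = c("A", "B", "C")
)

TotalCount <- Input_Data[[1]]        # numeric vector of row totals
Counts <- as.matrix(Input_Data[-1])  # counts only, as a numeric matrix

# Percentage of each area's population in each age band.
# A vector of length nrow(Counts) is recycled down the columns,
# so every row is divided by its own total.
Data_Percentages <- Counts / TotalCount * 100

# Share of the overall population found in each area
OAPer <- TotalCount / sum(TotalCount) * 100

# Index score before rescaling, again divided row by row
Data_Index_Raw <- Data_Percentages / OAPer * 100

# Rescale the index scores to range from -1 to 1
Rng <- range(Data_Index_Raw)
Data_Index <- (Data_Index_Raw - Rng[1]) * 2 / (Rng[2] - Rng[1]) - 1

Data_Index
Data_Percentages
```

<p>The two outputs keep the names <code>Data_Index</code> and <code>Data_Percentages</code> used in the loop version, so the results can be compared directly on a small subset such as <code>head(Input_Data)</code>.</p>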
