<p><strong>An update, several years later</strong></p> <p>This answer is old, and R has moved on. Tweaking <a href="https://www.rdocumentation.org/packages/utils/topics/read.table" rel="noreferrer"><code>read.table</code></a> to run a bit faster has precious little benefit. Your options are:</p> <ol> <li><p>Using <a href="https://www.rdocumentation.org/packages/utils/topics/fread" rel="noreferrer"><code>fread</code></a> in <a href="https://cran.r-project.org/web/packages/data.table/index.html" rel="noreferrer"><code>data.table</code></a> for importing data from csv/tab-delimited files directly into R. See <a href="https://stackoverflow.com/a/15058684/134830">mnel's answer</a>.</p></li> <li><p>Using <a href="https://www.rdocumentation.org/packages/readr/topics/read_table" rel="noreferrer"><code>read_table</code></a> in <a href="https://cran.r-project.org/web/packages/readr/index.html" rel="noreferrer"><code>readr</code></a> (on CRAN from April 2015). This works much like <code>fread</code> above. The <em>readme</em> in the link explains the difference between the two functions (<code>readr</code> currently claims to be "1.5-2x slower" than <code>data.table::fread</code>).</p></li> <li><p><a href="https://www.rdocumentation.org/packages/iotools/topics/read.csv.raw" rel="noreferrer"><code>read.csv.raw</code></a> from <a href="https://cran.r-project.org/web/packages/iotools/index.html" rel="noreferrer"><code>iotools</code></a> provides a third option for quickly reading CSV files.</p></li> <li><p>Trying to store as much data as you can in databases rather than flat files. (As well as being a better permanent storage medium, data is passed to and from R in a binary format, which is faster.) 
<a href="https://www.rdocumentation.org/packages/sqldf/topics/read.csv.sql" rel="noreferrer"><code>read.csv.sql</code></a> in the <a href="https://cran.r-project.org/web/packages/sqldf/index.html" rel="noreferrer"><code>sqldf</code></a> package, as described in <a href="https://stackoverflow.com/a/1820610/134830">JD Long's answer</a>, imports data into a temporary SQLite database and then reads it into R. See also: the <a href="https://cran.r-project.org/web/packages/RODBC/index.html" rel="noreferrer"><code>RODBC</code></a> package, and the reverse depends section of the <a href="https://cran.r-project.org/web/packages/DBI/index.html" rel="noreferrer"><code>DBI</code> package</a> page. <a href="https://cran.r-project.org/web/packages/MonetDB.R/index.html" rel="noreferrer"><code>MonetDB.R</code></a> gives you a data type that pretends to be a data frame but is really a MonetDB underneath, increasing performance. Import data with its <a href="https://www.rdocumentation.org/packages/MonetDB.R/topics/monetdb.read.csv" rel="noreferrer"><code>monetdb.read.csv</code></a> function. <a href="https://cran.r-project.org/web/packages/dplyr/index.html" rel="noreferrer"><code>dplyr</code></a> allows you to work directly with data stored in several types of database.</p></li> <li><p>Storing data in binary formats can also be useful for improving performance. 
Use <code>saveRDS</code>/<code>readRDS</code> (see below), the <a href="https://cran.rstudio.com/web/packages/h5/index.html" rel="noreferrer"><code>h5</code></a> or <a href="https://www.bioconductor.org/packages/release/bioc/html/rhdf5.html" rel="noreferrer"><code>rhdf5</code></a> packages for HDF5 format, or <code>write_fst</code>/<code>read_fst</code> from the <a href="https://cran.r-project.org/package=fst" rel="noreferrer"><code>fst</code></a> package.</p></li> </ol> <hr> <p><strong>The original answer</strong></p> <p>There are a couple of simple things to try, whether you use <code>read.table</code> or <code>scan</code>.</p> <ol> <li><p>Set <code>nrows</code>=<em>the number of records in your data</em> (<code>nmax</code> in <code>scan</code>).</p></li> <li><p>Make sure that <code>comment.char=""</code> to turn off interpretation of comments.</p></li> <li><p>Explicitly define the classes of each column using <code>colClasses</code> in <code>read.table</code>.</p></li> <li><p>Setting <code>multi.line=FALSE</code> may also improve performance in <code>scan</code>.</p></li> </ol> <p>If none of these things work, then use one of the <a href="https://cran.r-project.org/web/views/HighPerformanceComputing.html" rel="noreferrer">profiling packages</a> to determine which lines are slowing things down. 
Perhaps you can write a cut-down version of <code>read.table</code> based on the results.</p> <p>The other alternative is to filter your data before you read it into R.</p> <p>Or, if the problem is that you have to read it in regularly, then use these methods to read the data in once, then save the data frame as a binary blob with <del><a href="https://www.rdocumentation.org/packages/base/topics/save" rel="noreferrer"><code>save</code></a></del> <a href="https://www.rdocumentation.org/packages/base/topics/saveRDS" rel="noreferrer"><code>saveRDS</code></a>, then next time you can retrieve it faster with <del><a href="https://www.rdocumentation.org/packages/base/topics/load" rel="noreferrer"><code>load</code></a></del> <code>readRDS</code>.</p>
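<p>Putting the recommendations above together, the read-once-then-cache workflow might look like this minimal sketch (the file name <code>mydata.csv</code> and its column classes are placeholders for your own data):</p> <pre><code># Fast CSV import with data.table; supplying colClasses avoids type guessing
library(data.table)
dt &lt;- fread("mydata.csv", colClasses = c("integer", "numeric", "character"))

# Save the parsed data frame as a binary blob
saveRDS(dt, "mydata.rds")

# In later sessions, reading the blob is much faster than re-parsing the CSV
dt &lt;- readRDS("mydata.rds")
</code></pre> <p>Unlike <code>save</code>/<code>load</code>, <code>saveRDS</code>/<code>readRDS</code> stores a single object without its original name, so you can assign the result to whatever variable you like on reload.</p>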
 
