<p>In general, you should try to use a vectorized function to begin with. Using <code>strsplit</code> will frequently require some kind of iteration afterwards (which will be slower), so try to avoid it if possible. In your example, you should use <code>nchar</code> instead:</p>
<pre><code>&gt; nchar(words)
[1] 1 5 5 3
</code></pre>
<p>More generally, take advantage of the fact that <code>strsplit</code> returns a list and use <code>lapply</code>:</p>
<pre><code>&gt; as.numeric(lapply(strsplit(words,""), length))
[1] 1 5 5 3
</code></pre>
<p>Or else use an <code>l*ply</code> family function from <code>plyr</code>. For instance:</p>
<pre><code>&gt; laply(strsplit(words,""), length)
[1] 1 5 5 3
</code></pre>
<p><em>Edit:</em></p>
<p>In honor of <a href="http://en.wikipedia.org/wiki/Bloomsday" rel="noreferrer"><strong>Bloomsday</strong></a>, I decided to test the performance of these approaches using Joyce's Ulysses:</p>
<pre><code>joyce &lt;- readLines("http://www.gutenberg.org/files/4300/4300-8.txt")
joyce &lt;- unlist(strsplit(joyce, " "))
</code></pre>
<p>Now that I have all the words, we can do our counts:</p>
<pre><code>&gt; # original version
&gt; system.time(print(summary(sapply(joyce, function (x) length(strsplit(x,"")[[1]])))))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.000   3.000   4.000   4.666   6.000  69.000
   user  system elapsed
   2.65    0.03    2.73
&gt; # vectorized function
&gt; system.time(print(summary(nchar(joyce))))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.000   3.000   4.000   4.666   6.000  69.000
   user  system elapsed
   0.05    0.00    0.04
&gt; # with lapply
&gt; system.time(print(summary(as.numeric(lapply(strsplit(joyce,""), length)))))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.000   3.000   4.000   4.666   6.000  69.000
   user  system elapsed
    0.8     0.0     0.8
&gt; # with laply (from plyr)
&gt; system.time(print(summary(laply(strsplit(joyce,""), length))))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.000   3.000   4.000   4.666   6.000  69.000
   user  system elapsed
  17.20    0.05   17.30
&gt; # with ldply (from plyr)
&gt; system.time(print(summary(ldply(strsplit(joyce,""), length))))
       V1
 Min.   : 0.000
 1st Qu.: 3.000
 Median : 4.000
 Mean   : 4.666
 3rd Qu.: 6.000
 Max.   :69.000
   user  system elapsed
   7.97    0.00    8.03
</code></pre>
<p>The vectorized function and <code>lapply</code> are considerably faster than the original <code>sapply</code> version. All solutions return the same answer (as seen by the summary output).</p>
<p>Apparently the latest version of <code>plyr</code> is faster (this is using a slightly older version).</p>
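As a small supplementary sketch (not part of the timings above): when the list returned by strsplit is needed anyway, base R's lengths() (R 3.2.0 and later) or vapply() returns the per-element counts as an integer vector without the as.numeric(lapply(...)) round trip. The words vector below is an assumed example chosen to match the output shown earlier; the original vector is not given in the text.

```r
# Assumed sample input; chosen only so the counts match "1 5 5 3" above.
words <- c("a", "quick", "brown", "fox")

# Vectorized character count: no splitting needed.
n_direct <- nchar(words)

# If the split characters are needed anyway, count list elements directly.
chars <- strsplit(words, "")
n_lengths <- lengths(chars)

# vapply is like sapply, but checks that each result is a single integer.
n_vapply <- vapply(chars, length, integer(1))

print(n_direct)
print(n_lengths)
print(n_vapply)
```

All three vectors should agree for plain strings like these; nchar remains the simplest choice when only the counts are required.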
 
