<p>We can rewrite this to be much more compact and drop the function. We'll do it in two steps: first we create a new column that holds a list (data.table columns can hold almost anything, even embedded data.tables), and then we extract these into a new data.table.</p>

<pre><code>url_pattern &lt;- "http[^([:blank:]|\\\"|&lt;|&amp;|#\n\r)]+"
# step 1: store the matches for each post in a list column
db[(has_url), urls := str_match_all(text, url_pattern)]
# step 2: flatten the list column into one row per URL, keyed by post id
urls &lt;- db[(has_url), list(url=unlist(urls)), by=id]
</code></pre>

<p>Note that we use (has_url) instead of has_url == T. This indexes directly on the logical column, which is faster (although in this case most of the time is taken up by str_match_all, so it won't make much difference). Make sure you use the () though, otherwise it won't work.</p>

<p>The second line creates db$urls, a list column holding the URLs found in each post. The third line generates a new data.table with one row per URL, where the id field links each URL back to the forum post it came from.</p>

<p>db has 146k rows, db[(has_url),] has 11k rows, and urls has 30k rows (some posts have several URLs).</p>

<p>Sample output from head(urls):</p>

<pre><code>id  url
14  http://reganmian.net/blog
44  http://vg.no
59  http://koran.co.id
</code></pre>

<p><strong>Update, simple reproducible example</strong></p>

<p>Let's first generate some data:</p>

<pre><code>library(data.table)
library(stringr)  # provides str_match_all

texts &lt;- c("Stian fruit:apple, fruit:banana and fruit:pear", "Peter fruit:apple", "fruit:banana is delicious", "I don't agree")
DT &lt;- data.table(text = texts, id=1:length(texts))
DT
                                             text id
1: Stian fruit:apple, fruit:banana and fruit:pear  1
2:                              Peter fruit:apple  2
3:                      fruit:banana is delicious  3
4:                                  I don't agree  4
</code></pre>

<p>We want to grab all the "fruits" from the text column (each row might have one, several or no fruits). We first use str_match_all to put a list of the individual fruits into a new column.</p>

<pre><code>pattern &lt;- "fruit:\\S*"
DT[, fruit_list := str_match_all(text, pattern)]
</code></pre>

<p>Now the fruit_list column looks like this for the first row (note that \\S* matches any run of non-whitespace, so the trailing comma after "apple" is captured too):</p>

<pre><code>&gt; DT[1]$fruit_list
[[1]]
     [,1]
[1,] "fruit:apple,"
[2,] "fruit:banana"
[3,] "fruit:pear"
</code></pre>

<p>Now we want to extract the fruits into a new table, with one row per fruit, keeping the link back to the id:</p>

<pre><code>fruits &lt;- DT[, list(fruit=unlist(fruit_list)), by=id]
</code></pre>

<p>And the result:</p>

<pre><code>&gt; fruits
   id        fruit
1:  1 fruit:apple,
2:  1 fruit:banana
3:  1   fruit:pear
4:  2  fruit:apple
5:  3 fruit:banana
</code></pre>

<p>(Thank you to Matthew Dowle and Ricardo Saporta on the data.table-help mailing list.)</p>
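<p>As the output above shows, the pattern fruit:\\S* also picks up the trailing comma in "fruit:apple,". The following is a minimal sketch of one possible variation, not part of the original answer. It assumes that a fruit tag consists only of word characters, so fruit:\\w+ stops at punctuation, and it swaps in str_extract_all (also from stringr), which returns plain character vectors rather than matrices. The intermediate list column is skipped here and the extraction is done in a single grouped step.</p>

<pre><code>library(data.table)
library(stringr)

texts &lt;- c("Stian fruit:apple, fruit:banana and fruit:pear",
           "Peter fruit:apple",
           "fruit:banana is delicious",
           "I don't agree")
DT &lt;- data.table(id = seq_along(texts), text = texts)

# Assumption: tags are made of word characters only, so \\w+ excludes the comma
pattern &lt;- "fruit:\\w+"

# One row per match; rows without a match contribute no rows to the result
fruits &lt;- DT[, list(fruit = unlist(str_extract_all(text, pattern))), by = id]
</code></pre>

<p>Whether the list column is worth keeping depends on whether you need the per-post matches again later; if you only want the long table, the single step avoids storing them twice.</p>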