Most efficient way/library to detect predefined keywords in billions of lines?
<p>Let's say I have a few billion lines of text and a few million "keywords". The task is to go through these lines and find which lines contain which keywords. In other words, given a map of <code>(K1 -&gt; V1)</code> and a map of <code>(K2 -&gt; V2)</code>, create a map of <code>(K2 -&gt; K1)</code>, where <code>K1=lineID</code>, <code>V1=text</code>, <code>K2=keywordID</code> and <code>V2=keyword</code>. Note also that:</p>
<ul>
<li>All text and keywords are in English.</li>
<li>Text (V1) may contain spelling mistakes.</li>
<li>Most keywords (V2) are single words, but some consist of more than one English word (e.g. "clean towel").</li>
</ul>
<p>So far, my initial idea for solving this is as follows:</p>
<pre><code>1) Chop up all my keywords into single words and create a large set of single words (K3)
2) Construct a BK-Tree out of these chopped up keywords, using Levenshtein distance
3) For each line of data (V1),
    3.1) Chop up the text (V1) into words
    3.2) For each said word,
        3.2.1) Retrieve words (K3) from the BK-Tree that are close enough to said word
    3.3) Since at this point we still have false positives (e.g. we would have matched "clean" from "clean water" against keyword "clean towel"), we check all possible combinations using a trie of keywords (V2) to filter such false positives out. We construct this trie so that at the end of a successful match, the keywordID (K2) can be retrieved.
    3.4) Return the correct set of keywordID (K2) for this line (V1)!
4) Profit!
</code></pre>
<p><strong>My questions</strong></p>
<ul>
<li>Is this a good approach? Efficiency is very important. Are there any better ways? Anything to improve?</li>
<li>Are there any libraries I could use? Preferably something that works well with Java.</li>
</ul>
<p>Thanks in advance!</p>
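<p>For concreteness, steps 2 and 3.2.1 of the plan above could be sketched in Java roughly as follows. This is only an illustration under assumptions: the class name <code>BkTreeSketch</code>, its <code>add</code> and <code>search</code> methods, and the <code>maxDistance</code> threshold are hypothetical names made up for this sketch, not part of any library, and the Levenshtein routine is the plain dynamic-programming version (a transposition such as "claen" vs. "clean" therefore counts as distance 2).</p>

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: a BK-tree over single keyword words (K3), using Levenshtein distance. */
final class BkTreeSketch {

    /** One tree node; children are keyed by their distance to this node's word. */
    private static final class Node {
        final String word;
        final Map<Integer, Node> children = new HashMap<>();

        Node(String word) {
            this.word = word;
        }
    }

    private Node root;

    /** Plain Levenshtein distance, computed with two rolling rows. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Step 2: insert one chopped-up keyword word into the tree. */
    void add(String word) {
        if (root == null) {
            root = new Node(word);
            return;
        }
        Node node = root;
        while (true) {
            int d = levenshtein(word, node.word);
            if (d == 0) {
                return; // word is already in the tree
            }
            Node child = node.children.get(d);
            if (child == null) {
                node.children.put(d, new Node(word));
                return;
            }
            node = child;
        }
    }

    /** Step 3.2.1: return all stored words within maxDistance of the query word. */
    List<String> search(String query, int maxDistance) {
        List<String> results = new ArrayList<>();
        if (root == null) {
            return results;
        }
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            Node node = stack.pop();
            int d = levenshtein(query, node.word);
            if (d <= maxDistance) {
                results.add(node.word);
            }
            // Triangle inequality: only subtrees keyed in [d - maxDistance, d + maxDistance] can match.
            for (int k = Math.max(1, d - maxDistance); k <= d + maxDistance; k++) {
                Node child = node.children.get(k);
                if (child != null) {
                    stack.push(child);
                }
            }
        }
        return results;
    }
}
```

<p>Usage would be along the lines of adding every word in K3 via <code>add</code> once, then calling <code>search(word, maxDistance)</code> for each word of each line. The candidates returned this way still need the multi-word filtering from step 3.3.</p>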