  1. What is the fastest way to delete the lines in a file which have no match in a second file?
    <p>I have two files, <code>wordlist.txt</code> and <code>text.txt</code>.</p>
    <p>The first file, <code>wordlist.txt</code>, contains a huge list of words in Chinese, Japanese, and Korean, e.g.:</p>
    <pre><code>你
你们
我
</code></pre>
    <p>The second file, <code>text.txt</code>, contains long passages, e.g.:</p>
    <pre><code>你们要去哪里?
卡拉OK好不好?
</code></pre>
    <p>I want to create a new word list (<code>wordsfound.txt</code>), but it should only contain those lines from <code>wordlist.txt</code> which are found at least once within <code>text.txt</code>. The output file from the above should show this:</p>
    <pre><code>你
你们
</code></pre>
    <p>"我" does not appear in this list because it is never found in <code>text.txt</code>.</p>
    <p>I want to find a very fast way to create this list, which only contains lines from the first file that are found in the second.</p>
    <p>I know a simple way in BASH to check each line in <code>wordlist.txt</code> and see if it is in <code>text.txt</code> using <code>grep</code>:</p>
    <pre><code>a=1
while read line
do
    c=`grep -c "$line" text.txt`
    if [ "$c" -ge 1 ]
    then
        echo "$line" &gt;&gt; wordsfound.txt
        echo "Found" $a
    else
        echo "Not found" $a
    fi
    a=`expr $a + 1`
done &lt; wordlist.txt
</code></pre>
    <p>Unfortunately, as <code>wordlist.txt</code> is a very long list, this process takes many hours. There must be a faster solution. Here is one consideration:</p>
    <p>Because the files contain CJK characters, they can be thought of as using a giant alphabet with about 8,000 letters, so nearly every word shares characters with other words. E.g.:</p>
    <pre><code>我
我们
</code></pre>
    <p>Due to this fact, if "我" is never found within <code>text.txt</code>, then it follows that "我们" never appears either. A faster script might check "我" first and, upon finding that it is not present, avoid checking every subsequent word in <code>wordlist.txt</code> that also contains that character. If there are about 8,000 unique characters in <code>wordlist.txt</code>, then the script should not need to check so many lines.</p>
    <p>What is the fastest way to create the list containing only those words in the first file that are also found somewhere within the second?</p>
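One way to avoid starting a new `grep` process for every word is to load `text.txt` into memory once and test each word with a plain substring search. The following is a minimal sketch of that idea using `awk`, not the character-prefilter approach described above. The function name `filter_words` is hypothetical, and the sketch assumes that `text.txt` fits in memory and that both files use the same encoding (e.g. UTF-8).

```shell
# filter_words WORDLIST TEXT   (hypothetical helper name)
# Print each non-empty line of WORDLIST that occurs as a substring
# somewhere in TEXT. Assumes TEXT is small enough to hold in memory.
filter_words() {
    awk '
        # First file given to awk (TEXT): join all passages into one string.
        NR == FNR { text = text "\n" $0; next }
        # Second file (WORDLIST): print words that occur in the joined text.
        length($0) > 0 && index(text, $0) > 0
    ' "$2" "$1"
}
```

It could be called as `filter_words wordlist.txt text.txt > wordsfound.txt`. Each word is still compared against the whole text, so the work grows with the number of words times the text size, but the cost of spawning one process per word disappears.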
    1. Suppose `我` is in `wordlist.txt` but `我们` isn't, and suppose that `我们` appears in `text.txt`. Is that a match for `我`? I.e., are you really matching words, or just arbitrary substrings of Chinese characters, which could be fragments of words?
    2. My goal is to create a new, shortened wordlist.txt that omits the words which do not match, so that later, more complex scripts, which take many hours to run, can do their work much more quickly. The new list is about 5% of the original length. If "我们" is found but "我" is never found in isolation, then ideally the new word list does not show "我"; but if this additional check is very difficult to implement, it is unnecessary.
    3. Not for nothin', Village, but you keep asking "is there a faster way?" The frank and honest truth is no, not really. There's no way faster than brute force to check for a value in an unsorted set, and there never will be. You can add a bunch of specific criteria to make use of binary searches, but the general case will never be faster than brute force. Sorry. Searches are an insanely consumptive process, and tons of research is being done into how to optimize them, but generally they involve ordering the data in some way.