StackOverflow2013

What is the fastest way to delete lines in a file which have no match in a second file?
<p>I have two files, <code>wordlist.txt</code> and <code>text.txt</code>.</p>
<p>The first file, <code>wordlist.txt</code>, contains a huge list of words in Chinese, Japanese, and Korean, one word per line, e.g.:</p>
<pre><code>你
你们
我
</code></pre>
<p>The second file, <code>text.txt</code>, contains long passages, e.g.:</p>
<pre><code>你们要去哪里？卡拉OK好不好？
</code></pre>
<p>I want to create a new word list (<code>wordsfound.txt</code>) that contains only those lines from <code>wordlist.txt</code> which are found at least once within <code>text.txt</code>. The output file for the example above should look like this:</p>
<pre><code>你
你们
</code></pre>
<p>"我" is not in this list because it never occurs in <code>text.txt</code>.</p>
<p>I want to find a very fast way to create this list, which contains only the lines from the first file that are found in the second.</p>
<p>I know a simple way in BASH to check each line in <code>wordlist.txt</code> and see if it is in <code>text.txt</code> using <code>grep</code>:</p>
<pre><code>a=1
while read line
do
    # Count the lines of text.txt that contain the current word.
    c=`grep -c "$line" text.txt`
    if [ "$c" -ge 1 ]
    then
        echo "$line" &gt;&gt; wordsfound.txt
        echo "Found" $a
    else
        echo "Not found" $a
    fi
    a=`expr $a + 1`
done &lt; wordlist.txt
</code></pre>
<p>Unfortunately, as <code>wordlist.txt</code> is a very long list, this process takes many hours. There must be a faster solution. Here is one consideration:</p>
<p>Because the files contain CJK characters, they can be thought of as using a giant alphabet of about 8,000 letters, so nearly every word shares characters with other words. E.g.:</p>
<pre><code>我
我们
</code></pre>
<p>Because of this, if "我" is never found within <code>text.txt</code>, then "我们" cannot appear there either. A faster script might check "我" first and, upon finding that it is not present, skip every subsequent word in <code>wordlist.txt</code> that also contains "我". If there are about 8,000 unique characters in <code>wordlist.txt</code>, then the script should not need to check so many lines.</p>
<p>What is the fastest way to create a list containing only those words from the first file that are also found somewhere within the second?</p>
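
One way the speed-up asked about above might be approached is to avoid starting a new `grep` process for every word. The sketch below is not the character-pruning scheme described in the question; it swaps in a simpler technique: a single `awk` process reads `text.txt` into memory once and then tests each word with `index()`, a literal substring search. The function name `find_words` is hypothetical, and the sketch assumes that `text.txt` fits in memory, that both files use the same encoding, and that each line of `wordlist.txt` is one literal word.

```shell
#!/bin/sh
# Sketch only: print the lines of a word list that occur as a literal
# substring somewhere in a text file, using one awk process in total.
# find_words is a hypothetical helper name, not an existing tool.
find_words() {
    wordlist=$1
    text=$2
    awk '
        # First file argument (the text): collect all lines in one string.
        # The newline separator keeps words from matching across lines,
        # which mirrors how grep searches line by line.
        FILENAME == ARGV[1] { haystack = haystack $0 "\n"; next }
        # Second file argument (the word list): print each non-empty line
        # that is found somewhere in the collected text.
        $0 != "" && index(haystack, $0) > 0
    ' "$text" "$wordlist"
}

# Example usage (assumes both files exist in the current directory):
# find_words wordlist.txt text.txt > wordsfound.txt
```

Each word is still compared against the whole text, so the running time grows with the number of words times the size of the text, but the cost of spawning one process per word disappears. Whether this is fast enough depends on the actual file sizes, which the question does not state.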
