<h1>Word frequency tally script is too slow</h1>
<h2>Background</h2>
<p>I created a script to count the frequency of words in a plain text file. The script performs the following steps:</p>
<ol>
<li>Count the frequency of words from a corpus.</li>
<li>Retain each word in the corpus found in a dictionary.</li>
<li>Create a comma-separated file of the frequencies.</li>
</ol>
<p>The script is at: <a href="http://pastebin.com/VAZdeKXs" rel="nofollow">http://pastebin.com/VAZdeKXs</a></p>
<pre><code>#!/bin/bash

# Create a tally of all the words in the corpus.
#
echo Creating tally of word frequencies...
sed -e 's/ /\n/g' -e 's/[^a-zA-Z\n]//g' corpus.txt | \
  tr [:upper:] [:lower:] | \
  sort | \
  uniq -c | \
  sort -rn &gt; frequency.txt

echo Creating corpus lexicon...
rm -f corpus-lexicon.txt

for i in $(awk '{if( $2 ) print $2}' frequency.txt); do
  grep -m 1 ^$i\$ dictionary.txt &gt;&gt; corpus-lexicon.txt;
done

echo Creating lexicon...
rm -f lexicon.txt

for i in $(cat corpus-lexicon.txt); do
  egrep -m 1 "^[0-9 ]* $i\$" frequency.txt | \
    awk '{print $2, $1}' | \
    tr ' ' ',' &gt;&gt; lexicon.txt;
done
</code></pre>
<h2>Problem</h2>
<p>The following lines continually cycle through the dictionary to match words:</p>
<pre><code>for i in $(awk '{if( $2 ) print $2}' frequency.txt); do
  grep -m 1 ^$i\$ dictionary.txt &gt;&gt; corpus-lexicon.txt;
done
</code></pre>
<p>It works, but it is slow: to remove the words that are not in the dictionary, it scans the dictionary once for every single word found in the corpus. (The <code>-m 1</code> parameter stops each scan as soon as a match is found.)</p>
<h2>Question</h2>
<p>How would you optimize the script so that the dictionary is not scanned from start to finish for every single word? The majority of the words will not be in the dictionary.</p>
<p>Thank you!</p>
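<p>For illustration only, here is a minimal sketch of one way the per-word scan could be avoided. It is not taken from the script above: it swaps the two <code>for</code> loops for a single <code>awk</code> pass that first loads the dictionary into an in-memory array, then filters the tally against it. The function name <code>build_lexicon</code> is hypothetical, and the sketch assumes <code>dictionary.txt</code> holds one lowercase word per line and <code>frequency.txt</code> has the <code>count word</code> layout produced by <code>uniq -c</code>.</p>

```shell
#!/bin/bash

# build_lexicon is a hypothetical helper name, not part of the original script.
# Usage: build_lexicon DICTIONARY FREQUENCY
# Assumes DICTIONARY has one word per line and FREQUENCY has "count word" lines.
build_lexicon() {
  local dictionary=$1 frequency=$2

  # While awk reads the first file (the dictionary), store each word as an
  # array key. While it reads the second file (the tally), print "word,count"
  # for every word that is a known key. Each file is read exactly once.
  awk 'FILENAME == ARGV[1] { if (NF) dict[$1]; next }
       NF >= 2 && ($2 in dict) { print $2 "," $1 }' "$dictionary" "$frequency"
}
```

<p>Under those assumptions it could be called as <code>build_lexicon dictionary.txt frequency.txt &gt; lexicon.txt</code>. The output keeps the order of <code>frequency.txt</code>, so the most frequent words still come first, and the intermediate <code>corpus-lexicon.txt</code> file is no longer needed.</p>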
 
