I assume you can create or find an eml-to-text converter. Then this is fairly close to what you want:

```
find . -type f | parallel --tag 'eml-to-text {} | grep -o -n -b -f /tmp/list_of_interesting_words'
```

The output is not formatted 100% how you want it:

`filename \t line no : byte no (from start of file) : word`

If you have many interesting words, `grep -f` is slow to start up, so if you can create an unpacked version of your maildir, you can make `parallel` start `grep` fewer times:

```
find . -type f | parallel 'eml-to-text {} >/tmp/unpacked/{#}'
find /tmp/unpacked -type f | parallel -X grep -H -o -n -b -f /tmp/list_of_interesting_words
```

Since the time complexity of `grep -f` is worse than linear, you may want to chop up /tmp/list_of_interesting_words into smaller blocks:

```
cat /tmp/list_of_interesting_words | parallel --pipe --block 10k --files > /tmp/blocks_of_words
```

And then process the blocks and the files in parallel:

```
find /tmp/unpacked -type f | parallel -j1 -I ,, parallel --arg-file-sep // -X grep -H -o -n -b -f ,, {} // - :::: /tmp/blocks_of_words
```

This output is formatted like:

`filename : line no : byte no (from start of file) : word`

To have it grouped by word instead of by filename, pipe the result through `sort`:

```
... | sort -k4 -t: > index.by.word
```

To count the frequency of each word:

```
... | sort -k4 -t: | tee index.by.word | awk -F: '{print $4}' | uniq -c
```

The good news is that this should be rather fast, and I doubt you will be able to achieve the same speed using Python.

Edit:

`grep -F` is much faster to start, and you will want `-w` for `grep` (so that the word 'gram' does not match 'diagrams'); this also avoids the temporary files and is probably reasonably fast:

```
find . -type f | parallel --tag 'eml-to-text {} | grep -F -w -o -n -b -f /tmp/list_of_interesting_words' | sort -k3 -t: | tee index.by.word | awk -F: '{print $3}' | uniq -c
```
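If you do not already have an `eml-to-text` converter, below is a minimal sketch of one built on Python's standard `email` module. The script name, the plain-text-first preference, and the fallback over all `text/*` parts are illustrative assumptions, not part of the pipelines above; real mail (HTML-only messages, odd encodings) may need more handling.

```
#!/usr/bin/env python3
# Minimal eml-to-text sketch: print the text content of one .eml file.
# Any converter with this command-line shape will do; this one uses only
# the Python standard library.
import sys
from email import policy
from email.parser import BytesParser

def eml_to_text(path):
    with open(path, 'rb') as fh:
        msg = BytesParser(policy=policy.default).parse(fh)
    # Prefer the plain-text body part if the message has one.
    body = msg.get_body(preferencelist=('plain',))
    if body is not None:
        return body.get_content()
    # Otherwise fall back to concatenating every text/* part.
    return '\n'.join(part.get_content()
                     for part in msg.walk()
                     if part.get_content_maintype() == 'text')

if __name__ == '__main__':
    sys.stdout.write(eml_to_text(sys.argv[1]))
```

Save it as `eml-to-text`, make it executable, and put it on your `$PATH`; the commands above can then call it unchanged.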