<p>Some remarks:</p>
<ul>
<li>We don't want to do something along the lines of "for all emails, perform a regex search and do_something();". I'd expect most emails to be shorter than the list of interesting words, so I'd process each email individually and extract the necessary information.</li>
<li>Build a specialized string data structure (such as a <a href="https://en.wikipedia.org/wiki/Trie" rel="nofollow">string trie</a> or <a href="https://en.wikipedia.org/wiki/Ternary_search_tree" rel="nofollow">ternary search tree</a>) to quickly look up whether a word is interesting or not. I've had good experiences with building a ternary search tree of words, as it allows quick lookups.</li>
<li>The algorithm would then look like this:</li>
</ul>
<p>(in pseudocode of course)</p>
<pre><code>results &lt;- empty list
for each email e:
    for each word w in e:
        if is_interesting_word(w, string_data_structure):
            add (filename, line_number, start_position, word) to results
</code></pre>
<ul>
<li>The problem is now well suited to parallelization with techniques such as <a href="https://en.wikipedia.org/wiki/MapReduce" rel="nofollow">MapReduce</a> (e.g., <a href="https://en.wikipedia.org/wiki/Apache_Hadoop" rel="nofollow">Hadoop</a>). Each email can be processed independently of the others and no information needs to be shared: the string data structure can be built before any email is processed. In the map step, you extract the necessary information from an email, and in the reduce step, you merge the computed values from each email into a single output file.</li>
</ul>
<p>I would reduce the amount of processing that you need: no regex, no advanced parsing; just walk over each character/line in an email and keep track of where you are (line number, position, et cetera). As a final step, profile your code and optimize where it hurts :)</p>
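<p>To make the idea concrete, here is a minimal Python sketch of the approach described above. It is only one possible reading of the pseudocode: the names <code>Trie</code> and <code>scan_email</code> are made up for illustration, a plain dict-based trie stands in for the ternary search tree, a "word" is assumed to be a run of alphanumeric characters, matching is assumed to be case-insensitive, and line numbers are 1-based while start positions are 0-based.</p>

```python
from typing import Dict, Iterable, List, Tuple

Hit = Tuple[str, int, int, str]  # (filename, line_number, start_position, word)


class Trie:
    """Minimal dict-based trie for exact, case-insensitive word lookups."""

    def __init__(self, words: Iterable[str]) -> None:
        self.root: Dict = {}
        for word in words:
            node = self.root
            for ch in word.lower():
                node = node.setdefault(ch, {})
            node[""] = True  # end-of-word marker; "" can never be a character key

    def __contains__(self, word: str) -> bool:
        node = self.root
        for ch in word.lower():
            node = node.get(ch)
            if node is None:
                return False
        return "" in node


def scan_email(filename: str, text: str, interesting: Trie) -> List[Hit]:
    """Walk the text once, character by character, tracking line and column."""
    hits: List[Hit] = []
    line_number = 1
    column = 0
    word_start = 0
    chars: List[str] = []
    for ch in text + "\n":  # the extra newline flushes a trailing word
        if ch.isalnum():
            if not chars:
                word_start = column
            chars.append(ch)
        else:
            if chars:
                word = "".join(chars)
                if word in interesting:
                    hits.append((filename, line_number, word_start, word))
                chars = []
            if ch == "\n":
                line_number += 1
                column = -1  # becomes 0 after the increment below
        column += 1
    return hits


if __name__ == "__main__":
    trie = Trie(["invoice", "overdue"])
    body = "Hello team,\nthe invoice is overdue.\nInvoice attached"
    for hit in scan_email("a.eml", body, trie):
        print(hit)
```

<p>Because <code>scan_email</code> only reads the prebuilt trie and returns its own list, it could serve as the map step; the reduce step would then just concatenate the per-email lists.</p>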
 
