
  1. Identifying top recurring words from a list of e-mails based on a dictionary of interesting words
A directory D contains a few thousand e-mails in the .eml format. Some e-mails are plain text, others come from Outlook, others have an ASCII header and HTML/MIME content, and so on. There is a dictionary file F containing a list of interesting words (i.e. red\nblue\ngreen\n...) to look for in the files underneath the D directory. The D directory has a large number of subfolders but no files other than the above-mentioned .eml files. A list of top recurring words should be made to these specifications:

- For every interesting word, report how many times it occurs and where. If it occurs multiple times within a file, it should be reported multiple times for that file. Reporting an occurrence means reporting a tuple (L, P) of integers, where L is the line number from the top of the e-mail source and P is the position, within that line, of the start of the occurrence.

This builds both an index referring to the individual occurrences and a summary of the most frequently occurring interesting words.

The output should go to a single file. Its format is not strictly defined, provided the information above is included: interesting words, the number of times each interesting word occurs, and where it does -> file/line/start-position.

This is not a homework exercise but actual text analysis I would like to run on a fairly large dataset. The challenge I am having is choosing the right tool for filtering efficiently. An iterative approach (a Cartesian product of words/e-mails/etc.) is too slow, and it would be desirable to combine the filtering for multiple words on each line of each file.

I have experimented with building a regex of alternatives from the list of interesting words (w1|w2|w3|...), compiling it and running it over each line of each e-mail, but it is still slow, especially when I need to check for multiple occurrences within a single line.

Example: e-mail E has a line containing the text:

^ ... blah ... red apples ... blue blueberries ... red, white and blue flag.$\n

The regex correctly reports red(2) and blue(2), but it is slow when using the real, very large dictionary of interesting words.

Another approach I have tried is to dump tokens into a SQLite database as they are parsed, including (column, position) information for each entry, and simply query the output at the end. Batch inserts help a lot, with an appropriate in-memory buffer, but increase complexity.

I have not experimented with data parallelisation yet, as I am not sure tokens/parsing are the right thing to do in the first place. Maybe a tree of letters would be more suitable?

I am interested in solutions in, in order of preference:

- Bash/GNU CLI tools (esp. something parallelisable through GNU 'parallel' for CLI-only execution)
- Python (NLP?)
- C/C++

No Perl, as I don't understand it, unfortunately.
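Sketched in Python, the alternation approach looks roughly like this (a minimal illustration of the reporting format, not the actual tool; `scan_lines` and the sample data are made-up names):

```python
import re
from collections import Counter

def scan_lines(lines, words):
    """Report (word, line_no, col) for every match; line_no is 1-based
    from the top of the message, col is the 0-based start position."""
    # Longest alternatives first so longer dictionary words win over their
    # prefixes; \b keeps "blue" from matching inside "blueberries".
    alt = "|".join(map(re.escape, sorted(words, key=len, reverse=True)))
    pattern = re.compile(r"\b(?:" + alt + r")\b")
    hits, counts = [], Counter()
    for line_no, line in enumerate(lines, start=1):
        for m in pattern.finditer(line):
            hits.append((m.group(0), line_no, m.start()))
            counts[m.group(0)] += 1
    return hits, counts

line = "... blah ... red apples ... blue blueberries ... red, white and blue flag."
hits, counts = scan_lines([line], ["red", "blue", "green"])
print(counts)  # red and blue are each reported twice, as in the example
```

`finditer` already yields every non-overlapping occurrence on a line, so no extra per-line looping is needed; the cost concern is purely the size of the compiled alternation.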
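The "tree of letters" idea is essentially a trie; the standard tool for matching a large fixed dictionary in a single pass is the Aho-Corasick automaton (available in Python via the `pyahocorasick` package). A naive pure-Python trie scanner, just to illustrate the shape of the idea:

```python
def build_trie(words):
    """Build a letter tree: nested dicts, with '$' marking a complete word."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = w
    return root

def scan_line(line, trie):
    """Yield (word, col) for every dictionary word found in the line.
    Note: this matches raw substrings, so with 'blue' in the dictionary,
    'blueberries' would also report a hit; word-boundary checks would
    need to be layered on top, as \\b does in the regex version."""
    for i in range(len(line)):
        node = trie
        j = i
        while j < len(line) and line[j] in node:
            node = node[line[j]]
            j += 1
            if "$" in node:
                yield (node["$"], i)

trie = build_trie(["red", "blue", "green"])
hits = list(scan_line("red apples and blue flag", trie))
```

Unlike the alternation regex, the per-line cost here does not grow with the dictionary size, only with the line length and the depth of the matches.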
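The SQLite variant described above can be kept reasonably simple with an in-memory buffer flushed via `executemany()` in one transaction per batch; the table and function names below are illustrative, not the real schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for a real run
conn.execute("CREATE TABLE hits (word TEXT, file TEXT, line INTEGER, col INTEGER)")

BATCH = 10_000  # flush threshold; tune to taste
buffer = []

def record(word, path, line_no, col):
    buffer.append((word, path, line_no, col))
    if len(buffer) >= BATCH:
        flush()

def flush():
    if buffer:
        with conn:  # one transaction per batch keeps inserts fast
            conn.executemany("INSERT INTO hits VALUES (?, ?, ?, ?)", buffer)
        buffer.clear()

# ... call record() while scanning, then flush() once at the end ...
record("red", "a.eml", 3, 14)
record("blue", "a.eml", 3, 40)
flush()

# Summary query: total count per word, most frequent first.
for word, n in conn.execute(
    "SELECT word, COUNT(*) FROM hits GROUP BY word ORDER BY COUNT(*) DESC"
):
    print(word, n)
```

The index-plus-summary output then falls out of two queries (one listing every row, one grouped), rather than any bookkeeping during the scan.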