<p>Since there are still no answers to my question, I'll write up my own thoughts and accept them. If someone proposes a better solution, I'll happily accept it instead of mine.</p>
<p>I'll go with a co-occurrence matrix, since it is the core part of association mining. In general, Solr provides all the functions needed to build this matrix in some way, though they are not as efficient as direct access with Lucene. To construct the matrix we need:</p>
<ol> <li><strong>All terms</strong>, or at least <strong>the most frequent ones</strong>, because rare terms by their nature won't affect the result of association mining.</li> <li><strong>The documents where these terms occur</strong>, or, again, at least the top documents.</li> </ol>
<p>Both tasks can easily be done with standard Solr components.</p>
<p>To retrieve terms, <a href="http://wiki.apache.org/solr/TermsComponent" rel="nofollow">TermsComponent</a> or <a href="http://www.lucidimagination.com/devzone/technical-articles/faceted-search-solr#faceting_impl" rel="nofollow">faceted search</a> may be used. We can get only the top terms (the default) or all terms (by setting the maximum number of terms to return; see the documentation of the particular feature for details).</p>
<p>Getting the documents that contain a given term is simply a search for that term. The weak point here is that we need one request per term, and there may be thousands of terms. Another weak point is that neither simple nor faceted search provides the number of occurrences of the term in each document found.</p>
<p>With this, it is easy to build the co-occurrence matrix. To mine associations, it is possible to use other software such as <a href="http://www.cs.waikato.ac.nz/ml/weka/" rel="nofollow">Weka</a> or to write your own implementation of, say, the <a href="http://en.wikipedia.org/wiki/Apriori_algorithm" rel="nofollow">Apriori algorithm</a>.</p>
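<p>The matrix-building step might look roughly like the sketch below. It assumes the term-to-document lists have already been fetched from Solr with one search per term, as described above. The function name <code>cooccurrence_counts</code> and the sample data are hypothetical, and co-occurrence is counted once per document, since the searches do not return in-document occurrence counts.</p>

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(term_docs):
    """Count, for every pair of terms, the documents containing both.

    term_docs: dict mapping a term to the set of document ids it occurs in
               (assumed to come from one Solr search per term).
    Returns a Counter keyed by alphabetically sorted (term_a, term_b) pairs.
    """
    # Invert the mapping: document id -> set of terms found in it.
    doc_terms = {}
    for term, docs in term_docs.items():
        for doc_id in docs:
            doc_terms.setdefault(doc_id, set()).add(term)

    # Every pair of terms sharing a document co-occurs once in that document.
    counts = Counter()
    for terms in doc_terms.values():
        for pair in combinations(sorted(terms), 2):
            counts[pair] += 1
    return counts


# Hypothetical sample data: three terms over four documents.
term_docs = {
    "solr": {1, 2, 3},
    "lucene": {2, 3},
    "weka": {3, 4},
}
counts = cooccurrence_counts(term_docs)
for (a, b), n in sorted(counts.items()):
    print(a, b, n)
```

<p>Only the upper triangle of the (symmetric) matrix is stored, keyed by sorted term pairs. The per-document term sets built along the way are also the "transactions" that an Apriori implementation or Weka would take as input.</p>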
 
