<p>EDIT: I may have understood this better now. You want to compare graphs that are represented as strings, and the strings contain "words" which may repeat. You may use Lucene, in which case I second the suggestion to use Solr. Basically, each Solr document will consist of a single field. The field will contain the string, which I suggest you unroll: write <code>C C</code> instead of <code>C:2</code>. If you use a space to separate the words, you can use a <code>WhitespaceAnalyzer</code>. If you use another separator, you may need to write a custom analyzer, which is not so hard to do.</p>
<p>Is this a good idea? I am not sure. Here's why:</p>
<ol>
<li>Lucene (and Solr) do not use cosine similarity as such, but rather <a href="http://lucene.apache.org/java/3_0_0/api/core/org/apache/lucene/search/Similarity.html" rel="nofollow">Lucene Similarity</a>, which mixes cosine, TF/IDF and boolean scoring, with some specific modifications. This works well for most textual use cases, but may differ from what you need.</li>
<li>Do you need to compare hits from different searches? If you do, this is hard with Solr, as it normalizes the scores of every search to a maximal value of 1.</li>
</ol>
<p>I suggest you try Solr on a small sample of your database. If Solr works for you, fine. If not, shingling and min-hashes are probably the way to go. <a href="http://infolab.stanford.edu/~ullman/mmds.html" rel="nofollow">Mining of Massive Datasets by Rajaraman and Ullman</a> is a recent free book about these subjects, and I suggest you read it. It covers searching for similar strings in mountains of data. I guess the differentiator is: do you need a relatively large intersection? If so, use shingling and min-hashes. If not, maybe Solr is enough.</p>
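<p>To make the shingling and min-hash route concrete, here is a minimal sketch in Python. It rests on several assumptions that are not in the question: the input looks like <code>C:2 N O:3</code> (space-separated words with an optional <code>:count</code>), word order is meaningful (sort the words first if it is not), shingles are pairs of adjacent words, and the helper names <code>unroll</code>, <code>shingles</code>, <code>minhash_signature</code> and <code>estimate_jaccard</code> are made up for illustration. It only estimates Jaccard similarity between two shingle sets; it is not what Lucene or Solr compute.</p>

```python
import hashlib
from typing import Iterable, List, Set


def unroll(graph: str) -> List[str]:
    """Expand counted words: 'C:2 N' becomes ['C', 'C', 'N'] (assumed input format)."""
    words: List[str] = []
    for token in graph.split():
        label, _, count = token.partition(":")
        words.extend([label] * (int(count) if count else 1))
    return words


def shingles(words: List[str], k: int = 2) -> Set[str]:
    """Return the set of k-word shingles (runs of k adjacent words)."""
    if len(words) < k:
        # Too short for a full shingle: fall back to the whole string, if any.
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _seeded_hash(seed: int, shingle: str) -> int:
    """One member of a family of 64-bit hash functions, selected by seed."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items: Iterable[str], num_hashes: int = 128) -> List[int]:
    """For each hash function, keep the smallest hash value over the set."""
    items = list(items)
    if not items:
        raise ValueError("cannot build a signature for an empty set")
    return [min(_seeded_hash(seed, s) for s in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """The fraction of matching signature positions estimates Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

<p>Typical use would be to compute <code>minhash_signature(shingles(unroll(s)))</code> once per stored string, then compare signatures with <code>estimate_jaccard</code> instead of comparing the full strings. More hash functions give a tighter estimate at the cost of larger signatures.</p>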
 
