StackOverflow2013

Note that there are some explanatory texts on larger screens.

plurals

POadapting text search for graph/molecule comparison algorithms
text
Body
copied!<p>I'm looking for a text search engine for a non-traditional sort of text search and I want advice on which tool (Lucene, Sphinx, Xapian, or something else) is most appropriate for me, plus pointers on where to get started.</p> <p>I have molecules represented as graphs (atoms and bond). I have a way to <a href="http://dalkescientific.com/writings/diary/archive/2011/01/13/faster_subgraph_enumeration.html" rel="nofollow">enumerate all subgraphs</a> of up to size k. Being technical, the inputs are <a href="http://en.wikipedia.org/wiki/SMILES" rel="nofollow">SMILES</a> and the output is canonical SMARTS and the number of times each subgraph/SMARTS occurs.</p> <p>For example, if the input molecule is "<a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=702&loc=ec_rcs" rel="nofollow">CCO</a>" then the canonical results are {"C": 2, "O": 1, "CC": 1, "OC": 1, "CCO": 1} and if the molecule is "<a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=13737095&loc=ec_rcs" rel="nofollow">SCO</a>" then the canonical results are {"C": 1, "S": 1, "O": 1, "CS": 1, "OC": 1, "SCO": 1}. These are tiny examples. For real molecule I got around 500 "words", which look like "CC(C)O", "CCCOCC", "cn" and "cccc(c)O".</p> <p>Looking at molecules as a collections of characteristic strings plus counts means I should be able to use a text search tool to do comparisons at the text level, with the hopes that they are meaningful at the chemistry level.</p> <p>For examples, I can use <a href="http://en.wikipedia.org/wiki/Cosine_similarity" rel="nofollow">cosine similarity</a> perhaps with <a href="http://en.wikipedia.org/wiki/Tf-idf" rel="nofollow">tf-idf</a> weight and find similar molecules by looking for similar subpatterns. With the "CCO" and "SCO" examples above, the cosine similarity is (2*1+1*1+1*1)/sqrt(2*2+1*1+1*1+1*1+1*1)/sqrt(6*(1*1)) = 4/sqrt(8*6) = 0.58.</p> <p>For another example, if I want to find molecules which contain a "CCS" substructure then I can do a fast inverted index search based on the counts (the molecules must have at least 2 "C"s, at least 1 "CS", and so on) before tackling the NP subgraph isomorphism problem. That is, text-based methods can act as a filter to reject obvious mismatches.</p> <p>I'm trying to figure out the text solutions which exist but it's a bit daunting. I don't need stop words, I don't need stemming, I don't care about word order; I don't need quite a number of the features which exist. I do need the ability to keep word vectors, since it's important to know if "C" appears 2 time or 3.</p> <p>Which text search engine is most appropriate for me? It looks like Lucene, especially with the work in Mahout. Can you recommend which parts of the documentation to look at or relevant tutorials? The ones I've found are meant for full-text searches, with stemming and the other features I don't need.</p>

Querying!

Guidance

An individual column

Larger individual text columns get their own page to allow for proper reading.

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload