Note that there are some explanatory texts on larger screens.

plurals
  1. POFinding the most frequent subtrees in a collection of (parse) trees
    text
    copied!<p>I have a collection of trees whose nodes are labelled (but not uniquely). Specifically the trees are from a collection of parsed sentences (see <a href="http://en.wikipedia.org/wiki/Treebank" rel="nofollow noreferrer">http://en.wikipedia.org/wiki/Treebank</a>). I wish to extract the most common subtrees from the collection - performance is not (yet) an issue. I'd be grateful for algorithms (ideally Java) or pointers to tools which do this for treebanks. Note that order of child nodes is important.</p> <p><strong>EDIT</strong> @mjv. We are working in a limited domain (chemistry) which has a stylised language so the varirty of the trees is not huge - probably similar to children's readers. Simple tree for "the cat sat on the mat".</p> <pre><code>&lt;sentence&gt; &lt;nounPhrase&gt; &lt;article/&gt; &lt;noun/&gt; &lt;/nounPhrase&gt; &lt;verbPhrase&gt; &lt;verb/&gt; &lt;prepositionPhrase&gt; &lt;preposition/&gt; &lt;nounPhrase&gt; &lt;article/&gt; &lt;noun/&gt; &lt;/nounPhrase&gt; &lt;/prepositionPhrase&gt; &lt;/verbPhrase&gt; &lt;/sentence&gt; </code></pre> <p>Here the sentence contains two identical part-of-speech subtrees (the actual tokens "cat". "mat" are not important in matching). So the algorithm would need to detect this. Note that not all nounPhrases are identical - "the big black cat" could be:</p> <pre><code> &lt;nounPhrase&gt; &lt;article/&gt; &lt;adjective/&gt; &lt;adjective/&gt; &lt;noun/&gt; &lt;/nounPhrase&gt; </code></pre> <p>The length of sentences will be longer - between 15 to 30 nodes. I would expect to get useful results from 1000 trees. If this does not take more than a day or so that's acceptable. </p> <p>Obviously the shorter the tree the more frequent, so nounPhrase will be very common.</p> <p><strong>EDIT</strong> If this is to be solved by flattening the tree then I think it would be related to Longest Common Substring, not Longest Common Sequence. But note that I don't necessarily just want the longest - I want a list of all those long enough to be "interesting" (criterion yet to be decided).</p>
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload