<p>It depends on how strict your definition of <em>similar</em> is.</p>

<h2>Machine Learning Techniques</h2>

<p>As <a href="https://stackoverflow.com/a/14225866/1988505">others</a> have pointed out, you can use something like <a href="https://en.wikipedia.org/wiki/Latent_semantic_analysis" rel="nofollow noreferrer">latent semantic analysis</a> or the related <a href="https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation" rel="nofollow noreferrer">latent Dirichlet allocation</a>.</p>

<h2>Semantic Similarity and WordNet</h2>

<p>As was <a href="https://stackoverflow.com/a/14149014/1988505">pointed out</a>, you may wish to use an existing resource for something like this.</p>

<p>Many research papers (<a href="http://www.aclweb.org/anthology-new/J/J06/J06-1003.pdf" rel="nofollow noreferrer">example</a>) use the term <em>semantic similarity</em>. It is usually computed by finding the <a href="https://en.wikipedia.org/wiki/Distance_(graph_theory)" rel="nofollow noreferrer">distance</a> between two words on a graph, where a word is a child of another word if it is a type of that parent. Example: "songbird" would be a child of "bird". Semantic similarity can be used as a distance metric for creating clusters, if you wish.</p>

<h3>Example Implementation</h3>

<p>In addition, if you put a threshold on the value of some semantic similarity measure, you can get a boolean <code>True</code> or <code>False</code>. Here is a Gist I created (<a href="https://gist.github.com/3949778" rel="nofollow noreferrer">word_similarity.py</a>) that uses <a href="http://nltk.org/" rel="nofollow noreferrer">NLTK's</a> corpus reader for <a href="http://nltk.googlecode.com/svn/trunk/doc/howto/wordnet.html" rel="nofollow noreferrer">WordNet</a>. Hopefully that points you in the right direction and gives you a few more search terms.</p>

<pre><code>from nltk.corpus import wordnet as wn


def sim(word1, word2, lch_threshold=2.15, verbose=False):
    """Determine if two (already lemmatized) words are similar or not.

    Call with verbose=True to print the WordNet senses from each word
    that are considered similar.

    The documentation for the NLTK WordNet Interface is available here:
    http://nltk.googlecode.com/svn/trunk/doc/howto/wordnet.html
    """
    results = []
    for net1 in wn.synsets(word1):
        for net2 in wn.synsets(word2):
            try:
                lch = net1.lch_similarity(net2)
            except Exception:
                # LCH is not defined for every pair of senses (for
                # example, senses with different parts of speech).
                continue
            # The value to compare the LCH to was found empirically.
            # (The value is very application dependent. Experiment!)
            if lch is not None and lch &gt;= lch_threshold:
                results.append((net1, net2))
    if not results:
        return False
    if verbose:
        for net1, net2 in results:
            print(net1)
            print(net1.definition())
            print(net2)
            print(net2.definition())
            print('path similarity:')
            print(net1.path_similarity(net2))
            print('lch similarity:')
            print(net1.lch_similarity(net2))
            print('wup similarity:')
            print(net1.wup_similarity(net2))
            print('-' * 79)
    return True
</code></pre>

<p>Example output:</p>

<pre><code>&gt;&gt;&gt; sim('college', 'academy')
True
&gt;&gt;&gt; sim('essay', 'schoolwork')
False
&gt;&gt;&gt; sim('essay', 'schoolwork', lch_threshold=1.5)
True
&gt;&gt;&gt; sim('human', 'man')
True
&gt;&gt;&gt; sim('human', 'car')
False
&gt;&gt;&gt; sim('fare', 'food')
True
&gt;&gt;&gt; sim('fare', 'food', verbose=True)
Synset('fare.n.04')
the food and drink that are regularly served or consumed
Synset('food.n.01')
any substance that can be metabolized by an animal to give energy and build tissue
path similarity:
0.5
lch similarity:
2.94443897917
wup similarity:
0.909090909091
-------------------------------------------------------------------------------
True
&gt;&gt;&gt; sim('bird', 'songbird', verbose=True)
Synset('bird.n.01')
warm-blooded egg-laying vertebrates characterized by feathers and forelimbs modified as wings
Synset('songbird.n.01')
any bird having a musical call
path similarity:
0.25
lch similarity:
2.25129179861
wup similarity:
0.869565217391
-------------------------------------------------------------------------------
True
&gt;&gt;&gt; sim('happen', 'cause', verbose=True)
Synset('happen.v.01')
come to pass
Synset('induce.v.02')
cause to do; cause to act in a specified manner
path similarity:
0.333333333333
lch similarity:
2.15948424935
wup similarity:
0.5
-------------------------------------------------------------------------------
Synset('find.v.01')
come upon, as if by accident; meet with
Synset('induce.v.02')
cause to do; cause to act in a specified manner
path similarity:
0.333333333333
lch similarity:
2.15948424935
wup similarity:
0.5
-------------------------------------------------------------------------------
True
</code></pre>
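<p>To make the graph-distance idea concrete without needing WordNet installed, here is a minimal sketch on a tiny hand-made "is a type of" tree. The taxonomy and the function names (<code>PARENT</code>, <code>ancestors</code>, <code>path_distance</code>, <code>toy_path_similarity</code>, <code>toy_lch_similarity</code>) are made up for illustration and are not part of NLTK. The two formulas, 1 / (1 + distance) for path similarity and -log((distance + 1) / (2 * max_depth)) for Leacock-Chodorow, are my understanding of the usual definitions, so treat the numbers as illustrative only.</p>

```python
import math

# Hypothetical toy taxonomy: each word maps to its parent ("is a type of").
PARENT = {
    "animal": None,
    "bird": "animal",
    "songbird": "bird",
    "sparrow": "songbird",
    "mammal": "animal",
    "dog": "mammal",
}


def ancestors(word):
    """Return the chain [word, parent, grandparent, ..., root]."""
    chain = []
    while word is not None:
        chain.append(word)
        word = PARENT[word]
    return chain


def path_distance(word1, word2):
    """Number of edges on the shortest path between two words in the tree."""
    steps_up_from_2 = {w: i for i, w in enumerate(ancestors(word2))}
    for steps_up_from_1, w in enumerate(ancestors(word1)):
        if w in steps_up_from_2:
            # First shared ancestor walking up from word1 is the lowest one.
            return steps_up_from_1 + steps_up_from_2[w]
    raise ValueError("words do not share an ancestor")


def toy_path_similarity(word1, word2):
    """Assumed definition: 1 / (1 + shortest path length in edges)."""
    return 1.0 / (1 + path_distance(word1, word2))


def toy_lch_similarity(word1, word2):
    """Assumed definition: -log(p / (2 * D)), p = path length in nodes."""
    # D = number of nodes on the longest root-to-leaf chain in the toy tree.
    max_depth = max(len(ancestors(w)) for w in PARENT)
    return -math.log((path_distance(word1, word2) + 1) / (2.0 * max_depth))


def toy_sim(word1, word2, threshold=1.0):
    """Turn the similarity score into True/False with a threshold."""
    return toy_lch_similarity(word1, word2) >= threshold


if __name__ == "__main__":
    print(path_distance("songbird", "bird"))
    print(toy_path_similarity("songbird", "bird"))
    print(toy_sim("songbird", "bird"), toy_sim("sparrow", "dog"))
```

<p>The default threshold of <code>1.0</code> only makes sense for this toy tree. As with <code>lch_threshold</code> above, a useful value depends on the depth of the taxonomy and on your application.</p>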