<p>What you are doing is building a <a href="https://github.com/MLCL/Byblo" rel="nofollow">distributional thesaurus</a>, that is, finding words that are distributionally similar to a query (e.g. Alice), i.e. words that appear in similar contexts. This does not automatically make them synonyms, but it does mean they are in some way similar to the query. The fact that your query is a named entity does not on its own guarantee that the similar words you retrieve will be named entities. However, since <code>Alice</code>, the <code>Hare</code> and the <code>Queen</code> tend to appear in similar contexts because they share some characteristics (e.g. they all speak, walk, cry, etc.; the details of Alice in Wonderland escape me), they are more likely to be retrieved. It turns out that whether a word is capitalised is a very useful piece of information when working out if something is a named entity. If you do not filter out the non-capitalised words, you will see many other neighbours that are not named entities.</p> <p>Have a look at the following papers to get an idea of what people do with distributional semantics:</p> <ul> <li><a href="http://acl.ldc.upenn.edu/P/P98/P98-2127.pdf" rel="nofollow">Lin 1998</a></li> <li><a href="http://books.google.co.uk/books?hl=en&amp;lr=&amp;id=ZAxgQRIgzIcC&amp;oi=fnd&amp;pg=PR9&amp;dq=Explorations%20in%20Automatic%20Thesaurus%20Discovery&amp;ots=TbaZbyoWI5&amp;sig=bgmYT9FFgK5dkeRlcixXhl9fZmg#v=onepage&amp;q=Explorations%20in%20Automatic%20Thesaurus%20Discovery&amp;f=false" rel="nofollow">Grefenstette 1994</a></li> <li><a href="http://dl.acm.org/citation.cfm?id=972724" rel="nofollow">Schuetze 1998</a></li> </ul> <p>To put your idea in the terminology used in these papers, Step 2 is building a context vector for the word from a window of size 1.
Step 3 resembles several well-known similarity measures in distributional semantics (most notably the so-called Jaccard coefficient).</p> <p>As <code>larsmans</code> pointed out, this seems to work so well because you are not doing a proper evaluation. If you ran this against a hand-annotated corpus, you would find that it is very bad at identifying the boundaries of named entities, and it does not even attempt to guess whether they are people, places or organisations. Nevertheless, it is a great first attempt at NLP, keep it up!</p>
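<p>To make the terminology concrete, here is a minimal sketch of a window-1 context representation compared with the Jaccard coefficient. It is not your code: the function names (<code>context_sets</code>, <code>jaccard</code>, <code>neighbours</code>) and the toy sentence are made up for illustration, and it assumes contexts are stored as plain sets of neighbouring tokens rather than weighted vectors.</p>

```python
from collections import defaultdict


def context_sets(tokens, window=1):
    """Map each word to the set of words seen within `window` positions of it."""
    contexts = defaultdict(set)
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                contexts[word].add(tokens[j])
    return contexts


def jaccard(a, b):
    """Jaccard coefficient |a & b| / |a | b|; taken as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def neighbours(query, contexts, capitalised_only=True):
    """Rank the other words by Jaccard similarity of their contexts to the query's."""
    query_ctx = contexts.get(query, set())
    scores = []
    for word, ctx in contexts.items():
        if word == query:
            continue
        # Crude named-entity filter: keep only capitalised candidates.
        if capitalised_only and not word[:1].isupper():
            continue
        scores.append((jaccard(query_ctx, ctx), word))
    return sorted(scores, reverse=True)


# Hypothetical toy corpus, already tokenised.
tokens = "said Alice loudly . said Queen loudly . the cat sat quietly".split()
contexts = context_sets(tokens)

print(neighbours("Alice", contexts))
print(neighbours("Alice", contexts, capitalised_only=False))
```

<p>With the capitalisation filter on, only <code>Queen</code> comes back for <code>Alice</code>; with it off, tokens such as the full stop also score as neighbours, which is the effect described above.</p>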
 
