Indexing and Searching Over Word Level Annotation Layers in Lucene
<p>I have a data set with multiple layers of annotation over the underlying text, such as <a href="http://en.wikipedia.org/wiki/Part-of-speech_tagging" rel="noreferrer">part-of-speech tags</a>, <a href="http://www.cnts.ua.ac.be/conll2000/chunking/" rel="noreferrer">chunks from a shallow parser</a>, <a href="http://en.wikipedia.org/wiki/Named_entity_recognition" rel="noreferrer">named entities</a>, and others from various <a href="http://en.wikipedia.org/wiki/Natural_language_processing" rel="noreferrer">natural language processing</a> (NLP) tools. For a sentence like <code>The man went to the store</code>, the annotations might look like:</p>
<pre>
Word   POS  Chunk  NER
====   ===  =====  ========
The    DT   NP     Person
man    NN   NP     Person
went   VBD  VP     -
to     TO   PP     -
the    DT   NP     Location
store  NN   NP     Location
</pre>
<p>I'd like to index a bunch of documents with annotations like these using Lucene and then perform searches across the different layers. An example of a simple query would be to retrieve all documents where <strong>Washington</strong> is tagged as a <strong>person</strong>. While I'm not absolutely committed to the notation, end-users might enter the query as follows:</p>
<p><strong>Query</strong>: <code>Word=Washington,NER=Person</code> </p>
<p>I'd also like to do more complex queries involving the <strong>sequential order of annotations</strong> across different layers, e.g. find all the documents where there's a word tagged <strong>person</strong> followed by the words <strong><code>arrived at</code></strong> followed by a word tagged <strong>location</strong>. Such a query might look like:</p>
<p><strong>Query</strong>: <code>"NER=Person Word=arrived Word=at NER=Location"</code></p>
<p>What's a good way to approach this with Lucene? Is there any way to index and search over document fields that contain structured tokens?</p>
<p><strong>Payloads</strong></p>
<p>One suggestion was to try to use Lucene <a href="http://lucene.apache.org/java/2_9_2/api/all/org/apache/lucene/search/payloads/package-summary.html" rel="noreferrer">payloads</a>. But I thought payloads could only be used to adjust the rankings of documents, and that they aren't used to select which documents are returned.</p>
<p>The latter is important since, for some use-cases, the <strong>number of documents</strong> that contain a pattern is really what I want.</p>
<p>Also, only the payloads on terms that match the query are examined. This means that <strong>payloads could only help with the rankings of the first example query</strong>, <code>Word=Washington,NER=Person</code>, where we just want to make sure the term <strong><code>Washington</code></strong> is tagged as a <strong><code>Person</code></strong>. However, for the second example query, <code>"NER=Person Word=arrived Word=at NER=Location"</code>, I need to check the tags on unspecified, and thus non-matching, terms.</p>
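<p>To make the intended matching semantics concrete, here is a minimal sketch in plain Python. It is not Lucene code and does not use any Lucene API; it only illustrates what the two example queries should select. The helper names (<code>make_doc</code>, <code>parse_query</code>, <code>doc_matches</code>, <code>count_matches</code>) and the two extra sample documents are hypothetical, and I am assuming that comma-joined constraints apply to the same token, that space-separated steps must match consecutive tokens, and that comparison is case-insensitive.</p>

```python
from typing import Dict, List

# One token maps layer name -> value, e.g. {"Word": "man", "NER": "Person"}.
Token = Dict[str, str]


def make_doc(words: List[str], ner: List[str]) -> List[Token]:
    """Build a document as a list of tokens carrying a Word and an NER layer."""
    return [{"Word": w, "NER": n} for w, n in zip(words, ner)]


def parse_query(query: str) -> List[Dict[str, str]]:
    """Turn 'NER=Person Word=arrived' into one constraint dict per position.

    Assumption: 'Word=Washington,NER=Person' (comma-joined) constrains a
    single token on several layers at once.
    """
    steps = []
    for part in query.strip().strip('"').split():
        constraints = {}
        for pair in part.split(","):
            layer, value = pair.split("=", 1)
            constraints[layer] = value
        steps.append(constraints)
    return steps


def token_matches(token: Token, constraints: Dict[str, str]) -> bool:
    """A token matches if every constrained layer has the requested value."""
    return all(
        token.get(layer, "").lower() == value.lower()
        for layer, value in constraints.items()
    )


def doc_matches(tokens: List[Token], query: str) -> bool:
    """True if the query steps match some run of consecutive tokens."""
    steps = parse_query(query)
    n = len(steps)
    return any(
        all(token_matches(tokens[start + i], steps[i]) for i in range(n))
        for start in range(len(tokens) - n + 1)
    )


def count_matches(docs: List[List[Token]], query: str) -> int:
    """Number of documents containing the pattern (selection, not ranking)."""
    return sum(1 for tokens in docs if doc_matches(tokens, query))


# The sentence from the question, plus two made-up documents for illustration.
source_sentence = make_doc(
    ["The", "man", "went", "to", "the", "store"],
    ["Person", "Person", "-", "-", "Location", "Location"],
)
doc_a = make_doc(
    ["Washington", "arrived", "at", "Boston"],
    ["Person", "-", "-", "Location"],
)
doc_b = make_doc(
    ["He", "moved", "to", "Washington"],
    ["-", "-", "-", "Location"],
)
docs = [source_sentence, doc_a, doc_b]

simple_hits = count_matches(docs, "Word=Washington,NER=Person")
sequence_hits = count_matches(
    docs, '"NER=Person Word=arrived Word=at NER=Location"'
)
```

<p>The second query has to inspect the NER layer on tokens whose words are not named in the query (here <code>Washington</code> and <code>Boston</code>), which is the part I don't see how to express with payloads alone.</p>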
 
