StackOverflow2013

plurals

POIndexing and Searching Over Word Level Annotation Layers in Lucene
primarykey
Id
2883012
data
AcceptedAnswerId
0
AnswerCount
3
ClosedDate
CommentCount
1
CommunityOwnedDate
CreationDate
2010-05-21T14:37:32.023
FavoriteCount
5
LastActivityDate
2011-05-08T04:49:17.023
LastEditDate
2010-05-22T06:49:47.757
LastEditorUserId
86542
OwnerUserId
86542
ParentId
0
PostTypeId
1
Score
8
ViewCount
1148
LastEditorDisplayName
text
Body
I have a data set with multiple layers of annotation over the underlying text, such as <a href="http://en.wikipedia.org/wiki/Part-of-speech_tagging" rel="noreferrer">part-of-speech tags</a>, <a href="http://www.cnts.ua.ac.be/conll2000/chunking/" rel="noreferrer">chunks from a shallow parser</a>, <a href="http://en.wikipedia.org/wiki/Named_entity_recognition" rel="noreferrer">named entities</a>, and others from various <a href="http://en.wikipedia.org/wiki/Natural_language_processing" rel="noreferrer">natural language processing</a> (NLP) tools. For a sentence like <code>The man went to the store</code>, the annotations might look like:
<pre>
Word   POS  Chunk  NER
====   ===  =====  ========
The    DT   NP     Person
man    NN   NP     Person
went   VBD  VP     -
to     TO   PP     -
the    DT   NP     Location
store  NN   NP     Location
</pre>
I'd like to index a bunch of documents with annotations like these using Lucene and then perform searches across the different layers. An example of a simple query would be to retrieve all documents where Washington is tagged as a person. While I'm not absolutely committed to the notation, syntactically end-users might enter the query as follows:
Query: <code>Word=Washington,NER=Person</code>
I'd also like to do more complex queries involving the sequential order of annotations across different layers, e.g. find all the documents where there's a word tagged Person, followed by the words <code>arrived at</code>, followed by a word tagged Location. Such a query might look like:
Query: <code>"NER=Person Word=arrived Word=at NER=Location"</code>
What's a good way to approach this with Lucene? Is there any way to index and search over document fields that contain structured tokens?
Payloads
One suggestion was to try Lucene <a href="http://lucene.apache.org/java/2_9_2/api/all/org/apache/lucene/search/payloads/package-summary.html" rel="noreferrer">payloads</a>. However, my understanding is that payloads can only be used to adjust the ranking of documents, not to select which documents are returned. The latter is important since, for some use cases, the number of documents that contain a pattern is really what I want.
Also, only the payloads on terms that match the query are examined. This means payloads could at best help with ranking for the first example query, <code>Word=Washington,NER=Person</code>, where we just want to make sure the term <code>Washington</code> is tagged as a <code>Person</code>. For the second example query, <code>"NER=Person Word=arrived Word=at NER=Location"</code>, I need to check the tags on unspecified, and thus non-matching, terms.
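To make the intended semantics of the two example queries concrete, here is a minimal plain-Java sketch. It does not use Lucene and is not a proposed indexing scheme; it only scans one annotated sentence in memory. The class name <code>LayeredQuerySketch</code> and its methods are hypothetical, and it assumes a whitespace-separated query is a sequence of consecutive tokens while a comma joins constraints on the same token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the query semantics only; none of this is Lucene API.
final class LayeredQuerySketch {

    /** Builds one token as a map from layer name to value. */
    static Map<String, String> token(String word, String pos, String chunk, String ner) {
        return Map.of("Word", word, "POS", pos, "Chunk", chunk, "NER", ner);
    }

    /** True if the token satisfies every comma-separated Layer=Value constraint in the step. */
    static boolean tokenMatches(Map<String, String> token, String step) {
        for (String constraint : step.split(",")) {
            String[] kv = constraint.split("=", 2);
            if (kv.length != 2) {
                throw new IllegalArgumentException("Expected Layer=Value but got: " + constraint);
            }
            if (!kv[1].equals(token.get(kv[0]))) {
                return false;
            }
        }
        return true;
    }

    /** Returns every start offset at which the query steps match consecutive tokens. */
    static List<Integer> findMatches(List<Map<String, String>> tokens, String query) {
        String[] steps = query.trim().split("\\s+");
        List<Integer> starts = new ArrayList<>();
        for (int start = 0; start + steps.length <= tokens.size(); start++) {
            boolean ok = true;
            for (int i = 0; i < steps.length && ok; i++) {
                ok = tokenMatches(tokens.get(start + i), steps[i]);
            }
            if (ok) {
                starts.add(start);
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        // The annotated sentence from the table in the question.
        List<Map<String, String>> sentence = List.of(
            token("The", "DT", "NP", "Person"),
            token("man", "NN", "NP", "Person"),
            token("went", "VBD", "VP", "-"),
            token("to", "TO", "PP", "-"),
            token("the", "DT", "NP", "Location"),
            token("store", "NN", "NP", "Location"));

        // Same-token conjunction, in the style of Word=Washington,NER=Person.
        System.out.println(findMatches(sentence, "Word=man,NER=Person"));
        // Cross-layer sequence, in the style of the second example query.
        System.out.println(findMatches(sentence, "NER=Person Word=went Word=to NER=Location"));
    }
}
```

Both calls in <code>main</code> match at token offset 1 (the token <code>man</code>). The second query shows the hard case: its first and last steps constrain only the NER layer, so any word may occupy those positions.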
Tags
<java><lucene><nlp><data-mining><text-mining>
Title
Indexing and Searching Over Word Level Annotation Layers in Lucene
singulars
PostAcceptedAnswerId
1. This table or related slice is empty.
PostParentId
1. This table or related slice is empty.
PostTypePostTypeId
1. PTQuestion
UserLastEditorUserId
1. USdmcer
UserOwnerUserId
1. USdmcer
plurals
PostLinksPostIdRelatedPostId
1. This table or related slice is empty.
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. This table or related slice is empty.
PostsParentIdCreationDate
1. PO
 singulars
 PostTypePostTypeId
 PTAnswer
2. PO
 singulars
 PostTypePostTypeId
 PTAnswer
3. PO
 singulars
 PostTypePostTypeId
 PTAnswer
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 POIndexing and Searching Over Word Level Annotation Layers in Lucene
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
2. VO
 singulars
 PostPostId
 POIndexing and Searching Over Word Level Annotation Layers in Lucene
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
3. VO
 singulars
 PostPostId
 POIndexing and Searching Over Word Level Annotation Layers in Lucene
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
CommentsPostId
1. COdid you find a satisfying solution?
 singulars
 PostPostId
 POIndexing and Searching Over Word Level Annotation Layers in Lucene
 UserUserId
 USenguerran
