<p>You have various problems with what you pasted:</p>

<p><strong>1) Incorrect mapping</strong></p>

<p>When creating the index, you specify:</p>

<pre><code>"mappings": {
    "files": {
</code></pre>

<p>But your type is actually <code>file</code>, not <code>files</code>. If you checked the mapping, you would see that immediately:</p>

<pre><code>curl -XGET 'http://127.0.0.1:9200/files/_mapping?pretty=1'

# {
#    "files" : {
#       "files" : {
#          "properties" : {
#             "filename" : {
#                "type" : "string",
#                "analyzer" : "filename_analyzer"
#             }
#          }
#       },
#       "file" : {
#          "properties" : {
#             "filename" : {
#                "type" : "string"
#             }
#          }
#       }
#    }
# }
</code></pre>

<p><strong>2) Incorrect analyzer definition</strong></p>

<p>You have specified the <code>lowercase</code> tokenizer, but that removes anything that isn't a letter (see the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lowercase-tokenizer.html">docs</a>), so your numbers are being removed completely.</p>

<p>You can check this with the <a href="http://www.elasticsearch.org/guide/reference/api/admin-indices-analyze.html">analyze API</a>:</p>

<pre><code>curl -XGET 'http://127.0.0.1:9200/_analyze?pretty=1&amp;text=My_file_2012.01.13.doc&amp;tokenizer=lowercase'

# {
#    "tokens" : [
#       {
#          "end_offset" : 2,
#          "position" : 1,
#          "start_offset" : 0,
#          "type" : "word",
#          "token" : "my"
#       },
#       {
#          "end_offset" : 7,
#          "position" : 2,
#          "start_offset" : 3,
#          "type" : "word",
#          "token" : "file"
#       },
#       {
#          "end_offset" : 22,
#          "position" : 3,
#          "start_offset" : 19,
#          "type" : "word",
#          "token" : "doc"
#       }
#    ]
# }
</code></pre>

<p><strong>3) Ngrams on search</strong></p>

<p>You include your ngram token filter in both the index analyzer and the search analyzer. That's fine for the index analyzer, because you want the ngrams to be indexed. But when you search, you want to search on the full string, not on each ngram.</p>

<p>For instance, if you index <code>"abcd"</code> with ngrams of length 1 to 4, you will end up with these tokens:</p>

<pre><code>a b c d ab bc cd abc bcd abcd
</code></pre>

<p>But if you search on <code>"dcba"</code> (which shouldn't match) and you also analyze your search terms with ngrams, then you are actually searching on:</p>

<pre><code>d c b a dc cb ba dcb cba dcba
</code></pre>

<p>So <code>a</code>, <code>b</code>, <code>c</code> and <code>d</code> will match!</p>

<p><strong>Solution</strong></p>

<p>First, you need to choose the right analyzer. Your users will probably search for words, numbers or dates, but they probably won't expect <code>ile</code> to match <code>file</code>. Instead, it will probably be more useful to use <a href="http://www.elasticsearch.org/guide/reference/index-modules/analysis/edgengram-tokenizer.html">edge ngrams</a>, which anchor the ngram to the start (or end) of each word.</p>

<p>Also, why exclude <code>docx</code> etc? Surely a user may well want to search on the file type?</p>

<p>So let's break up each filename into smaller tokens by removing anything that isn't a letter or a number (using the <a href="http://www.elasticsearch.org/guide/reference/index-modules/analysis/pattern-tokenizer.html">pattern tokenizer</a>):</p>

<pre><code>My_first_file_2012.01.13.doc
=&gt; my first file 2012 01 13 doc
</code></pre>

<p>Then for the index analyzer, we'll also use edge ngrams on each of those tokens:</p>

<pre><code>my     =&gt;  m my
first  =&gt;  f fi fir firs first
file   =&gt;  f fi fil file
2012   =&gt;  2 20 201 2012
01     =&gt;  0 01
13     =&gt;  1 13
doc    =&gt;  d do doc
</code></pre>

<p>We create the index as follows:</p>

<pre><code>curl -XPUT 'http://127.0.0.1:9200/files/?pretty=1' -d '
{
   "settings" : {
      "analysis" : {
         "analyzer" : {
            "filename_search" : {
               "tokenizer" : "filename",
               "filter" : ["lowercase"]
            },
            "filename_index" : {
               "tokenizer" : "filename",
               "filter" : ["lowercase","edge_ngram"]
            }
         },
         "tokenizer" : {
            "filename" : {
               "pattern" : "[^\\p{L}\\d]+",
               "type" : "pattern"
            }
         },
         "filter" : {
            "edge_ngram" : {
               "side" : "front",
               "max_gram" : 20,
               "min_gram" : 1,
               "type" : "edgeNGram"
            }
         }
      }
   },
   "mappings" : {
      "file" : {
         "properties" : {
            "filename" : {
               "type" : "string",
               "search_analyzer" : "filename_search",
               "index_analyzer" : "filename_index"
            }
         }
      }
   }
}
'
</code></pre>

<p>Now test that our analyzers are working correctly:</p>

<p><strong>filename_search:</strong></p>

<pre><code>curl -XGET 'http://127.0.0.1:9200/files/_analyze?pretty=1&amp;text=My_first_file_2012.01.13.doc&amp;analyzer=filename_search'

# [results snipped]
"token" : "my"
"token" : "first"
"token" : "file"
"token" : "2012"
"token" : "01"
"token" : "13"
"token" : "doc"
</code></pre>

<p><strong>filename_index:</strong></p>

<pre><code>curl -XGET 'http://127.0.0.1:9200/files/_analyze?pretty=1&amp;text=My_first_file_2012.01.13.doc&amp;analyzer=filename_index'

"token" : "m"
"token" : "my"
"token" : "f"
"token" : "fi"
"token" : "fir"
"token" : "firs"
"token" : "first"
"token" : "f"
"token" : "fi"
"token" : "fil"
"token" : "file"
"token" : "2"
"token" : "20"
"token" : "201"
"token" : "2012"
"token" : "0"
"token" : "01"
"token" : "1"
"token" : "13"
"token" : "d"
"token" : "do"
"token" : "doc"
</code></pre>

<p>OK, the analyzers seem to be working correctly. So let's add some docs:</p>

<pre><code>curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "My_first_file_created_at_2012.01.13.doc" }'
curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "My_second_file_created_at_2012.01.13.pdf" }'
curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "Another file.txt" }'
curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "And_again_another_file.docx" }'
curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "foo.bar.txt" }'
curl -X POST "http://localhost:9200/files/_refresh"
</code></pre>

<p>And try a search:</p>

<pre><code>curl -XGET 'http://127.0.0.1:9200/files/file/_search?pretty=1' -d '
{
   "query" : {
      "text" : {
         "filename" : "2012.01"
      }
   }
}
'

# {
#    "hits" : {
#       "hits" : [
#          {
#             "_source" : {
#                "filename" : "My_second_file_created_at_2012.01.13.pdf"
#             },
#             "_score" : 0.06780553,
#             "_index" : "files",
#             "_id" : "PsDvfFCkT4yvJnlguxJrrQ",
#             "_type" : "file"
#          },
#          {
#             "_source" : {
#                "filename" : "My_first_file_created_at_2012.01.13.doc"
#             },
#             "_score" : 0.06780553,
#             "_index" : "files",
#             "_id" : "ER5RmyhATg-Eu92XNGRu-w",
#             "_type" : "file"
#          }
#       ],
#       "max_score" : 0.06780553,
#       "total" : 2
#    },
#    "timed_out" : false,
#    "_shards" : {
#       "failed" : 0,
#       "successful" : 5,
#       "total" : 5
#    },
#    "took" : 4
# }
</code></pre>

<p>Success!</p>

<p><strong>UPDATE</strong></p>

<p>I realised that a search for <code>2012.01</code> would match both <code>2012.01.12</code> and <code>2012.12.01</code>, so I tried changing the query to use a <a href="http://www.elasticsearch.org/guide/reference/query-dsl/text-query.html">text phrase</a> query instead. However, this didn't work. It turns out that the edge ngram filter increments the position count for each ngram (while I would have thought that the position of each ngram would be the same as for the start of the word).</p>

<p>The issue mentioned in point (3) above is only a problem when using a <code>query_string</code>, <code>field</code>, or <code>text</code> query, which tries to match ANY token. However, a <code>text_phrase</code> query tries to match ALL of the tokens, and in the correct order.</p>

<p>To demonstrate the issue, index another doc with a different date:</p>

<pre><code>curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "My_third_file_created_at_2012.12.01.doc" }'
curl -X POST "http://localhost:9200/files/_refresh"
</code></pre>

<p>And run the same search as above:</p>

<pre><code>curl -XGET 'http://127.0.0.1:9200/files/file/_search?pretty=1' -d '
{
   "query" : {
      "text" : {
         "filename" : {
            "query" : "2012.01"
         }
      }
   }
}
'

# {
#    "hits" : {
#       "hits" : [
#          {
#             "_source" : {
#                "filename" : "My_third_file_created_at_2012.12.01.doc"
#             },
#             "_score" : 0.22097087,
#             "_index" : "files",
#             "_id" : "xmC51lIhTnWplOHADWJzaQ",
#             "_type" : "file"
#          },
#          {
#             "_source" : {
#                "filename" : "My_first_file_created_at_2012.01.13.doc"
#             },
#             "_score" : 0.13137488,
#             "_index" : "files",
#             "_id" : "ZUezxDgQTsuAaCTVL9IJgg",
#             "_type" : "file"
#          },
#          {
#             "_source" : {
#                "filename" : "My_second_file_created_at_2012.01.13.pdf"
#             },
#             "_score" : 0.13137488,
#             "_index" : "files",
#             "_id" : "XwLNnSlwSeyYtA2y64WuVw",
#             "_type" : "file"
#          }
#       ],
#       "max_score" : 0.22097087,
#       "total" : 3
#    },
#    "timed_out" : false,
#    "_shards" : {
#       "failed" : 0,
#       "successful" : 5,
#       "total" : 5
#    },
#    "took" : 5
# }
</code></pre>

<p>The first result has the date <code>2012.12.01</code>, which isn't the best match for <code>2012.01</code>. So to match only that exact phrase, we can do:</p>

<pre><code>curl -XGET 'http://127.0.0.1:9200/files/file/_search?pretty=1' -d '
{
   "query" : {
      "text_phrase" : {
         "filename" : {
            "query" : "2012.01",
            "analyzer" : "filename_index"
         }
      }
   }
}
'

# {
#    "hits" : {
#       "hits" : [
#          {
#             "_source" : {
#                "filename" : "My_first_file_created_at_2012.01.13.doc"
#             },
#             "_score" : 0.55737644,
#             "_index" : "files",
#             "_id" : "ZUezxDgQTsuAaCTVL9IJgg",
#             "_type" : "file"
#          },
#          {
#             "_source" : {
#                "filename" : "My_second_file_created_at_2012.01.13.pdf"
#             },
#             "_score" : 0.55737644,
#             "_index" : "files",
#             "_id" : "XwLNnSlwSeyYtA2y64WuVw",
#             "_type" : "file"
#          }
#       ],
#       "max_score" : 0.55737644,
#       "total" : 2
#    },
#    "timed_out" : false,
#    "_shards" : {
#       "failed" : 0,
#       "successful" : 5,
#       "total" : 5
#    },
#    "took" : 7
# }
</code></pre>

<p>Or, if you still want to match all 3 files (because the user might remember some of the words in the filename, but in the wrong order), you can run both queries but increase the importance of the filename which is in the correct order:</p>

<pre><code>curl -XGET 'http://127.0.0.1:9200/files/file/_search?pretty=1' -d '
{
   "query" : {
      "bool" : {
         "should" : [
            {
               "text_phrase" : {
                  "filename" : {
                     "boost" : 2,
                     "query" : "2012.01",
                     "analyzer" : "filename_index"
                  }
               }
            },
            {
               "text" : {
                  "filename" : "2012.01"
               }
            }
         ]
      }
   }
}
'

# [Fri Feb 24 16:31:02 2012] Response:
# {
#    "hits" : {
#       "hits" : [
#          {
#             "_source" : {
#                "filename" : "My_first_file_created_at_2012.01.13.doc"
#             },
#             "_score" : 0.56892186,
#             "_index" : "files",
#             "_id" : "ZUezxDgQTsuAaCTVL9IJgg",
#             "_type" : "file"
#          },
#          {
#             "_source" : {
#                "filename" : "My_second_file_created_at_2012.01.13.pdf"
#             },
#             "_score" : 0.56892186,
#             "_index" : "files",
#             "_id" : "XwLNnSlwSeyYtA2y64WuVw",
#             "_type" : "file"
#          },
#          {
#             "_source" : {
#                "filename" : "My_third_file_created_at_2012.12.01.doc"
#             },
#             "_score" : 0.012931341,
#             "_index" : "files",
#             "_id" : "xmC51lIhTnWplOHADWJzaQ",
#             "_type" : "file"
#          }
#       ],
#       "max_score" : 0.56892186,
#       "total" : 3
#    },
#    "timed_out" : false,
#    "_shards" : {
#       "failed" : 0,
#       "successful" : 5,
#       "total" : 5
#    },
#    "took" : 4
# }
</code></pre>
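<p>The false-match problem from point (3) above can be reproduced without a cluster. This is a plain-Python sketch of the same ngram logic (not Elasticsearch code): it generates the tokens the index holds for <code>abcd</code> and the tokens a ngram-analyzing search analyzer would produce for <code>dcba</code>, and shows their overlap.</p>

```python
def ngrams(token, min_gram=1, max_gram=4):
    # All substrings of length min_gram..max_gram (plain ngrams, not edge ngrams).
    return {token[i:i + n]
            for n in range(min_gram, max_gram + 1)
            for i in range(len(token) - n + 1)}

indexed = ngrams("abcd")  # tokens stored in the index for "abcd"
search = ngrams("dcba")   # tokens produced if the SEARCH analyzer also applies ngrams

# The single-letter ngrams overlap, so a search for "dcba" matches "abcd":
print(sorted(indexed & search))  # → ['a', 'b', 'c', 'd']
```

<p>With ngrams only in the index analyzer, the search side would produce the single token <code>dcba</code>, which is not in the index, so no match.</p>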
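<p>If you don't have a cluster handy, the behaviour of the two analyzers defined above can also be approximated in plain Python. This is a sketch, not Elasticsearch itself: Python's <code>re</code> module has no <code>\p{L}</code>, so <code>[\W_]+</code> stands in for the <code>[^\p{L}\d]+</code> pattern (equivalent for ASCII filenames only).</p>

```python
import re

def pattern_tokenize(text):
    # Split on anything that isn't a letter or digit, like the "filename"
    # pattern tokenizer above ([\W_]+ approximates [^\p{L}\d]+ for ASCII).
    return [t for t in re.split(r"[\W_]+", text) if t]

def edge_ngrams(token, min_gram=1, max_gram=20):
    # Front-anchored ngrams, like the edge_ngram filter with "side" : "front".
    return [token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)]

def filename_search(text):
    # search analyzer: pattern tokenizer + lowercase
    return [t.lower() for t in pattern_tokenize(text)]

def filename_index(text):
    # index analyzer: pattern tokenizer + lowercase + edge ngrams
    return [g for t in filename_search(text) for g in edge_ngrams(t)]

print(filename_search("My_first_file_2012.01.13.doc"))
# → ['my', 'first', 'file', '2012', '01', '13', 'doc']
print(filename_index("My_2012.doc"))
# → ['m', 'my', '2', '20', '201', '2012', 'd', 'do', 'doc']
```

<p>A query term is analyzed with <code>filename_search</code> and matches if its whole tokens appear among the <code>filename_index</code> ngrams, which is why a prefix like <code>fir</code> finds <code>first</code> but <code>ile</code> does not find <code>file</code>.</p>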