StackOverflow2013


plurals

POExclude duplicate results from Solr query based on highlight snippets?
primarykey
Id
7418778
data
AcceptedAnswerId
7448330
AnswerCount
2
ClosedDate
CommentCount
0
CommunityOwnedDate
CreationDate
2011-09-14T15:17:41.493
FavoriteCount
0
LastActivityDate
2011-09-16T17:27:47.973
LastEditDate
LastEditorUserId
0
OwnerUserId
550199
ParentId
0
PostTypeId
1
Score
1
ViewCount
932
LastEditorDisplayName
text
Body
The scene: I have indexed many websites using Nutch and Solr, and I've implemented result grouping by site. My results output includes the page title, highlight snippets and URL. My issue is with the page navigation, copyright and company info text that appears on many company sites. A query for "solder", for example, may return 200+ results for a particular site, but only a handful of them are actually relevant; perhaps the company's site structure includes "solder" on every page as part of its core business description, site navigation, etc. The relevant results are there, but they're flooded by irrelevant, repetitive matches from the other pages on the site.
The problem: I've seen other postings asking how to prevent Nutch and Solr from indexing site headers, footers, navigation and the like, but with such a diverse group of sites that approach just isn't feasible. What I'm observing, however, is that although the content of each result is significantly different, the highlighted snippets returned are 90-100% identical for the results I don't want.
Observe:
<pre><code>Products | Alloy Information || --------
-Free Solutions Halogen-Free Products Sales Contacts Technical Articles Industry Links Terms & Conditions Products Support Site Map Lead-Free Solutions Halogen-Free Products Sales Contacts Technical Articles Industry
http://www.--------.com/Products/AlloyInformation.aspx

Products | Chemicals & Cleaners || --------
-Free Solutions Halogen-Free Products Sales Contacts Technical Articles Industry Links Terms & Conditions Products Industrial Division Products Services News Support Site Map Lead-Free Solutions Halogen-Free Products Sales
http://www.--------.com/Products/ChemicalsCleaners.aspx

Products | Rosin Based || --------
-Free Solutions Halogen-Free Products Sales Contacts Technical Articles Industry Links Terms & Conditions Products Products Services News Support Site Map Lead-Free Solutions Halogen-Free Products Sales Contacts Technical
http://www.--------.com/Products/RosinBased.aspx

Support | Engineering Guide || --------
-Free Solutions Halogen-Free Products Sales Contacts Technical Articles Industry Links Terms & Conditions Support Products Services News Support Site Map Lead-Free Solutions Halogen-Free Products Sales Contacts Technical
http://www.--------.com/Support/EngineeringGuide.aspx
</code></pre>
The Big Idea: This leads me to ask whether I can filter or group results based on the highlighted snippets that are returned. I can't just group on the content because 1) the field is huge, and 2) the content is very different from page to page. If I could group, exclude or deduplicate results whose snippets were more than 85% identical, that would probably solve the problem. Perhaps some sort of post-processing step or some kind of tokenizer factory? Or a sort of idf for the search results rather than the entire document set? This seems like it would be a fairly common problem, and perhaps I've just missed how to do it.
Essentially this is Google's "To blah blah your search, we have hidden xxx similar results. Click here to show them" feature. Thoughts?
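To make the post-processing idea concrete, here is a minimal sketch in Python. It is not a Solr feature: it assumes the results for one site group have already been fetched and flattened into a list of dicts with hypothetical `url` and `snippet` keys, and it uses the standard library's `difflib.SequenceMatcher` as a stand-in similarity measure. The function names and the 0.85 threshold are illustrative only.

```python
from difflib import SequenceMatcher


def snippet_similarity(a, b):
    """Return a ratio in [0, 1] for how alike two snippets are, compared word by word."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()


def collapse_near_duplicates(results, threshold=0.85):
    """Split results into (kept, hidden) based on snippet similarity.

    A result is hidden when its snippet is at least `threshold` similar
    to the snippet of a result that has already been kept. Results are
    visited in their original (relevance) order, so the best-ranked
    member of each near-duplicate cluster is the one that stays visible.
    """
    kept, hidden = [], []
    for result in results:
        is_duplicate = any(
            snippet_similarity(result["snippet"], shown["snippet"]) >= threshold
            for shown in kept
        )
        (hidden if is_duplicate else kept).append(result)
    return kept, hidden
```

The `hidden` list is what a "show similar results" link would reveal. Each result is compared against every kept result, so the cost grows quadratically with the size of a group; that is probably tolerable for one page of grouped results but not for thousands of rows.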
Tags
<html><search><solr><nutch>
Title
Exclude duplicate results from Solr query based on highlight snippets?
singulars
PostAcceptedAnswerId
1. PO
 singulars
 PostTypePostTypeId
 PTAnswer
PostParentId
1. This table or related slice is empty.
PostTypePostTypeId
1. PTQuestion
UserLastEditorUserId
1. This table or related slice is empty.
UserOwnerUserId
1. USmlerley
plurals
PostLinksPostIdRelatedPostId
1. This table or related slice is empty.
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. This table or related slice is empty.
PostsParentIdCreationDate
1. PO
 singulars
 PostTypePostTypeId
 PTAnswer
2. PO
 singulars
 PostTypePostTypeId
 PTAnswer
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 POExclude duplicate results from Solr query based on highlight snippets?
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
CommentsPostId
1. This table or related slice is empty.
