Exclude duplicate results from Solr query based on highlight snippets?
    <p><strong>The scene:</strong></p> <p>I have indexed many websites using Nutch and Solr. I've implemented result grouping by site. My results output includes the page title, highlight snippets and URL. My issue is with the page navigation/copyright/company info bits that appear on many company sites.</p> <p>A query for "solder", for example, may return 200+ results for a particular site -- but only a handful of the results are actually appropriate; perhaps the company's site structure includes "solder" on every page as part of their core business description, site navigation, etc. There are relevant results to see, but they're flooded by the irrelevant, repetitive matches from the other pages on the site.</p> <p><strong>The problem:</strong></p> <p>I've seen other postings asking how to prevent Nutch and Solr from indexing site headers, footers, navigation and others but with such a diverse group of sites, this approach just isn't feasible. What I'm observing, however, is that although the content for each result is significantly different, the highlighted snippets returned are 90-100% identical for the results I don't want. 
Observe:</p> <pre><code>Products | Alloy Information || --------
-Free Solutions Halogen-Free Products Sales Contacts Technical Articles Industry Links Terms &amp; Conditions Products Support Site Map Lead-Free Solutions Halogen-Free Products Sales Contacts Technical Articles Industry
http://www.--------.com/Products/AlloyInformation.aspx

Products | Chemicals &amp; Cleaners || --------
-Free Solutions Halogen-Free Products Sales Contacts Technical Articles Industry Links Terms &amp; Conditions Products Industrial Division Products Services News Support Site Map Lead-Free Solutions Halogen-Free Products Sales
http://www.--------.com/Products/ChemicalsCleaners.aspx

Products | Rosin Based || --------
-Free Solutions Halogen-Free Products Sales Contacts Technical Articles Industry Links Terms &amp; Conditions Products Products Services News Support Site Map Lead-Free Solutions Halogen-Free Products Sales Contacts Technical
http://www.--------.com/Products/RosinBased.aspx

Support | Engineering Guide || --------
-Free Solutions Halogen-Free Products Sales Contacts Technical Articles Industry Links Terms &amp; Conditions Support Products Services News Support Site Map Lead-Free Solutions Halogen-Free Products Sales Contacts Technical
http://www.--------.com/Support/EngineeringGuide.aspx
</code></pre> <p><strong>The Big Idea:</strong></p> <p>This leads me to the question of whether I can filter or group results based on the highlighted snippets that are returned. I can't just group on the content because 1) the field is huge; and 2) the content is very different from page to page. If I could group, exclude or deduplicate results whose snippets were &gt;85% identical, that would probably solve the problem. Perhaps some sort of post-processing step or some kind of tokenizer factory? Or a sort of idf for the search results rather than the entire document set?</p> <p>This seems like it would be a fairly common problem, and perhaps I've just missed how to do it. 
Essentially this is Google's "To blah blah your search, we have hidden xxx similar results. Click here to show them" feature.</p> <p>Thoughts?</p>
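To make the post-processing idea concrete, here is a minimal sketch of what collapsing near-identical snippets could look like on the client side, after Solr has returned the results for one site group. It is only an illustration under stated assumptions: the result shape (a list of dicts with a `snippet` key), the function names and the word-level comparison are all hypothetical, and nothing here is a Solr or Nutch feature. It uses `difflib.SequenceMatcher` from the Python standard library to score similarity.

```python
from difflib import SequenceMatcher


def snippet_similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two snippets, compared word by word."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


def collapse_similar(results, threshold=0.85):
    """Split results into (kept, hidden) based on snippet similarity.

    `results` is assumed to be a list of dicts with a "snippet" key
    (a hypothetical shape, not Solr's response format). A result is hidden
    when its snippet is more than `threshold` similar to a snippet that
    has already been kept; otherwise it is kept.
    """
    kept, hidden = [], []
    for result in results:
        is_duplicate = any(
            snippet_similarity(result["snippet"], k["snippet"]) > threshold
            for k in kept
        )
        (hidden if is_duplicate else kept).append(result)
    return kept, hidden
```

The `hidden` list is what a "we have hidden xxx similar results" link would reveal. Each result is compared against every kept result, so the cost grows quadratically with the number of distinct snippets; that is probably acceptable for a single page of results per site, but not for the full result set.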