<p>Sphinx won't do it on its own. It can't just 'magically' group similar items into clusters of likely duplicates. </p>
<p>(If the titles were identical - character for character - you could just group by title, but that's not the case in your example.) </p>
<p>Once you've got your documents into clusters, i.e. assigned each one a 'cluster-id', Sphinx can help you search or render results using the built-in group by. The two items in your example would have the same cluster-id. A unique article not mentioned by multiple sources would have its own id. </p>
<hr>
<p>So first you need to cluster your documents. </p>
<p>There are dedicated tools for this type of thing, for example: <a href="https://github.com/open-city/dedupe" rel="nofollow">https://github.com/open-city/dedupe</a></p>
<p>But a very basic one could actually be built with Sphinx. It would probably work OK in your example, because the titles contain the same words, just in a different order. </p>
<p>Basically you just need a script that loops through all documents that DON'T have a cluster-id, and for each one runs a Sphinx search against the index, looking for duplicates. If one is found, copy its cluster-id, otherwise just allocate a fresh unique id. </p>
<p>This script can then just be run after inserting news documents, to 'cluster' any new stories. </p>
<p>The exact Sphinx query can be varied. E.g. just including the words in a basic query would require all the same words, regardless of order. But you could also use a quorum search to require only most of the words to match. </p>
<p>You might also want to filter by date, to avoid treating stories from wildly different dates as duplicates. </p>
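<p>A minimal sketch of that loop, under stated assumptions. The <code>Doc</code> fields, the 0.7 word fraction, the 3-day window and every function name here are hypothetical, not something Sphinx provides. The search step is passed in as a function, and an in-memory stand-in is used so the example runs without a server. In a real setup that function would send the string from <code>quorum_query</code> to Sphinx in a <code>MATCH()</code>, with a date range filter on an attribute.</p>

```python
import re
from dataclasses import dataclass
from datetime import date
from itertools import count
from typing import Callable, Iterable, Iterator, List, Optional


@dataclass
class Doc:  # hypothetical document shape, for illustration only
    doc_id: int
    title: str
    published: date
    cluster_id: Optional[int] = None  # None means "not clustered yet"


def tokenize(title: str) -> List[str]:
    """Lowercased words of a title, duplicates dropped, order kept."""
    return list(dict.fromkeys(re.findall(r"\w+", title.lower())))


def quorum_query(title: str, fraction: float = 0.7) -> str:
    """Build a Sphinx extended-syntax quorum query of the form "w1 w2 w3"/N."""
    words = tokenize(title)
    if not words:
        return ""
    needed = max(1, int(len(words) * fraction))  # rounded down, at least 1
    return '"%s"/%d' % (" ".join(words), needed)


def make_memory_search(docs: List[Doc], fraction: float = 0.7,
                       window_days: int = 3) -> Callable[[Doc], Optional[int]]:
    """In-memory stand-in for the Sphinx search (an assumption of this sketch).

    Returns the cluster-id of the first already-clustered document that shares
    enough title words and was published within window_days of the given one.
    """
    def find_cluster(doc: Doc) -> Optional[int]:
        words = set(tokenize(doc.title))
        needed = max(1, int(len(words) * fraction))
        for other in docs:
            if other.cluster_id is None or other.doc_id == doc.doc_id:
                continue
            if abs((other.published - doc.published).days) > window_days:
                continue  # date filter: too far apart to be the same story
            if len(words & set(tokenize(other.title))) >= needed:
                return other.cluster_id
        return None
    return find_cluster


def cluster_new_docs(docs: Iterable[Doc],
                     find_cluster: Callable[[Doc], Optional[int]],
                     next_id: Iterator[int]) -> None:
    """Give every unclustered document a matching or a fresh cluster-id."""
    for doc in docs:
        if doc.cluster_id is not None:
            continue
        match = find_cluster(doc)
        doc.cluster_id = match if match is not None else next(next_id)


docs = [
    Doc(1, "Mayor opens new city bridge", date(2014, 9, 9)),
    Doc(2, "New city bridge opens, mayor attends", date(2014, 9, 10)),
    Doc(3, "Local team wins cup final", date(2014, 9, 10)),
    Doc(4, "Mayor opens new city bridge", date(2015, 9, 9)),
]
cluster_new_docs(docs, make_memory_search(docs), count(1))
print([d.cluster_id for d in docs])
print(quorum_query(docs[0].title))
```

<p>The first two titles end up in the same cluster, while the last one gets its own id because of the date filter, even though its words are identical. Once the cluster-ids are stored as an attribute, Sphinx can group results by it.</p>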
 
