  1. Record Matching algorithms for an inconsistent dataset
<p>I'm working with a large dataset of products (~1 million). These products come from many different sources, so the way their data is listed is inconsistent. One of the big issues is variance in product brand names (~17,000 unique brands). Some brands have as many as 10 variants that need to be related together.</p>
<p><strong>Issues:</strong></p>
<hr>
<ol>
<li><strong>Inconsistent Spacing:</strong> Jet Boil VS Jetboil</li>
<li><strong>Punctuation:</strong> Granger's VS Grangers</li>
<li><strong>Noise Words:</strong> The North Face VS North Face</li>
<li><strong>Taxonomies:</strong> Armada VS Armada Skis</li>
<li><strong>Symbols:</strong> Phil and Teds VS Phil&amp;Teds</li>
<li><strong>Misspelling:</strong> Patagonia VS Pategonia</li>
<li><strong>Other Oddities:</strong> Bell Sports VS Bell Sports #81037</li>
</ol>
<p><strong>Example Dataset</strong></p>
<hr>
<pre><code>Black Diamond
Black Diamond (Uda)
Black Diamond Co
Black Diamond Eq Ltd
Black Diamond Eqp #76800
Black Diamond Equipment
Black Dog Machine Llc
Black Dome Press
Black Dot
Black Dragon
Black Fire
Black Flys
Black Forest Girl
Black Gold
Black Hawk Inc.
Black Hills
Black Knight
Black Label
Black Magic
Black Marine
Black Market Bikes
Black Max
Black Opal
Black Ops
Black Rain Ordance Inc.
Black Rain Ordnance
Black Rapid
Black Ribbon
Black Rifle Disease Engineerin
Black River Bucks
Black Seal
Black Seed
Black Swan
Black Tower
Black Widow
Black's
</code></pre>
<hr>
<p><strong>Consequences (as suggested in a comment)</strong></p>
<ul>
<li>An incorrect association will result in unrelated brands being displayed in product searches, weakening the usability of the presentation layer.</li>
<li>A missed association will result in the same brand being displayed multiple times in a filter list, which also weakens the usability of the presentation layer.</li>
</ul>
<p>I realize that this is a large problem and likely beyond the scope of what can be resolved in a Stack Overflow question, but I'm looking for inspiration on how to tackle it.</p>
<p>Any algorithm, software pattern, or process that may help is welcome.</p>
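<p>To make the issue categories above concrete, here is a rough Python sketch of a naive normalization key plus a similarity check. It is an illustration of the problem, not a proposed solution. The function names <code>brand_key</code> and <code>same_brand</code>, the word lists, and the 0.85 threshold are all assumptions made up for this example and would need tuning against the real data.</p>

```python
import re
from difflib import SequenceMatcher

# Assumed, illustrative word lists; real ones would have to be curated.
NOISE_WORDS = {"the"}
SUFFIX_WORDS = {"skis", "co", "inc", "ltd", "llc"}


def brand_key(name: str) -> str:
    """Reduce a raw brand string to a crude comparison key."""
    s = name.lower()
    s = s.replace("&", " and ")        # Symbols: Phil&Teds -> phil and teds
    s = re.sub(r"#\d+", " ", s)        # Other oddities: codes such as #81037
    s = re.sub(r"['’.]", "", s)        # Punctuation: Granger's -> grangers
    s = re.sub(r"[^a-z0-9]+", " ", s)  # Anything else becomes a separator
    words = [
        w for w in s.split()
        if w not in NOISE_WORDS and w not in SUFFIX_WORDS
    ]
    return "".join(words)              # Spacing: jet boil -> jetboil


def same_brand(a: str, b: str, threshold: float = 0.85) -> bool:
    """Guess whether two raw brand strings refer to the same brand."""
    ka, kb = brand_key(a), brand_key(b)
    if not ka or not kb:
        return False
    if ka == kb:
        return True
    # Misspelling: fall back to a character-level similarity ratio.
    return SequenceMatcher(None, ka, kb).ratio() >= threshold


if __name__ == "__main__":
    pairs = [
        ("Jet Boil", "Jetboil"),
        ("The North Face", "North Face"),
        ("Patagonia", "Pategonia"),
        ("Black Seal", "Black Seed"),
    ]
    for a, b in pairs:
        print(f"{a!r} vs {b!r}: {same_brand(a, b)}")
```

<p>The threshold maps directly onto the two consequences listed above: lowering it produces more incorrect associations, while raising it produces more missed ones.</p>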
 
