Choose or generate canonical variant from multiple sentences
<p>I'm working with an API that maps my GTIN/EAN queries to product data.</p>
<p>Since the data returned originates from merchant product feeds, the following is almost universally the case:</p>
<ul>
<li>Multiple results per GTIN</li>
<li>Products' titles are pretty much unstructured</li>
<li>Products' titles are "polluted" with
<ul>
<li>SEO-related stuff,</li>
<li>information about the quantity contained,</li>
<li>"buy two, get one free" offers,</li>
<li>etc.</li>
</ul></li>
</ul>
<p><strong>I'm looking for a programmatic way to either</strong></p>
<ul>
<li><strong><em>choose</em> the "cleanest"/most canonical version available</strong></li>
<li><strong>or <em>generate</em> a new one that represents the "lowest common denominator".</strong></li>
</ul>
<p>Consider the following example results for a single EAN query:</p>
<ul>
<li>Nivea Deo Roll-On Dry Impact for Men</li>
<li>NIVEA DEO Roll on Dry/blau</li>
<li>Nivea Deo Roll-On Dry Impact for Men, 50 ml, 3er Pack (3 x 50 ml)</li>
<li>Nivea Deo Roll on Dry/blau 50 ml</li>
<li>Nivea Deoroller 50ml dry for Men blau Mindestabnahme: 6 Stück (1 VE)</li>
<li>NIVEA Deoroller, Dry Impact for Men</li>
<li>NIVEA DEO Roll on Dry/blau_50 ml</li>
</ul>
<p><strong>My homebrew approach looks like this:</strong></p>
<ul>
<li>Basic cleanup:
<ul>
<li>Lowercase the titles,</li>
<li>strip excessive whitespace,</li>
<li>throw out apparent stopwords such as "buy" and "click"</li>
</ul></li>
<li>Build an array of <code>word =&gt; global occurrence</code>
<ul>
<li><code>"Nivea" =&gt; 7</code></li>
<li><code>"Deo" =&gt; 5</code></li>
<li><code>"Deoroller" =&gt; 2</code></li>
<li><code>…</code></li>
<li><code>"VE" =&gt; 1</code></li>
</ul></li>
<li>Calculate the "cumulative word value" for each of the titles
<ul>
<li><code>"Nivea Deo" =&gt; 12</code></li>
<li><code>"Nivea Deoroller VE" =&gt; 10</code></li>
</ul></li>
<li>Divide the cumulative value by the number of words in the title, resulting in a score
<ul>
<li><code>"Nivea Deo" =&gt; 6</code></li>
<li><code>"Nivea Deoroller VE" =&gt; 3.33</code></li>
</ul></li>
</ul>
<p>Obviously, my approach is pretty basic, error-prone and biased towards short sentences with frequently used words; it yields more or less satisfactory results.</p>
<ul>
<li><strong>Would you choose a different approach?</strong></li>
<li><strong>Is there some NLP magic way to take care of the problem that I don't know of?</strong></li>
</ul>
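<p>For illustration, the scoring steps described above could be sketched roughly as follows in Python. This is only a minimal sketch under assumptions: the names <code>STOPWORDS</code>, <code>tokenize</code> and <code>score_titles</code> are hypothetical, the stopword list contains just the two words mentioned in the post, and splitting on every non-alphanumeric character is one possible reading of the cleanup step.</p>

```python
import re
from collections import Counter

# Hypothetical stopword list; only "buy" and "click" are named in the post.
STOPWORDS = {"buy", "click"}


def tokenize(title):
    """Lowercase a title, split it into alphanumeric words, drop stopwords.

    Assumption: any character that is not a letter or digit (including
    "-", "/" and "_") acts as a separator.
    """
    words = re.findall(r"[^\W_]+", title.lower())
    return [w for w in words if w not in STOPWORDS]


def score_titles(titles):
    """Return (score, title) pairs sorted by score, highest first.

    score = sum of the global occurrence counts of a title's words,
            divided by the number of words in that title.
    """
    tokenized = [tokenize(t) for t in titles]
    # Global occurrence count of every word across all titles.
    counts = Counter(w for words in tokenized for w in words)

    scored = []
    for title, words in zip(titles, tokenized):
        if not words:
            continue  # nothing left after cleanup, so no score
        cumulative = sum(counts[w] for w in words)
        scored.append((cumulative / len(words), title))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    for score, title in score_titles(["Nivea Deo", "Nivea Deoroller VE"]):
        print(f"{score:.2f}  {title}")
```

<p>Because the score is an average of word frequencies, a short title made of common words always beats a longer title that adds even one rare word, which is exactly the bias mentioned above.</p>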