StackOverflow2013

Fuzzy Group By, Grouping Similar Words
<p>This question has been asked here before:</p>
<p><a href="https://stackoverflow.com/questions/6579263/what-is-a-good-strategy-to-group-similar-words">What is a good strategy to group similar words?</a></p>
<p>but no clear answer was given on how to "group" items. The difflib-based solution is essentially a search: for a given item, difflib returns the most similar word from a list. But how can this be used for grouping?</p>
<p>I would like to reduce</p>
<pre><code>['ape', 'appel', 'apple', 'peach', 'puppy'] </code></pre>
<p>to</p>
<pre><code>['ape', 'appel', 'peach', 'puppy'] </code></pre>
<p>or</p>
<pre><code>['ape', 'apple', 'peach', 'puppy'] </code></pre>
<p>One idea I tried was, for each item, to iterate through the list: if get_close_matches returns more than one match, use it; if not, keep the word as is. This partly worked, but it can suggest apple for appel and then appel for apple, so these words simply switch places and nothing changes.</p>
<p>I would appreciate any pointers, names of libraries, etc.</p>
<p>Note: performance also matters. We have a list of 300,000 items, and get_close_matches seems a bit slow. Does anyone know of a C/C++ based solution out there?</p>
<p>Thanks,</p>
<p>Note: Further investigation revealed that k-medoids is the right algorithm (as is hierarchical clustering). k-medoids does not require separately computed "centers"; it uses the data points themselves as centers (these points are called medoids, hence the name). In the word grouping case, the medoid would be the representative element of its group / cluster.</p>
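
One way to avoid the apple/appel swap described above is to compare each word only against the representatives chosen so far, rather than against the whole list. The following is a minimal sketch of that idea, not a k-medoids implementation: the function name `group_similar` and the `cutoff=0.78` threshold are assumptions chosen for illustration, and the first word seen in each group simply becomes its representative.

```python
from difflib import get_close_matches


def group_similar(words, cutoff=0.78):
    """Greedily assign each word to an existing representative it resembles.

    Hypothetical helper: the name and the cutoff value are illustrative.
    """
    groups = {}  # representative -> members (dicts preserve insertion order)
    for word in words:
        # Compare only against current representatives, never against every
        # word, so 'appel' and 'apple' cannot keep suggesting each other.
        match = get_close_matches(word, list(groups), n=1, cutoff=cutoff)
        if match:
            groups[match[0]].append(word)
        else:
            # No representative is close enough: start a new group.
            groups[word] = [word]
    return groups


words = ['ape', 'appel', 'apple', 'peach', 'puppy']
groups = group_similar(words)
print(list(groups))
```

On the sample list this should leave `'appel'` as the representative and fold `'apple'` into its group. The result depends on input order and on the cutoff, and each word is still compared against every representative, so on 300,000 items this is likely to remain slow when most words end up in their own group.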
