StackOverflow2013

Note that there are some explanatory texts on larger screens.

plurals

POFuzzy text (sentences/titles) matching in C#
text
Body
copied!<p>Hey, I'm using <a href="http://en.wikipedia.org/wiki/Levenshtein_distance" rel="nofollow noreferrer">Levenshteins</a> algorithm to get distance between source and target string.</p> <p>also I have method which returns value from 0 to 1:</p> <pre><code>/// <summary> /// Gets the similarity between two strings. /// All relation scores are in the [0, 1] range, /// which means that if the score gets a maximum value (equal to 1) /// then the two string are absolutely similar /// </summary> /// <param name="string1">The string1.</param> /// <param name="string2">The string2.</param> /// <returns></returns> public static float CalculateSimilarity(String s1, String s2) { if ((s1 == null) || (s2 == null)) return 0.0f; float dis = LevenshteinDistance.Compute(s1, s2); float maxLen = s1.Length; if (maxLen < s2.Length) maxLen = s2.Length; if (maxLen == 0.0F) return 1.0F; else return 1.0F - dis / maxLen; } </code></pre> <p>but this for me is not enough. Because I need more complex way to match two sentences.</p> <p>For example I want automatically tag some music, I have original song names, and i have songs with trash, like <em>super, quality,</em> years like <em>2007, 2008,</em> etc..etc.. also some files have just <a href="http://trash..thash..song_name_mp3.mp3" rel="nofollow noreferrer">http://trash..thash..song_name_mp3.mp3</a>, other are normal. I want to create an algorithm which will work just more perfect than mine now.. Maybe anyone can help me?</p> <p>here is my current algo:</p> <pre><code>/// <summary> /// if we need to ignore this target. /// </summary> /// <param name="targetString">The target string.</param> /// <returns></returns> private bool doIgnore(String targetString) { if ((targetString != null) && (targetString != String.Empty)) { for (int i = 0; i < ignoreWordsList.Length; ++i) { //* if we found ignore word or target string matching some some special cases like years (Regex). if (targetString == ignoreWordsList[i] || (isMatchInSpecialCases(targetString))) return true; } } return false; } /// <summary> /// Removes the duplicates. /// </summary> /// <param name="list">The list.</param> private void removeDuplicates(List<String> list) { if ((list != null) && (list.Count > 0)) { for (int i = 0; i < list.Count - 1; ++i) { if (list[i] == list[i + 1]) { list.RemoveAt(i); --i; } } } } /// <summary> /// Does the fuzzy match. /// </summary> /// <param name="targetTitle">The target title.</param> /// <returns></returns> private TitleMatchResult doFuzzyMatch(String targetTitle) { TitleMatchResult matchResult = null; if (targetTitle != null && targetTitle != String.Empty) { try { //* change target title (string) to lower case. targetTitle = targetTitle.ToLower(); //* scores, we will select higher score at the end. Dictionary<Title, float> scores = new Dictionary<Title, float>(); //* do split special chars: '-', ' ', '.', ',', '?', '/', ':', ';', '%', '(', ')', '#', '\"', '\'', '!', '|', '^', '*', '[', ']', '{', '}', '=', '!', '+', '_' List<String> targetKeywords = new List<string>(targetTitle.Split(ignoreCharsList, StringSplitOptions.RemoveEmptyEntries)); //* remove all trash from keywords, like super, quality, etc.. targetKeywords.RemoveAll(delegate(String x) { return doIgnore(x); }); //* sort keywords. targetKeywords.Sort(); //* remove some duplicates. removeDuplicates(targetKeywords); //* go through all original titles. foreach (Title sourceTitle in titles) { float tempScore = 0f; //* split orig. title to keywords list. List<String> sourceKeywords = new List<string>(sourceTitle.Name.Split(ignoreCharsList, StringSplitOptions.RemoveEmptyEntries)); sourceKeywords.Sort(); removeDuplicates(sourceKeywords); //* go through all source ttl keywords. foreach (String keyw1 in sourceKeywords) { float max = float.MinValue; foreach (String keyw2 in targetKeywords) { float currentScore = StringMatching.StringMatching.CalculateSimilarity(keyw1.ToLower(), keyw2); if (currentScore > max) { max = currentScore; } } tempScore += max; } //* calculate average score. float averageScore = (tempScore / Math.Max(targetKeywords.Count, sourceKeywords.Count)); //* if average score is bigger than minimal score and target title is not in this source title ignore list. if (averageScore >= minimalScore && !sourceTitle.doIgnore(targetTitle)) { //* add score. scores.Add(sourceTitle, averageScore); } } //* choose biggest score. float maxi = float.MinValue; foreach (KeyValuePair<Title, float> kvp in scores) { if (kvp.Value > maxi) { maxi = kvp.Value; matchResult = new TitleMatchResult(maxi, kvp.Key, MatchTechnique.FuzzyLogic); } } } catch { } } //* return result. return matchResult; } </code></pre> <p>This works normally but just in some cases, a lot of titles which should match, does not match... I think I need some kind of formula to play with weights and etc, but i can't think of one.. </p> <p>Ideas? Suggestions? Algos?</p> <p>by the way I already know this topic (My colleague already posted it but we cannot come with a proper solution for this problem.): <a href="https://stackoverflow.com/questions/49263/approximate-string-matching-algorithms">Approximate string matching algorithms</a></p>

Querying!

Guidance

An individual column

Larger individual text columns get their own page to allow for proper reading.

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload