Note that there are some explanatory texts on larger screens.

plurals
  1. POBest way to detect similar email addresses?
    primarykey
    data
    text
    <p>I have a list of ~20,000 email addresses, some of which I know to be fraudulent attempts to get around a "1 per e-mail" limit, such as username1@gmail.com, username1a@gmail.com, username1b@gmail.com, etc. I want to find similar email addresses for evaluation. Currently I'm using a Levenshtein algorithm to check each e-mail against the others in the list and report any with an edit distance of less than 2. However, this is painstakingly slow. Is there a more efficient approach?</p> <p>The test code I'm using now is:</p> <pre><code>using System; using System.Collections.Generic; using System.Linq; using System.Text; using System.IO; using System.Threading; namespace LevenshteinAnalyzer { class Program { const string INPUT_FILE = @"C:\Input.txt"; const string OUTPUT_FILE = @"C:\Output.txt"; static void Main(string[] args) { var inputWords = File.ReadAllLines(INPUT_FILE); var outputWords = new SortedSet&lt;string&gt;(); for (var i = 0; i &lt; inputWords.Length; i++) { if (i % 100 == 0) Console.WriteLine("Processing record #" + i); var word1 = inputWords[i].ToLower(); for (var n = i + 1; n &lt; inputWords.Length; n++) { if (i == n) continue; var word2 = inputWords[n].ToLower(); if (word1 == word2) continue; if (outputWords.Contains(word1)) continue; if (outputWords.Contains(word2)) continue; var distance = LevenshteinAlgorithm.Compute(word1, word2); if (distance &lt;= 2) { outputWords.Add(word1); outputWords.Add(word2); } } } File.WriteAllLines(OUTPUT_FILE, outputWords.ToArray()); Console.WriteLine("Found {0} words", outputWords.Count); } } } </code></pre> <p><strong>Edit:</strong> Some of the stuff I'm trying to catch looks like:</p> <p>01234567890@gmail.com<br> 0123456789@gmail.com<br> 012345678@gmail.com<br> 01234567@gmail.com<br> 0123456@gmail.com<br> 012345@gmail.com<br> 01234@gmail.com<br> 0123@gmail.com<br> 012@gmail.com </p>
    singulars
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    plurals
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload