  1. What is the fastest way to find similar records from different (big) data sources?
    <p>I work for a public health agency that has lots of different demographic datasets, stored in SQL Server, Access and Excel. I've written an application that lets people find 'matches' in those datasets based on different criteria, set up through a GUI. For instance, one 'match' might be that the First, Last and DOB match in both datasets, but the SSN is 'off by 1' (determined by the Levenshtein algorithm). </p>
    <p>These are big datasets, and the matching criteria can get really complex. Right now, I find matches by pulling both datasets into data tables in memory, then going row-by-row through the first table and checking whether any rows in the second table match (using LINQ). So my code looks something like: </p>
    <pre><code>For Each table1Row In table1Rows ' rows of TableOne/DatasourceOne
    ' rows of table 2 whose first AND last names are each at most one edit away
    Dim table2Options = From l In table2Rows
                        Where Levenshtein(table1Row.first, l.first) &lt; 2 AndAlso
                              Levenshtein(table1Row.last, l.last) &lt; 2
    If table2Options.Any() Then
        ' the row in table1 'matches' table 2
    End If
Next
</code></pre>
    <p>The code produces the correct output (it finds the matches), but it is very slow. I know that going row-by-row is supposed to be slower, but using LINQ to find all the records at once is even slower. </p>
    <pre><code>From l In table1, k In table2
Where Levenshtein(l.first, k.first) &lt; 2 AndAlso Levenshtein(l.last, k.last) &lt; 2
Select l
' this takes forever because it calculates the function for l rows * k rows
</code></pre>
    <p>Any ideas on how to do this core matching faster? </p>
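For readers who want something they can run, the matching rule described above can be sketched in a self-contained form. This is only a minimal Python illustration, not the application's actual VB/LINQ code: the `levenshtein` helper, the `find_matches` function, the `max_distance` parameter, and the dictionary keys `first` and `last` are all assumptions made for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard dynamic-programming recurrence."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def find_matches(table1_rows, table2_rows, max_distance=1):
    """Return rows of table 1 that have at least one near match in table 2.

    Hypothetical row shape: a dict with 'first' and 'last' keys.
    """
    matches = []
    for row1 in table1_rows:
        for row2 in table2_rows:
            if (levenshtein(row1["first"], row2["first"]) <= max_distance
                    and levenshtein(row1["last"], row2["last"]) <= max_distance):
                matches.append(row1)
                break  # one matching row in table 2 is enough
    return matches
```

In the worst case the inner comparison runs once for every pair of rows, which is the "l rows * k rows" cost noted in the question.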