StackOverflow2013

Note that there are some explanatory texts on larger screens.

plurals

POCombinatorial Optimization Matching in SQL
text
Body
copied!<p>I am having trouble developing a matching algorithm in SQL. I have one table <code>subjects</code>. Each of these needs to be matched to the same number of rows in the table <code>controls</code> (for the sake of this question let's say two rows or controls need to be selected for each subject). The location of the selected controls must match exactly, and the controls selected should have a value in <code>match_field</code> that is as close as possible to the subjects.</p> <p>Here is some sample data:</p> <p>Table subjects:</p> <pre><code>id location match_field 1 1 190 2 2 2000 3 1 100 </code></pre> <p>Table controls:</p> <pre><code>id location match_field 17 1 70 11 1 180 12 1 220 13 1 240 14 1 500 15 1 600 16 1 600 10 2 30 78 2 1840 79 2 2250 </code></pre> <p>Here would be the optimum result from the sample data:</p> <pre><code>subject_id control_id location match_field_diff 1 12 1 30 1 13 1 50 2 78 2 160 2 79 2 250 3 17 1 30 3 11 1 80 </code></pre> <p>It gets tricky, because, for example, control 11 is the closest match to subject 1. However, in the optimum solution control 11 is matched to subject 3.</p> <p>I believe the <a href="http://en.wikipedia.org/wiki/Hungarian_algorithm" rel="nofollow">Hungarian Algorithm</a> is close to the "correct" solution to this problem. However, there is not an equal number of subjects and controls, nor will all controls be used (I have a few thousand subjects and a few million potential controls).</p> <p>It is not necessary to obtain the absolute optimum results; a pretty good approximation would be fine with me.</p> <p>It seems that there should be a nice set-based solution to this problem, but I can't think of how to do it. Here is some code that assigns an equal number of controls to each subject based on location only:</p> <pre><code>select * from ( select subject.id, control.id, subject.location, row_number() over ( partition by subject.location order by subject.id, control.id ) as rn, count(distinct control.id) over ( partition by subject.location ) as controls_in_loc from subjects join controls on control.location = subject.location ) where mod(rn,controls_in_loc+1) = 1 </code></pre> <p>However, I can't figure out how to add the fuzzy matching component. I am using DB2 but can convert an algorithm into DB2 if you are using something else.</p> <p>Thanks in advance for your help!</p> <p><strong>Update:</strong> I am mostly convinced that SQL is not the right tool for this job. However, just to be sure (and because it is an interesting problem), I am offering a bounty to see if a working SQL solution is possible. It needs to be a set-based solution. It can use iteration (looping over the same query multiple times to achieve the result) but number of iterations needs to be far less than the number of rows for a large table. It should not loop over each element in the table or use cursors.</p>

Querying!

Guidance

An individual column

Larger individual text columns get their own page to allow for proper reading.

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload