Note that there are some explanatory texts on larger screens.

plurals
  1. POSuggestions needed for optimizing O(n^2) algorithm
    text
    copied!<p>I am looking to optimize a fairly simple algorithm that is currently <em>O(n<sup>2</sup>)</em>. I have a file of records, where each one needs to be compared to every other in the same file. If the two are the 'same' (comparator function is pretty complicated), the matched records are output. Note that there may be several records that match each other, <b><em>and there is no sense of order - only if the match is True or False.</em></b></p> <p>Pseudo code:</p> <pre><code> For (outRec in sourceFile) { Get new filePointer for targetFile //starting from the top of the file for inner loop For (inRec in targetFile) { if (compare(outRec, inRec) == TRUE ) { write outRec write inRec } increment some counters } increment some other counters } </code></pre> <p>The data is not sorted in any way, and there is no preprocessing possible to order the data.</p> <p>Any ideas on how this could become something less than <em>O(n<sup>2</sup>)</em>? I am thinking of applying the MapReduce paradigm on the code, breaking up the outer AND inner loops, possibly using a chained Map function. I am pretty sure I have the code figured out on Hadoop, but wanted to check alternatives before I spent time coding it.</p> <p>Suggestions appreciated!</p> <p>Added: Record types. Basically, I need to match names/strings. The types of matching are shown in the example below.</p> <pre><code> 1,Joe Smith,Daniel Foster<br> 2,Nate Johnson,Drew Logan<br> 3,Nate Johnson, Jack Crank<br> 4,Joey Smyth,Daniel Jack Foster<br> 5,Joe Morgan Smith,Daniel Foster<br> <br> Expected output: Records 1,4,5 form a match set End of output </code></pre> <p><b>Added: these files will be quite large. The largest file is expected to be around 200 million records.</b></p>
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload