Note that there are some explanatory texts on larger screens.

plurals
  1. PO
    text
    copied!<p>If your rules are on the highest level of abstraction (e.g. any unknown comparison function), you can't achive your goal. 10^14 comparison operations will run for ages.</p> <p>If the rules are not completely general I see 3 solutions to optimize different cases:</p> <ul> <li><p><strong>if comparison is transitive and you can calculate hash</strong> (somebody already recommended this), do it. Hashes can also be complicated, not only your rules =). Find good hash function and it might help in many cases.</p></li> <li><p><strong>if entities are sortable</strong>, sort them. For this purpose I'd recommend not sorting in-place, but build an array of indexes (or IDs) of items. If your comparison can be transformed to SQL (as I understand your data is in database), you can perform this on the DBMS side more efficiently and read the sorted indexes (for example 3,1,2 which means that item with ID=3 is the lowest, with ID=1 is in the middle and with ID=2 is the largest). Then you need to compare only adjacent elements.</p></li> <li><p><strong>if things are worth</strong>, I would try to use some heuristical sorting or hashing. I mean I would create hash which not necessarily uniquely identifies equal elements, but can split your dataset in groups between which there are definitely no one pair of equal elements. Then all equal pairs will be in the inside groups and you can read groups one by one and do manual complex function calculation in the group of not 10 000 000, but for example 100 elements. The other sub-approach is heuristical sorting with the same purpose to guarantee that equal elements aren't on the different endings of a dataset. After that you can read elements one by one and compare with 1000 previous elements for example (already read and kept in memory). I would keep in memory for example 1100 elements and free oldest 100 every time new 100 comes. This would optimize your DB reads. The other implementation of this may be possible also in case your rules contains rules like (Attribute1=Value1) AND (...), or rule like (Attribute1 &lt; Value2) AND (...) or any other simple rule. Then you can make clusterisation first by this criterias and then compare items in created clusters.</p></li> </ul> <p>By the way, what if your rule considers all 10 000 000 elements equal? Would you like to get 10^14 result pairs? This case proves that you can't solve this task in general case. Try making some limitations and assumptions.</p>
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload