Note that there are some explanatory texts on larger screens.

plurals
  1. PO
    primarykey
    data
    text
    <p>Here's the grain of an algorithm. It certainly isn't flushed out or tested, and it may not be complete. I'm just throwing it out here as a possible starting point.</p> <p>It seems the most challenging issue is <em>time</em> required to run the algorithm over billions of rows, followed perhaps by <em>memory</em> limitations.</p> <p>I also believe the fundamental task involved in solving this problem lies in the single operation of "comparing one set of numbers with another" to locate a shared set.</p> <p>Therefore, might I suggest the following (rough) approach, in order to tackle <em>both</em> time, <em>and</em> memory:</p> <pre><code>(1) Consolidate multiple sets into a single, larger set. </code></pre> <p>i.e., take 100 consecutive sets (in your example, <code>23, 67, 34, 23, 54</code>, <code>23, 54</code>, <code>78, 96, 23</code>, and the following 97 sets), and simply merge them together into a <strong>single</strong> set (ignoring duplicates).</p> <pre><code>(2) Give each *consolidated* set from (1) a label (or index), and then map this set (by its label) to the original sets that compose it. </code></pre> <p>In this way, you will be able to retrieve (look up) the original individual sets <code>23, 67, 34, 23, 54</code>, etc.</p> <pre><code>(3) The data is now denormalized - there are a much smaller number of sets, and each set is much larger. </code></pre> <p>Now, the algorithm moves onto a new stage.</p> <pre><code>(4) Develop an algorithm to look for matching sequences between any two of these larger sets. </code></pre> <p>There will be many false positives; however, hopefully the nature of your data is that the false positives will not "ruin" the efficiency that is gained by this approach.</p> <p>I don't provide an algorithm to perform the matching between 2 individual sets here; I assume that you can come up with one yourself (sort both the sets, etc.).</p> <pre><code>(5) For every possible matching sequence found in (4), iterate through the individual sets that compose the two larger sets being compared, weeding out false positives. </code></pre> <p>I suspect that the above step could be optimized significantly, but this is the basic idea.</p> <p>At this point, you will have all of the matching sequences between all original sets that compose the two larger sets being compared.</p> <pre><code>(6) Execute steps (4) and (5) for every pair of large sets constructed in (2). </code></pre> <p>Now, you will have ALL matching sequences - with duplicates.</p> <pre><code>(7) Remove duplicates from the set of matching sequences. </code></pre> <p>Just a thought.</p>
    singulars
    1. This table or related slice is empty.
    plurals
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. VO
      singulars
      1. This table or related slice is empty.
    2. VO
      singulars
      1. This table or related slice is empty.
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload