Note that there are some explanatory texts on larger screens.

plurals
  1. PO
    primarykey
    data
    text
    <p>This is a complete rewrite of my previous answer. It turns out that in my previous attempts, I overlooked a much simpler method based on a combination of packed arrays and sparse arrays, that is much faster and more memory - efficient than all previous methods (at least in the range of sample sizes where I tested it), while only minimally changing the original <code>SubValues</code> - based approach. Since the question was asked about the most efficient method, I will remove the other ones from the answer (given that they are quite a bit more complex and take a lot of space. Those who would like to see them can inspect past revisions of this answer).</p> <h3>The original <code>SubValues</code> - based approach</h3> <p>We start by introducing a function to generate the test samples of configurations for us. Here it is:</p> <pre><code>Clear[generateConfigurations]; generateConfigurations[maxIndex_Integer, maxConfX_Integer, maxConfY_Integer, nconfs_Integer] := Transpose[{ RandomInteger[{1, maxIndex}, nconfs], Transpose[{ RandomInteger[{1, maxConfX}, nconfs], RandomInteger[{1, maxConfY}, nconfs] }]}]; </code></pre> <p>We can generate a small sample to illustrate:</p> <pre><code>In[3]:= sample = generateConfigurations[2,2,2,10] Out[3]= {{2,{2,1}},{2,{1,1}},{1,{2,1}},{1,{1,2}},{1,{1,2}}, {1,{2,1}},{2,{1,2}},{2,{2,2}},{1,{2,2}},{1,{2,1}}} </code></pre> <p>We have here only 2 indices, and configurations where both "x" and "y" numbers vary from 1 to 2 only - 10 such configurations.</p> <p>The following function will help us imitate the accumulation of frequencies for configurations, as we increment <code>SubValues</code>-based counters for repeatedly occurring ones:</p> <pre><code>Clear[testAccumulate]; testAccumulate[ff_Symbol, data_] := Module[{}, ClearAll[ff]; ff[_][_] = 0; Do[ doSomeStuff; ff[#1][#2]++ &amp; @@ elem; doSomeMoreStaff; , {elem, data}]]; </code></pre> <p>The <code>doSomeStuff</code> and <code>doSomeMoreStaff</code> symbols are here to represent some code that might preclude or follow the counting code. The <code>data</code> parameter is supposed to be a list of the form produced by <code>generateConfigurations</code>. For example:</p> <pre><code>In[6]:= testAccumulate[ff,sample]; SubValues[ff] Out[7]= {HoldPattern[ff[1][{1,2}]]:&gt;2,HoldPattern[ff[1][{2,1}]]:&gt;3, HoldPattern[ff[1][{2,2}]]:&gt;1,HoldPattern[ff[2][{1,1}]]:&gt;1, HoldPattern[ff[2][{1,2}]]:&gt;1,HoldPattern[ff[2][{2,1}]]:&gt;1, HoldPattern[ff[2][{2,2}]]:&gt;1,HoldPattern[ff[_][_]]:&gt;0} </code></pre> <p>The following function will extract the resulting data (indices, configurations and their frequencies) from the list of <code>SubValues</code>:</p> <pre><code>Clear[getResultingData]; getResultingData[f_Symbol] := Transpose[{#[[All, 1, 1, 0, 1]], #[[All, 1, 1, 1]], #[[All, 2]]}] &amp;@ Most@SubValues[f, Sort -&gt; False]; </code></pre> <p>For example:</p> <pre><code>In[10]:= result = getResultingData[ff] Out[10]= {{2,{2,1},1},{2,{1,1},1},{1,{2,1},3},{1,{1,2},2},{2,{1,2},1}, {2,{2,2},1},{1,{2,2},1}} </code></pre> <p>To finish with the data-processing cycle, here is a straightforward function to extract data for a fixed index, based on <code>Select</code>:</p> <pre><code>Clear[getResultsForFixedIndex]; getResultsForFixedIndex[data_, index_] := If[# === {}, {}, Transpose[#]] &amp;[ Select[data, First@# == index &amp;][[All, {2, 3}]]]; </code></pre> <p>For our test example,</p> <pre><code>In[13]:= getResultsForFixedIndex[result,1] Out[13]= {{{2,1},{1,2},{2,2}},{3,2,1}} </code></pre> <p>This is presumably close to what @zorank tried, in code. </p> <h3>A faster solution based on packed arrays and sparse arrays</h3> <p>As @zorank noted, this becomes slow for larger sample with more indices and configurations. We will now generate a large sample to illustrate that (<strong>note! This requires about 4-5 Gb of RAM, so you may want to reduce the number of configurations if this exceeds the available RAM</strong>):</p> <pre><code>In[14]:= largeSample = generateConfigurations[20,500,500,5000000]; testAccumulate[ff,largeSample];//Timing Out[15]= {31.89,Null} </code></pre> <p>We will now extract the full data from the <code>SubValues</code> of <code>ff</code>:</p> <pre><code>In[16]:= (largeres = getResultingData[ff]); // Timing Out[16]= {10.844, Null} </code></pre> <p>This takes some time, but one has to do this only once. But when we start extracting data for a fixed index, we see that it is quite slow:</p> <pre><code>In[24]:= getResultsForFixedIndex[largeres,10]//Short//Timing Out[24]= {2.687,{{{196,26},{53,36},{360,43},{104,144},&lt;&lt;157674&gt;&gt;,{31,305},{240,291}, {256,38},{352,469}},{&lt;&lt;1&gt;&gt;}}} </code></pre> <p>The main idea we will use here to speed it up is to pack individual lists inside the <code>largeres</code>, those for indices, combinations and frequencies. While the full list can not be packed, those parts individually can:</p> <pre><code>In[18]:= Timing[ subIndicesPacked = Developer`ToPackedArray[largeres[[All,1]]]; subCombsPacked = Developer`ToPackedArray[largeres[[All,2]]]; subFreqsPacked = Developer`ToPackedArray[largeres[[All,3]]]; ] Out[18]= {1.672,Null} </code></pre> <p>This also takes some time, but it is a one-time operation again.</p> <p>The following functions will then be used to extract the results for a fixed index much more efficiently:</p> <pre><code>Clear[extractPositionFromSparseArray]; extractPositionFromSparseArray[HoldPattern[SparseArray[u___]]] := {u}[[4, 2, 2]] Clear[getCombinationsAndFrequenciesForIndex]; getCombinationsAndFrequenciesForIndex[packedIndices_, packedCombs_, packedFreqs_, index_Integer] := With[{positions = extractPositionFromSparseArray[ SparseArray[1 - Unitize[packedIndices - index]]]}, {Extract[packedCombs, positions],Extract[packedFreqs, positions]}]; </code></pre> <p>Now, we have:</p> <pre><code>In[25]:= getCombinationsAndFrequenciesForIndex[subIndicesPacked,subCombsPacked,subFreqsPacked,10] //Short//Timing Out[25]= {0.094,{{{196,26},{53,36},{360,43},{104,144},&lt;&lt;157674&gt;&gt;,{31,305},{240,291}, {256,38},{352,469}},{&lt;&lt;1&gt;&gt;}}} </code></pre> <p>We get a 30 times speed-up w.r.t. the naive <code>Select</code> approach.</p> <h3>Some notes on complexity</h3> <p>Note that the second solution is faster because it uses optimized data structures, but its complexity is the same as that of <code>Select</code>- based one, which is, linear in the length of total list of unique combinations for all indices. Therefore, in theory, the previously - discussed solutions based on nested hash-table etc <em>may</em> be asymptotically better. The problem is, that in practice we will probably hit the memory limitations long before that. For the 10 million configurations sample, the above code was still 2-3 times faster than the fastest solution I posted before. </p> <p><strong>EDIT</strong></p> <p>The following modification:</p> <pre><code>Clear[getCombinationsAndFrequenciesForIndex]; getCombinationsAndFrequenciesForIndex[packedIndices_, packedCombs_, packedFreqs_, index_Integer] := With[{positions = extractPositionFromSparseArray[ SparseArray[Unitize[packedIndices - index], Automatic, 1]]}, {Extract[packedCombs, positions], Extract[packedFreqs, positions]}]; </code></pre> <p>makes the code twice faster still. Moreover, for more sparse indices (say, calling the sample-generation function with parameters like <code>generateConfigurations[2000, 500, 500, 5000000]</code> ), the speed-up with respect to the <code>Select</code>- based function is about <em>100</em> times. </p>
    singulars
    1. This table or related slice is empty.
    plurals
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. VO
      singulars
      1. This table or related slice is empty.
    2. VO
      singulars
      1. This table or related slice is empty.
    3. VO
      singulars
      1. This table or related slice is empty.
    1. COI won't pretend I fully understand what the newest code is doing, but I have a question. What's the purpose of `getValues` inside the `Module`? I see that it is a modification of an external variable, but I don't see what it's supposed to be doing.
      singulars
    2. CO@rcollyer The newest code is actually not hard at all, I just wasn't able to explain it well. All you do is to take the idiom of building a linked list, and insert it inside a pure function, to which `f[index]` will evaluate for all subsequent calls. I use that heads are computed before anything else, so when the first time we hit a given index in expression like `f[5][{{1,2},3}]`, the memoized code is executed, and `f[5]` is memoized as a pure function. We then evaluate `Function[...][{{1,2},3}]`, and the same for the subsequent calls to `f[5]`. The next question is how to make the pure ...
      singulars
    3. CO@rcollyer ... function modify the linked list. The answer is to embed the linked-list building code in it. The next question is how to localize the `storage` variable, so that it is not exposed to the user *and* a new variable is created for every new index. The answer is to wrap memoization code in `Module`. The final question is how to get the value of our local variable `storage` at the end -since it is not exposed. The answer is to define a global accessor function (getter method) at the memoization time - this is why we need `getValues`. It gives access to, but can not modify, `storage`.
      singulars
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload