StackOverflow2013

Note that there are some explanatory texts on larger screens.

plurals

PO
text
Body
copied!<p>This is a complete rewrite of my previous answer. It turns out that in my previous attempts, I overlooked a much simpler method based on a combination of packed arrays and sparse arrays, that is much faster and more memory - efficient than all previous methods (at least in the range of sample sizes where I tested it), while only minimally changing the original <code>SubValues</code> - based approach. Since the question was asked about the most efficient method, I will remove the other ones from the answer (given that they are quite a bit more complex and take a lot of space. Those who would like to see them can inspect past revisions of this answer).</p> <h3>The original <code>SubValues</code> - based approach</h3> <p>We start by introducing a function to generate the test samples of configurations for us. Here it is:</p> <pre><code>Clear[generateConfigurations]; generateConfigurations[maxIndex_Integer, maxConfX_Integer, maxConfY_Integer, nconfs_Integer] := Transpose[{ RandomInteger[{1, maxIndex}, nconfs], Transpose[{ RandomInteger[{1, maxConfX}, nconfs], RandomInteger[{1, maxConfY}, nconfs] }]}]; </code></pre> <p>We can generate a small sample to illustrate:</p> <pre><code>In[3]:= sample = generateConfigurations[2,2,2,10] Out[3]= {{2,{2,1}},{2,{1,1}},{1,{2,1}},{1,{1,2}},{1,{1,2}}, {1,{2,1}},{2,{1,2}},{2,{2,2}},{1,{2,2}},{1,{2,1}}} </code></pre> <p>We have here only 2 indices, and configurations where both "x" and "y" numbers vary from 1 to 2 only - 10 such configurations.</p> <p>The following function will help us imitate the accumulation of frequencies for configurations, as we increment <code>SubValues</code>-based counters for repeatedly occurring ones:</p> <pre><code>Clear[testAccumulate]; testAccumulate[ff_Symbol, data_] := Module[{}, ClearAll[ff]; ff[_][_] = 0; Do[ doSomeStuff; ff[#1][#2]++ & @@ elem; doSomeMoreStaff; , {elem, data}]]; </code></pre> <p>The <code>doSomeStuff</code> and <code>doSomeMoreStaff</code> symbols are here to represent some code that might preclude or follow the counting code. The <code>data</code> parameter is supposed to be a list of the form produced by <code>generateConfigurations</code>. For example:</p> <pre><code>In[6]:= testAccumulate[ff,sample]; SubValues[ff] Out[7]= {HoldPattern[ff[1][{1,2}]]:>2,HoldPattern[ff[1][{2,1}]]:>3, HoldPattern[ff[1][{2,2}]]:>1,HoldPattern[ff[2][{1,1}]]:>1, HoldPattern[ff[2][{1,2}]]:>1,HoldPattern[ff[2][{2,1}]]:>1, HoldPattern[ff[2][{2,2}]]:>1,HoldPattern[ff[_][_]]:>0} </code></pre> <p>The following function will extract the resulting data (indices, configurations and their frequencies) from the list of <code>SubValues</code>:</p> <pre><code>Clear[getResultingData]; getResultingData[f_Symbol] := Transpose[{#[[All, 1, 1, 0, 1]], #[[All, 1, 1, 1]], #[[All, 2]]}] &@ Most@SubValues[f, Sort -> False]; </code></pre> <p>For example:</p> <pre><code>In[10]:= result = getResultingData[ff] Out[10]= {{2,{2,1},1},{2,{1,1},1},{1,{2,1},3},{1,{1,2},2},{2,{1,2},1}, {2,{2,2},1},{1,{2,2},1}} </code></pre> <p>To finish with the data-processing cycle, here is a straightforward function to extract data for a fixed index, based on <code>Select</code>:</p> <pre><code>Clear[getResultsForFixedIndex]; getResultsForFixedIndex[data_, index_] := If[# === {}, {}, Transpose[#]] &[ Select[data, First@# == index &][[All, {2, 3}]]]; </code></pre> <p>For our test example,</p> <pre><code>In[13]:= getResultsForFixedIndex[result,1] Out[13]= {{{2,1},{1,2},{2,2}},{3,2,1}} </code></pre> <p>This is presumably close to what @zorank tried, in code. </p> <h3>A faster solution based on packed arrays and sparse arrays</h3> <p>As @zorank noted, this becomes slow for larger sample with more indices and configurations. We will now generate a large sample to illustrate that (<strong>note! This requires about 4-5 Gb of RAM, so you may want to reduce the number of configurations if this exceeds the available RAM</strong>):</p> <pre><code>In[14]:= largeSample = generateConfigurations[20,500,500,5000000]; testAccumulate[ff,largeSample];//Timing Out[15]= {31.89,Null} </code></pre> <p>We will now extract the full data from the <code>SubValues</code> of <code>ff</code>:</p> <pre><code>In[16]:= (largeres = getResultingData[ff]); // Timing Out[16]= {10.844, Null} </code></pre> <p>This takes some time, but one has to do this only once. But when we start extracting data for a fixed index, we see that it is quite slow:</p> <pre><code>In[24]:= getResultsForFixedIndex[largeres,10]//Short//Timing Out[24]= {2.687,{{{196,26},{53,36},{360,43},{104,144},<<157674>>,{31,305},{240,291}, {256,38},{352,469}},{<<1>>}}} </code></pre> <p>The main idea we will use here to speed it up is to pack individual lists inside the <code>largeres</code>, those for indices, combinations and frequencies. While the full list can not be packed, those parts individually can:</p> <pre><code>In[18]:= Timing[ subIndicesPacked = Developer`ToPackedArray[largeres[[All,1]]]; subCombsPacked = Developer`ToPackedArray[largeres[[All,2]]]; subFreqsPacked = Developer`ToPackedArray[largeres[[All,3]]]; ] Out[18]= {1.672,Null} </code></pre> <p>This also takes some time, but it is a one-time operation again.</p> <p>The following functions will then be used to extract the results for a fixed index much more efficiently:</p> <pre><code>Clear[extractPositionFromSparseArray]; extractPositionFromSparseArray[HoldPattern[SparseArray[u___]]] := {u}[[4, 2, 2]] Clear[getCombinationsAndFrequenciesForIndex]; getCombinationsAndFrequenciesForIndex[packedIndices_, packedCombs_, packedFreqs_, index_Integer] := With[{positions = extractPositionFromSparseArray[ SparseArray[1 - Unitize[packedIndices - index]]]}, {Extract[packedCombs, positions],Extract[packedFreqs, positions]}]; </code></pre> <p>Now, we have:</p> <pre><code>In[25]:= getCombinationsAndFrequenciesForIndex[subIndicesPacked,subCombsPacked,subFreqsPacked,10] //Short//Timing Out[25]= {0.094,{{{196,26},{53,36},{360,43},{104,144},<<157674>>,{31,305},{240,291}, {256,38},{352,469}},{<<1>>}}} </code></pre> <p>We get a 30 times speed-up w.r.t. the naive <code>Select</code> approach.</p> <h3>Some notes on complexity</h3> <p>Note that the second solution is faster because it uses optimized data structures, but its complexity is the same as that of <code>Select</code>- based one, which is, linear in the length of total list of unique combinations for all indices. Therefore, in theory, the previously - discussed solutions based on nested hash-table etc <em>may</em> be asymptotically better. The problem is, that in practice we will probably hit the memory limitations long before that. For the 10 million configurations sample, the above code was still 2-3 times faster than the fastest solution I posted before. </p> <p><strong>EDIT</strong></p> <p>The following modification:</p> <pre><code>Clear[getCombinationsAndFrequenciesForIndex]; getCombinationsAndFrequenciesForIndex[packedIndices_, packedCombs_, packedFreqs_, index_Integer] := With[{positions = extractPositionFromSparseArray[ SparseArray[Unitize[packedIndices - index], Automatic, 1]]}, {Extract[packedCombs, positions], Extract[packedFreqs, positions]}]; </code></pre> <p>makes the code twice faster still. Moreover, for more sparse indices (say, calling the sample-generation function with parameters like <code>generateConfigurations[2000, 500, 500, 5000000]</code> ), the speed-up with respect to the <code>Select</code>- based function is about <em>100</em> times. </p>

Querying!

Guidance

An individual column

Larger individual text columns get their own page to allow for proper reading.

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload