
Splitting a set of objects into several subsets according to a certain evaluation
<p>Suppose I have a set of objects, <code>S</code>. There is an algorithm <code>f</code> that, given a set <code>S</code>, builds a certain data structure <code>D</code> on it: <code>f(S) = D</code>. If <code>S</code> is large and/or contains vastly different objects, <code>D</code> becomes large, to the point of being unusable (i.e. not fitting in the allotted memory). To overcome this, I split <code>S</code> into several non-intersecting subsets, <code>S = S1 + S2 + ... + Sn</code>, and build <code>Di</code> for each subset. Using <code>n</code> structures is less efficient than using one, but at least this way I can fit into the memory constraints. Since the size of <code>f(S)</code> grows faster than <code>S</code> itself, the combined size of the <code>Di</code> is much less than the size of <code>D</code>.</p>

<p>However, it is still desirable to reduce <code>n</code> (the number of subsets), or to reduce the combined size of the <code>Di</code>. For this, I need to split <code>S</code> in such a way that each <code>Si</code> contains "similar" objects, because then <code>f</code> will produce a smaller output structure if the input objects are "similar enough" to each other.</p>

<p>The problem is that while the "similarity" of objects in <code>S</code> and the size of <code>f(S)</code> do correlate, there is no way to compute the latter other than by evaluating <code>f(S)</code>, and <code>f</code> is not particularly fast.</p>

<p>The algorithm I currently have is to iteratively add each next object from <code>S</code> into one of the <code>Si</code>, choosing the subset that results in the least possible (at this stage) increase in the combined <code>Di</code> size:</p>

<pre><code>for x in S:
    i = such i that size(f(Si + {x})) - size(f(Si)) is min
    Si = Si + {x}
</code></pre>

<p>This gives practically useful results, but they are certainly pretty far from the optimum (i.e. the minimal possible combined size). It is also <em>slow</em>. To speed it up somewhat, I compute <code>size(f(Si + {x})) - size(f(Si))</code> only for those <code>i</code> where <code>x</code> is "similar enough" to the objects already in <code>Si</code>.</p>

<p>Is there any standard approach to this kind of problem?</p>

<p>I know of the branch and bound algorithm family, but it cannot be applied here because it would be prohibitively slow. My guess is that it is simply not possible to compute the optimal distribution of <code>S</code> into the <code>Si</code> in reasonable time. But is there some common iteratively improving algorithm?</p>

<p><strong>EDIT:</strong></p>

<p>As the comments noted, I never defined "similarity". In fact, all I want is to split <code>S</code> into subsets <code>Si</code> such that the combined size of <code>Di = f(Si)</code> is minimal, or at least small enough. "Similarity" is defined only in these terms and unfortunately cannot be computed easily. I do have a simple approximation, but it is only that: an approximation.</p>

<p>So, what I need is a (likely heuristic) algorithm that minimizes <code>sum f(Si)</code>, given that there is <em>no</em> simple way to compute the latter, only approximations that I use to throw away cases that are very unlikely to give good results.</p>
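<p>For concreteness, here is a minimal runnable sketch of the greedy loop described above. The names <code>greedy_partition</code> and <code>build_size</code> are hypothetical: <code>build_size(subset)</code> stands in for <code>size(f(subset))</code>, and in practice a cheap approximation of it would be plugged in, since the real <code>f</code> is expensive.</p>

<pre><code>def greedy_partition(S, n, build_size):
    """Greedily assign each object to the subset whose structure
    grows the least, as in the pseudocode above.

    build_size(subset) stands in for size(f(subset)); the real f is
    expensive, so a cheap approximation would normally go here.
    """
    subsets = [[] for _ in range(n)]
    sizes = [build_size([]) for _ in range(n)]  # current size(f(Si))
    for x in S:
        # size increase caused by adding x to each candidate subset
        deltas = [build_size(sub + [x]) - sz
                  for sub, sz in zip(subsets, sizes)]
        i = min(range(n), key=deltas.__getitem__)  # least increase wins
        subsets[i].append(x)
        sizes[i] += deltas[i]
    return subsets
</code></pre>

<p>With a toy cost such as <code>build_size = lambda sub: len(set(sub)) ** 2</code>, the loop does pack "similar" (here: equal) objects into the same subset, which illustrates why the greedy rule helps even though it gives no optimality guarantee.</p>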