I've tackled a few problems that are something like this.

First, if you don't need an exact answer, I'd generally suggest looking into something like a genetic algorithm, or a greedy algorithm. It won't be right, but it generally won't be bad either, and it will be much faster than an exact algorithm. For instance, you can start with all points as cache points, then find the point that reduces your cost the most when you make it a non-caching point. Continue until removing the next one makes the cost go up, and use that as your solution. This won't be the best, but it will generally be reasonably good.
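For concreteness, here is a minimal sketch of that greedy removal loop in Python. The `total_cost(cache_points)` callback is a hypothetical stand-in for whatever your actual cost calculation is:

```python
def greedy_cache_points(all_points, total_cost):
    """Greedy descent: start with every point caching, then repeatedly
    drop the point whose removal lowers the cost the most, stopping
    when no removal helps. total_cost is a hypothetical callback that
    implements your real cost model."""
    cache = set(all_points)
    best = total_cost(cache)
    while len(cache) > 1:
        # Find the single removal that reduces the cost the most.
        candidate, candidate_cost = None, best
        for p in cache:
            c = total_cost(cache - {p})
            if c < candidate_cost:
                candidate, candidate_cost = p, c
        if candidate is None:
            break  # every removal makes things worse; stop here
        cache.remove(candidate)
        best = candidate_cost
    return cache, best
```

On a problem small enough that the exact search space is 2^24 subsets, the quadratic rescanning in this loop is negligible.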
If you *do* need an exact answer, you will need to brute force search a lot of data. Assuming that the initial cache point is specified, you'll have 2^24 = 16,777,216 possible sets of cache points to search. That is expensive.

The trick to doing it more cheaply (note: not cheaply, just more cheaply) is finding ways to prune your search. Take to heart the fact that if doing 100 times as much work on each set you look at lets you remove an average of 10 points from consideration as cache points, then your overall algorithm will visit only 0.1% as many sets, and your code will run 10 times faster. Therefore it is worth putting a surprising amount of energy into pruning early and often, even if the pruning step is fairly expensive.

Often you want multiple pruning strategies. One of them is usually "the best we can do from here is worse than the best we have found previously." This works better if you've already found a pretty good solution, so it is often worth a bit of effort to do some local optimization in your search for solutions.

Typically these optimizations won't change the fact that you are doing a tremendous amount of work. But they do let you do orders of magnitude less work.

My initial try at this would take advantage of the following observations.

1. Suppose that `x` is a cache point, and `y` is its nearest caching neighbor. Then you can always make some path from `x` to `y` cache "for free" if you just route the cache update traffic from `x` to `y` along that path. Therefore, without loss of generality, the set of cache points is connected on the grid.
2. If the minimum cost we could wind up with exceeds the current best cost we have found, we are not on our way to a global solution.
3. As soon as the sum of the access rates from all points at distance greater than 1 from the cache points, plus the highest access frequency of a neighbor of the cache points that you can still use, is less than the cache frequency (the cost `C` in the pseudo-code below), adding more cache points is always going to be a loss. (This would be an "expensive condition that lets us stop 10 minutes early.")
4. The highest-access neighbor of the current set of cache points is a reasonable candidate for the next cache point to try. (There are several other heuristics you can try, but this one is reasonable.)
5. Any point whose total access frequency exceeds the cache frequency absolutely must be a caching point.

This might not be the best set of observations to use, but it is likely to be pretty reasonable.
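Observations 2 and 5 are cheap to turn into code. In the sketch below, `freq` (a dict mapping each point to its total access frequency) and the cache cost `C` are assumed names, and the bound is the same `best_possible_cost` formula used in the pseudo-code further down:

```python
def forced_cache_points(freq, C):
    """Observation 5: any point accessed more often than the cache
    cost C must be a caching point in an optimal solution."""
    return {p for p, f in freq.items() if f > C}

def lower_bound(cache, freq, C):
    """Observation 2: an optimistic lower bound on any completion of
    the current partial solution. Each cache node costs C; each other
    node costs at least min(its accesses, C), however it is decided."""
    return C * len(cache) + sum(min(f, C)
                                for p, f in freq.items() if p not in cache)

# If lower_bound(cache, freq, C) >= best_cost_found, prune this branch.
```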
To take advantage of these observations you'll need at least one data structure you might not be familiar with. If you don't know what a [priority queue](http://en.wikipedia.org/wiki/Priority_queue) is, then look around for an efficient one in your language of choice. If you can't find one, a [heap](http://en.wikipedia.org/wiki/Heap_%28data_structure%29) is pretty easy to implement and works pretty well as a priority queue.
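In Python, for example, the standard `heapq` module works fine for this. It is a min-heap, so a common trick is to store negated access counts so that the busiest neighbor pops first (the frequencies below are made-up numbers):

```python
import heapq

neighbors = []
heapq.heappush(neighbors, (-17, (2, 3)))  # point (2, 3): 17 accesses
heapq.heappush(neighbors, (-42, (3, 2)))  # point (3, 2): 42 accesses
heapq.heappush(neighbors, (-5, (4, 3)))   # point (4, 3):  5 accesses

neg_freq, node = heapq.heappop(neighbors)
print(node, -neg_freq)  # (3, 2) 42 -- the busiest neighbor comes out first
```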
With that in mind, assuming that you have been given the information you've described and an initial cache node `P`, here is pseudo-code for an algorithm to find the best solution:

```
# Data structures to be dynamically maintained:
# AT[x, n] - how many accesses x needs that currently need to go distance n.
# D[x]     - The distance from x to the nearest cache node.
# CA[x]    - Boolean yes/no for whether x is a cache node.
# B[x]     - Boolean yes/no for whether x is blocked from being a cache node.
# cost     - Current cost
# distant_accesses - The sum of the total number of accesses made from more
#                    than distance 1 from the cache nodes.
# best_possible_cost - C * nodes_in_cache
#                      + sum(min(total accesses, C) for non-cache nodes)
# *** Sufficient data structures to be able to unwind changes to all of the
#     above before returning from recursive calls (I won't specify accesses
#     to them, but they need to be there)
# best_cost     - The best cost found.
# best_solution - The best solution found.

initialize all of those data structures (including best)
create neighbors priority queue of neighbors of root cache node
    (ordered by accesses)
call extend_current_solution(neighbors)
do what we want with the best solution

function extend_current_solution (available_neighbors):
    if cost < best_cost:
        best_cost = cost
        best_solution = CA   # current set of cache nodes
    if best_cost < best_possible_cost:
        return               # pruning time
    neighbors = clone(available_neighbors)
    while neighbors:
        node = remove best from neighbors
        if distant_accesses + accesses(node) < C:
            return           # this is condition 3 above
        make node a cache node:
            - add it to CA
            - update costs
            - add its immediate neighbors to neighbors
        call extend_current_solution(neighbors)
        unwind changes just made
        make node a blocked node
        call extend_current_solution(neighbors)
        unwind changes to blocked set
    return
```

It will take a lot of work to write this, and you'll need to be careful to maintain all of the data structures. But my bet is that, despite how heavyweight it looks, you'll find that it prunes your search space enough to run more quickly than your existing solution. (It still won't be snappy.)

Good luck!

# Update

When I thought about this more, I realized that a better observation is to note that if you can cut the "not a cache node, not a blocked node" set into two pieces, then you can solve those pieces independently. Each of those subproblems is orders of magnitude faster to solve than the whole problem, so seek to do so as fast as possible.
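Detecting when the undecided region has actually split is straightforward. Here is a hedged sketch, assuming the undecided cells are kept as a set of `(x, y)` tuples with 4-way adjacency; a plain flood fill separates them into connected components, and more than one component means you can recurse on each piece independently:

```python
from collections import deque

def undecided_components(undecided):
    """Flood-fill the 'not a cache node, not a blocked node' cells
    (a set of (x, y) tuples, 4-connected) into connected components."""
    remaining = set(undecided)
    components = []
    while remaining:
        seed = remaining.pop()
        component, queue = {seed}, deque([seed])
        while queue:
            x, y = queue.popleft()
            for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                if nxt in remaining:
                    remaining.discard(nxt)
                    component.add(nxt)
                    queue.append(nxt)
        components.append(component)
    return components

# len(undecided_components(undecided)) > 1 means each piece can be
# solved on its own and the answers combined.
```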
A good heuristic for steering the search toward such a split is to follow these rules:

1. While no edge has been reached:
   1. Drive towards the closest edge. Distance is measured by how short the shortest path is along the non-cache, non-blocked set.
   2. If two edges are equidistant, break ties according to the following preference order: `(1, x), (x, 1), (5, x), (x, 5)`.
   3. Break any remaining ties by preferring to drive towards the center of an edge.
   4. Break any remaining ties randomly.
2. While an edge has been reached and your component still has edges that could become cache pieces:
   1. If you can immediately move into an edge and split the edge pieces into two components, do so. Both for "edge in cache set" and "edge not in cache set" you'll get two independent subproblems that are more tractable.
   2. Else move on a shortest path towards the piece in the middle of your section of edge pieces.
   3. If there is a tie, break it in favor of whatever makes the line from the added piece to the added cache element as close to diagonal as possible.
   4. Break any remaining ties randomly.
3. If you fall through here, choose randomly. (You should have a pretty small subproblem at this point. No need to be clever.)

If you try this starting out with `(3, 3)` as a cache point, you'll find that in the first few decisions 7/16 of the time you manage to cut the problem into two even pieces, 1/16 of the time you box in the cache point and finish, 1/4 of the time you manage to cut out a 2x2 block into a separate problem (making the overall solution run 16 times faster for that piece), and 1/4 of the time you wind up well on your way towards a solution that is either going to be boxed in (and quickly exhausted) or else a candidate with a lot of cache points that gets pruned for being on track to a bad solution.

I won't give pseudo-code for this variation. It will have a lot of similarities to what I had above, with a number of important details to handle. But I would be willing to bet money that in practice it will run orders of magnitude faster than your original solution.