
Insufficiently tested, use at own risk.

```python
import numpy

a = numpy.random.random(100)
# a_by_a[i, j] = a[j] > a[i]
a_by_a = a[numpy.newaxis, :] > a[:, numpy.newaxis]
# taking the upper triangle ignores all cases where j < i
a_by_a = numpy.triu(a_by_a)
# argmax gives the first index of the row maximum, i.e. the first True (1);
# note that a row with no larger value to the right reports 0, not -1
print(numpy.argmax(a_by_a, axis=1))
```

Lower memory version:

```python
import numpy

a = numpy.random.random(100)

@numpy.vectorize
def values(i):
    try:
        return (a[i:] > a[i]).nonzero()[0][0] + i
    except IndexError:
        return -1  # no valid values found

b = values(numpy.arange(100))
```

Faster version:

```python
import numpy as np

def future7(A):
    @np.vectorize
    def values(i):
        try:
            return next(idx for idx, value in enumerate(A[i+1:]) if value > A[i]) + i + 1
        except StopIteration:
            return -1  # no valid values found

    return values(np.arange(len(A)))
```

Even faster version:

```python
import numpy as np

def future6(A):
    # known: list of (index into A, value in A) tuples
    # invariant: indexes and values are both in decreasing order
    known = []
    result = []
    for idx in range(len(A) - 1, -1, -1):
        value = A[idx]
        # since known is sorted, a binary search could be applied here;
        # I haven't bothered
        # anything not above the current value cannot possibly be used again,
        # since this index would be the first larger index instead of those
        known = [(x, y) for x, y in known if y > value]
        if known:
            # all values in known are > the current value and are
            # reverse sorted by index, so the lowest index is last
            result.append(known[-1][0])
        else:
            # no value this high exists to the right, report -1
            result.append(-1)
        # append to the end of the list to maintain the invariant
        known.append((idx, value))
    # let numpy worry about reversing the array
    return np.array(result)[::-1]
```

Credit to cyborg for some of the ideas used here.

**Algorithm Difference**

cyborg showed significant differences between the various algorithms depending on the data fed into them. I gathered some stats from the running algorithms to see if I could figure out what is going on.

*Random data:*

```
Average distance between value and its target: 9
Average length of ascends list: 24
Average length of segment in ascends list: 1.33
Average length of known list: 9.1
```

Since the lists are really short, the ascends algorithm mostly decays to a linear search. It does clear out ascends which cannot be used in the future, so it is still noticeably better than a linear search.

*Oscillating:*

```
Average distance between value and its target: 31.46
Average length of ascends list: 84
Average length of segment in ascends list: 1.70
Average length of known list: 57.98
```

The oscillations tend to put the matching pieces further apart, which naturally hampers the linear search algorithm. Both of the "smarter" algorithms have to keep track of additional data. My algorithm decays considerably since it scans over that data each time. The ascends algorithm touches less of its data and does better.

*Ascending:*

```
Average distance between value and its target: 2.57
Average length of ascends list: 40
Average length of segment in ascends list: 3.27
Average length of known list: 3037.97
```

It is clear why my algorithm has issues: it has to track a huge number of ascending values. The short distance between a value and its target explains the good performance of a linear search. The ascends algorithm is still not working with very long segments.
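Roughly, the kinds of series I mean look like the sketch below; the exact shapes and scales are illustrative guesses, not the generators actually benchmarked.

```python
import numpy as np

def random_series(n):
    # uniform noise: a larger value usually sits nearby
    return np.random.random(n)

def oscillating_series(n):
    # a sine wave plus a little noise: a matching value sits
    # roughly half a period away, pushing targets further out
    return np.sin(np.linspace(0, n / 10.0, n)) + 0.1 * np.random.random(n)

def ascending_series(n):
    # a rising trend plus noise: targets sit close by, but a right-to-left
    # scan discards almost nothing, so the known list grows very large
    return 0.01 * np.arange(n) + np.random.random(n)
```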
**Better algorithms**

There is no reason for my algorithm to do a linear search over the known data. It is sorted, and we just need to remove the too-small values from the end of the list.

```python
import numpy as np

def future6(A):
    # known: list of (index into A, value in A) tuples
    # invariant: indexes and values are both in decreasing order
    known = []
    result = []
    for idx in range(len(A) - 1, -1, -1):
        value = A[idx]
        # anything not above the current value cannot possibly be used again,
        # since this index would be the first larger index instead of those
        # (<= rather than < so that equal values are discarded too)
        while known and known[-1][1] <= value:
            known.pop()
        if known:
            # all values in known are > the current value and are
            # reverse sorted by index, so the lowest index is last
            result.append(known[-1][0])
        else:
            # no value this high exists to the right, report -1
            result.append(-1)
        # append to the end of the list to maintain the invariant
        known.append((idx, value))
    # let numpy worry about reversing the array
    return np.array(result)[::-1]
```

But it occurs to me that we can reuse previously calculated values of B instead of constructing new data structures. If j > i and A[i] > A[j], then B[i] >= B[j]: everything strictly between j and B[j] is at most A[j], and therefore smaller than A[i], so it can be skipped.

```python
import numpy as np

def future8(A):
    B = [-1] * len(A)
    for index in range(len(A) - 2, -1, -1):
        target = index + 1
        value = A[index]
        # hop along the chain of already-computed targets; everything between
        # target and B[target] is <= A[target], so it cannot top value either
        # (<= rather than < so that equal values are hopped over too)
        while target != -1 and A[target] <= value:
            target = B[target]
        B[index] = target
    return np.array(B)
```

My benchmark results:

```
Random series:
future2 ascends  : 0.242569923401
future6 full list: 0.0363488197327
future7 vectorize: 0.129994153976
future8 reuse    : 0.0299410820007

Oscillating series:
future2 ascends  : 0.233623981476
future6 full list: 0.0360488891602
future7 vectorize: 1.19140791893
future8 reuse    : 0.0297570228577

Ascending trend series:
future2 ascends  : 0.120707035065
future6 full list: 0.0314049720764
future7 vectorize: 0.0640320777893
future8 reuse    : 0.0246520042419
```

**Ascending Sections**

Cyborg had the very interesting idea of taking advantage of ascending segments. I don't think any of his test cases really exhibited the behaviour he was after; the sections were not long enough to take advantage of. But I figure real data may well have such sections, so handling them would be really helpful.

But I don't think it will work. It takes O(n) time to prepare the data needed for a binary search. That would be fine if we performed the binary search many times, but once we find a value in the middle of an ascending section, we never revisit anything to its left. As a result, even with a binary search we spend at most O(n) time processing the data.

It could only work if building the necessary data were less expensive than scanning over the ascending section later. But the scan is pretty cheap, and you will be hard pressed to come up with a way of handling ascending sections that costs less.
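As a sanity check (not part of the timings above), the fast versions can be compared against a brute-force baseline; `future_naive` is just an illustrative name for the obvious O(n^2) reference.

```python
import numpy as np

def future_naive(A):
    # brute force: for each i, scan right for the first strictly larger value
    B = np.full(len(A), -1, dtype=int)
    for i in range(len(A)):
        for j in range(i + 1, len(A)):
            if A[j] > A[i]:
                B[i] = j
                break
    return B

A = np.random.random(1000)
assert np.array_equal(future_naive(A), future6(A))
assert np.array_equal(future_naive(A), future8(A))
```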
 
