<p>While there are already some pretty reasonable ideas present here, I believe the following is worth mentioning.</p>

<p>Filling missing data with any default value would spoil the statistical characteristics (std, etc.). Evidently that's why Mapad proposed the nice trick of grouping same-sized records. The problem with it (assuming no a priori data on record lengths is at hand) is that it involves even more computation than the straightforward solution:</p>

<ol>
<li>at least <em>O(N*logN)</em> 'len' calls and comparisons for sorting with an efficient algorithm</li>
<li><em>O(N)</em> checks on the second pass through the list to obtain the groups (their beginning and end indices on the 'vertical' axis)</li>
</ol>

<p>Using Psyco is a good idea (it's strikingly easy to use, so be sure to give it a try).</p>

<p>It seems that the optimal way is to take the strategy described by Mapad in bullet #1, but with a modification: do not generate the whole list, but iterate through the dictionary, converting each row into a numpy.array and performing the required computations. Like this:</p>

<pre><code>for row in data.itervalues():
    np_row = numpy.array(row)
    this_row_std = numpy.std(np_row)
    # compute any other statistical descriptors needed
    # and then save them to some list
</code></pre>

<p>In any case, a few million loops in Python won't take as long as one might expect. Besides, this doesn't look like a routine computation, so who cares if it takes an extra second/minute when it is run once in a while, or even just once.</p>

<hr>

<p>A generalized variant of what was suggested by Mapad:</p>

<pre><code>from numpy import array, mean, std

def get_statistical_descriptors(a):
    # apply each descriptor along the last axis (per row for a 2-D array)
    ax = len(a.shape) - 1
    functions = [mean, std]
    return [f(a, axis=ax) for f in functions]

def process_long_list_stats(data):
    # group row keys by row length
    groups = {}
    for key, row in data.iteritems():
        size = len(row)
        try:
            groups[size].append(key)
        except KeyError:
            groups[size] = [key]
    # compute descriptors vectorized, one same-sized group at a time
    results = []
    for gr_keys in groups.itervalues():
        gr_rows = array([data[k] for k in gr_keys])
        stats = get_statistical_descriptors(gr_rows)
        results.extend(zip(gr_keys, zip(*stats)))
    return dict(results)
</code></pre>
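The claim that padding with a default value spoils the statistics can be checked with a quick sketch (the sample values below are hypothetical, and the padding default of 0 is just one common choice):

```python
import numpy as np

row = np.array([4.0, 5.0, 6.0])
# the same record padded to length 5 with a default value of 0
padded = np.array([4.0, 5.0, 6.0, 0.0, 0.0])

true_mean, true_std = row.mean(), row.std()
pad_mean, pad_std = padded.mean(), padded.std()
# both the mean and the std of the padded row differ from the true values,
# so any per-row statistics computed on the padded array would be wrong
```

Masked arrays (`numpy.ma`) are the usual escape hatch when padding is unavoidable, since masked entries are excluded from the statistics.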
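For readers on Python 3 (where <code>iteritems</code>/<code>itervalues</code> are gone), the same group-by-length idea can be sketched as follows; the function name and the sample <code>data</code> dictionary are my own illustration, not from the answer above:

```python
import numpy as np

def stats_by_length(data):
    """Group rows by length, then compute mean/std per group vectorized.

    Rows of equal length are stacked into one 2-D array so that
    numpy can compute the per-row statistics in a single call.
    """
    groups = {}
    for key, row in data.items():
        groups.setdefault(len(row), []).append(key)

    results = {}
    for keys in groups.values():
        arr = np.array([data[k] for k in keys], dtype=float)
        means = arr.mean(axis=1)  # one mean per row
        stds = arr.std(axis=1)    # population std (numpy's default, ddof=0)
        for k, m, s in zip(keys, means, stds):
            results[k] = (float(m), float(s))
    return results

# hypothetical sample data: variable-length records keyed by id
data = {"a": [1, 2, 3], "b": [4, 5, 6], "c": [10, 20]}
out = stats_by_length(data)
```

`dict.setdefault` replaces the `try`/`except KeyError` pattern of the original; both are idiomatic ways to build the length groups.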