<h1>Speed up sampling of kernel estimate</h1>
<p>Here's a <code>MWE</code> of a much larger code I'm using. Basically, it performs a Monte Carlo integration over a KDE (<a href="http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gaussian_kde.html" rel="nofollow noreferrer">kernel density estimate</a>) for all values located below a certain threshold (the integration method was suggested at this question, BTW: <a href="https://stackoverflow.com/questions/17822282/integrate-2d-kernel-density-estimate?rq=1">Integrate 2D kernel density estimate</a>).</p>

<pre><code>import numpy as np
from scipy import stats
import time

# Generate some random two-dimensional data:
def measure(n):
    "Measurement model, return two coupled measurements."
    m1 = np.random.normal(size=n)
    m2 = np.random.normal(scale=0.5, size=n)
    return m1+m2, m1-m2

# Get data.
m1, m2 = measure(20000)

# Define limits.
xmin = m1.min()
xmax = m1.max()
ymin = m2.min()
ymax = m2.max()

# Perform a kernel density estimate on the data.
x, y = np.mgrid[xmin:xmax:100j, ymin:ymax:100j]
values = np.vstack([m1, m2])
kernel = stats.gaussian_kde(values)

# Define point below which to integrate the kernel.
x1, y1 = 0.5, 0.5

# Get kernel value for this point.
tik = time.time()
iso = kernel((x1, y1))
print 'iso: ', time.time()-tik

# Sample from KDE distribution (Monte Carlo process).
tik = time.time()
sample = kernel.resample(size=1000)
print 'resample: ', time.time()-tik

# Filter the sample, leaving only values for which
# the kernel evaluates to less than what it does for
# the (x1, y1) point defined above.
tik = time.time()
insample = kernel(sample) &lt; iso
print 'filter/sample: ', time.time()-tik

# Integrate for all values below iso.
tik = time.time()
integral = insample.sum() / float(insample.shape[0])
print 'integral: ', time.time()-tik
</code></pre>

<p>The output looks something like this:</p>

<pre><code>iso:  0.00259208679199
resample:  0.000817060470581
filter/sample:  2.10829401016
integral:  4.2200088501e-05
</code></pre>

<p>which clearly shows that the <em>filter/sample</em> call consumes almost all of the code's run time. I have to run this block of code iteratively several thousand times, so it can get pretty time consuming.</p>

<p>Is there any way to speed up the filtering/sampling process?</p>

<hr>

<h2>Add</h2>

<p>Here's a slightly more realistic <code>MWE</code> of my actual code, with Ophion's multi-threading solution written into it:</p>

<pre><code>import numpy as np
from scipy import stats
from multiprocessing import Pool

def kde_integration(m_list):

    m1, m2 = [], []
    for item in m_list:
        # Color data.
        m1.append(item[0])
        # Magnitude data.
        m2.append(item[1])

    # Define limits.
    xmin, xmax = min(m1), max(m1)
    ymin, ymax = min(m2), max(m2)

    # Perform a kernel density estimate on the data:
    x, y = np.mgrid[xmin:xmax:100j, ymin:ymax:100j]
    values = np.vstack([m1, m2])
    kernel = stats.gaussian_kde(values)

    out_list = []
    for point in m_list:

        # Compute the point below which to integrate.
        iso = kernel((point[0], point[1]))

        # Sample KDE distribution.
        sample = kernel.resample(size=1000)

        # Create definition.
        def calc_kernel(samp):
            return kernel(samp)

        # Choose number of cores and split input array.
        cores = 4
        torun = np.array_split(sample, cores, axis=1)

        # Calculate.
        pool = Pool(processes=cores)
        results = pool.map(calc_kernel, torun)

        # Reintegrate and calculate results.
        insample_mp = np.concatenate(results) &lt; iso

        # Integrate for all values below iso.
        integral = insample_mp.sum() / float(insample_mp.shape[0])
        out_list.append(integral)

    return out_list

# Generate some random two-dimensional data:
def measure(n):
    "Measurement model, return two coupled measurements."
    m1 = np.random.normal(size=n)
    m2 = np.random.normal(scale=0.5, size=n)
    return m1+m2, m1-m2

# Create list to pass.
m_list = []
for i in range(60):
    m1, m2 = measure(5)
    m_list.append(m1.tolist())
    m_list.append(m2.tolist())

# Call KDE integration function.
print 'Integral result: ', kde_integration(m_list)
</code></pre>

<p>The solution presented by <em>Ophion</em> works great on the original code I presented, but fails with the following error in this version:</p>

<pre><code>Integral result: Exception in thread Thread-3:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 504, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 319, in _handle_tasks
    put(task)
PicklingError: Can't pickle &lt;type 'function'&gt;: attribute lookup __builtin__.function failed
</code></pre>

<p>I tried moving the <code>calc_kernel</code> function around, since one of the answers to this question, <a href="https://stackoverflow.com/questions/3288595/multiprocessing-using-pool-map-on-a-function-defined-in-a-class">Multiprocessing: How to use Pool.map on a function defined in a class?</a>, states that <em>"the function that you give to map() must be accessible through an import of your module"</em>; but I still can't get this code to work.</p>

<p>Any help will be very much appreciated.</p>

<hr>

<h2>Add 2</h2>

<p>Implementing <em>Ophion's</em> suggestion to remove the <code>calc_kernel</code> function and simply using:</p>

<pre><code>results = pool.map(kernel, torun)
</code></pre>

<p>works to get rid of the <code>PicklingError</code>, but now I see that if I create an initial <code>m_list</code> of anything more than around 62-63 items, I get this error:</p>

<pre><code>Traceback (most recent call last):
  File "~/gauss_kde_temp.py", line 67, in &lt;module&gt;
    print 'Integral result: ', kde_integration(m_list)
  File "~/gauss_kde_temp.py", line 38, in kde_integration
    pool = Pool(processes=cores)
  File "/usr/lib/python2.7/multiprocessing/__init__.py", line 232, in Pool
    return Pool(processes, initializer, initargs, maxtasksperchild)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 161, in __init__
    self._result_handler.start()
  File "/usr/lib/python2.7/threading.py", line 494, in start
    _start_new_thread(self.__bootstrap, ())
thread.error: can't start new thread
</code></pre>

<p>Since the actual list in my real implementation of this code can have up to 2000 items, this issue renders the code unusable. Line <code>38</code> is this one:</p>

<pre><code>pool = Pool(processes=cores)
</code></pre>

<p>so apparently it has something to do with the number of cores I'm using?</p>

<p>This question, <a href="https://stackoverflow.com/questions/10589766/cant-start-a-new-thread-error-in-python">&quot;Can&#39;t start a new thread error&quot; in Python</a>, suggests using:</p>

<pre><code>threading.active_count()
</code></pre>

<p>to check the number of threads I have going when I get that error. I checked, and it always crashes when it reaches <code>374</code> threads. How can I code around this issue?</p>

<hr>

<p>Here's the new question dealing with this last issue: <a href="https://stackoverflow.com/questions/18602236/thread-error-cant-start-new-thread">Thread error: can&#39;t start new thread</a></p>
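<p><em>Note:</em> for reference, the Monte Carlo estimator in the question (the fraction of resampled points whose density falls below <code>iso</code>) can be sketched with plain NumPy. This is <strong>not</strong> scipy's <code>gaussian_kde</code>, which uses the full data covariance and Scott's bandwidth factor; the isotropic bandwidth <code>bw</code> below is a simplifying assumption, used only to illustrate that one broadcasted evaluation over all sample points avoids any Python-level loop:</p>

```python
import numpy as np

def kde_eval(points, data, bw):
    """Vectorized 2-D Gaussian KDE with an isotropic bandwidth bw.

    points: (2, m) locations to evaluate; data: (2, n) observations.
    A single broadcasted pass over all m points replaces a per-point loop.
    """
    diff = points[:, :, None] - data[:, None, :]      # (2, m, n)
    sq = (diff ** 2).sum(axis=0) / (2.0 * bw ** 2)    # (m, n)
    norm = 2.0 * np.pi * bw ** 2 * data.shape[1]
    return np.exp(-sq).sum(axis=1) / norm             # (m,)

# Monte Carlo integral: fraction of sampled points with density < iso.
rng = np.random.default_rng(0)
data = rng.normal(size=(2, 500))       # stand-in for the (m1, m2) data
sample = rng.normal(size=(2, 1000))    # stand-in for kernel.resample()
iso = kde_eval(np.array([[0.5], [0.5]]), data, 0.25)[0]
integral = (kde_eval(sample, data, 0.25) < iso).mean()
```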
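<p><em>Note:</em> the <code>PicklingError</code> above arises because <code>multiprocessing</code> serializes the callable with <code>pickle</code>, and <code>pickle</code> stores functions by their importable module-level name; a function defined inside another function (like <code>calc_kernel</code> inside <code>kde_integration</code>) has no such name. A minimal sketch of the difference (Python 3 syntax shown; the mechanism is the same in 2.7):</p>

```python
import pickle

def kernel_eval(samp):
    # Defined at module top level: pickle records it by name,
    # so a worker process can look it up again.
    return samp * 2

def make_nested():
    def calc_kernel(samp):  # defined inside another function
        return samp * 2
    return calc_kernel

pickle.dumps(kernel_eval)   # works: stored as a module-level reference

try:
    pickle.dumps(make_nested())
except (pickle.PicklingError, AttributeError) as err:
    # Python 2.7 raises PicklingError here; Python 3 reports
    # "Can't pickle local object ...". Either way, Pool.map cannot
    # ship this function to the workers.
    print('not picklable:', err)
```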
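<p><em>Note:</em> on the <code>can't start new thread</code> crash: every <code>Pool(processes=cores)</code> starts worker processes plus a few handler threads, and the loop in <code>kde_integration</code> builds a brand-new <code>Pool</code> for each point in <code>m_list</code> without ever calling <code>close()</code>/<code>join()</code>, so those threads accumulate until the OS limit is reached. A sketch of the usual fix, creating one pool outside the loop and releasing it when done; the <code>evaluate</code> helper here is a hypothetical stand-in for the real <code>kernel</code> call:</p>

```python
from multiprocessing import Pool

def evaluate(chunk):
    # Hypothetical stand-in for kernel(chunk): any picklable,
    # module-level callable works with Pool.map.
    return sum(chunk)

def kde_integration(m_list, cores=4):
    out_list = []
    # One Pool for the whole run. Building a fresh Pool on every loop
    # iteration without close()/join() leaks its handler threads, which
    # is what eventually triggers "thread.error: can't start new thread".
    pool = Pool(processes=cores)
    try:
        for point in m_list:
            chunks = [point[i::cores] for i in range(cores)]
            results = pool.map(evaluate, chunks)
            out_list.append(sum(results))
    finally:
        pool.close()  # stop accepting new work
        pool.join()   # wait for the workers to shut down
    return out_list
```

<p>In Python 3 the same lifecycle can be written as <code>with Pool(processes=cores) as pool: ...</code>, which terminates the pool on exit.</p>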