Read multiple HDF5 files in Python using multiprocessing
<p>I'm trying to read a bunch of HDF5 files ("a bunch" meaning N > 1000 files) using <code>PyTables</code> and <code>multiprocessing</code>. Basically, I created a class that reads my data and stores it in RAM; it works perfectly fine sequentially, and I'd like to parallelize it to gain some performance.</p>
<p>I tried a naive approach for now, adding a new method <code>flatten()</code> to my class to parallelize the file reading. The following is a simplified example of what I'm trying to do. <code>listf</code> is a list of strings containing the names of the files to read; <code>nx</code> and <code>ny</code> are the dimensions of the array I want to read from each file:</p>
<pre><code>import numpy as np
import multiprocessing as mp
import tables

class data:
    def __init__(self, listf, nx, ny, nproc=0):
        self.listinc = []
        for i in range(len(listf)):
            self.listinc.append((listf[i], nx, ny))

    def __del__(self):
        del self.listinc

    def get_dsets(self, tuple_inc):
        listf, nx, ny = tuple_inc
        x = np.zeros((nx, ny))
        f = tables.openFile(listf)
        x = np.transpose(f.root.x[:ny,:nx])
        f.close()
        return(x)

    def flatten(self):
        nproc = mp.cpu_count()*2

        def worker(tasks, results):
            for i, x in iter(tasks.get, 'STOP'):
                print i, x
                results.put(i, self.get_dsets(x))

        tasks = mp.Queue()
        results = mp.Queue()
        manager = mp.Manager()
        lx = manager.list()

        for i, out in enumerate(self.listinc):
            tasks.put((i, out))

        for i in range(nproc):
            mp.Process(target=worker, args=(tasks, results)).start()

        for i in range(len(self.listinc)):
            j, res = results.get()
            lx.append(res)

        for i in range(nproc):
            tasks.put('STOP')
</code></pre>
<p>I tried different things (including, as in this simple example, using a <code>manager</code> to retrieve the data), but I always get a <code>TypeError: an integer is required</code>.</p>
<p>I do not use ctypes arrays because I don't really need shared arrays (I just want to retrieve my data), and after retrieving the data I want to work with it in NumPy.</p>
<p>Any thought, hint or help would be highly appreciated!</p>
<p><strong>Edit:</strong> The complete error I get is the following:</p>
<pre><code>Process Process-341:
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/home/toto/test/rd_para.py", line 81, in worker
    results.put(i, self.get_dsets(x))
  File "/usr/lib/python2.7/multiprocessing/queues.py", line 101, in put
    if not self._sem.acquire(block, timeout):
TypeError: an integer is required
</code></pre>
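<p>The traceback points at <code>results.put(i, self.get_dsets(x))</code>. The signature is <code>Queue.put(obj, block=True, timeout=None)</code>, so the array appears to be passed as <code>block</code> rather than as part of the payload. A minimal sketch of the same worker pattern with the index and result wrapped in one tuple follows. It is written for Python 3, and <code>load_array</code>, <code>read_all</code> and the file names are hypothetical stand-ins (no HDF5 file is actually opened), so treat it as an illustration under those assumptions rather than a drop-in fix:</p>

```python
import multiprocessing as mp

STOP = 'STOP'  # sentinel telling a worker to exit, as in the question


def load_array(task):
    """Hypothetical stand-in for get_dsets(): builds an nx-by-ny nested
    list instead of reading an HDF5 file with PyTables."""
    fname, nx, ny = task
    return [[float(len(fname))] * ny for _ in range(nx)]


def worker(tasks, results):
    # Defined at module level so it can be pickled under any start method.
    for i, task in iter(tasks.get, STOP):
        # put() takes ONE payload object; its second positional
        # parameter is `block`, so index and data go in a single tuple.
        results.put((i, load_array(task)))


def read_all(task_list, nproc=2):
    """Run the workers and return the results in input order."""
    tasks, results = mp.Queue(), mp.Queue()
    for item in enumerate(task_list):
        tasks.put(item)
    for _ in range(nproc):
        tasks.put(STOP)  # one sentinel per worker

    procs = [mp.Process(target=worker, args=(tasks, results))
             for _ in range(nproc)]
    for p in procs:
        p.start()

    out = [None] * len(task_list)
    for _ in task_list:
        i, arr = results.get()  # drain results before joining
        out[i] = arr
    for p in procs:
        p.join()
    return out
```

<p>To try it, call something like <code>read_all([("a.h5", 2, 3), ("bb.h5", 2, 3)])</code> from under an <code>if __name__ == "__main__":</code> guard. Results are stored by index <code>i</code> because workers can finish in any order; a plain list in the parent process is enough here, so a <code>Manager</code> list is not needed just to collect the data.</p>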