I think it's the overhead associated with distributing the individual strings to the workers and receiving the results. If I run your parallel code as given above with an example document (Dostoevsky's "Crime and Punishment"), it takes about 0.32 s, whereas the single-process version takes just 0.09 s. If I modify the worker function to process just the string "test" instead of the real document (still passing in the real string as an argument), the runtime goes down to 0.22 s. However, if I pass "test" as the argument to map_async, the runtime decreases to 0.06 s. Hence I'd say that in your case the runtime of the program is limited by the inter-process communication overhead.

With the following code I get the runtime of the parallel version down to 0.08 s: first, I partition the file into a number of chunks of (almost) equal length, making sure that each chunk boundary coincides with a newline. Then I pass only the offset and length of each chunk to the worker processes, let each one open the file, read its chunk, process it and return the results. This causes significantly less overhead than distributing the strings directly through map_async. For larger files you should see an even bigger improvement with this code. Also, if you can tolerate small count errors, you can omit the step that determines correct chunk boundaries and just split the file into equally sized chunks. In my example this brings the runtime down to 0.04 s, making the multiprocessing code faster than the single-process version.

    #coding=utf-8
    # Python 2 code: word counting with per-worker file offsets instead of
    # shipping the document text through the multiprocessing pipe.
    import time
    import multiprocessing
    import string
    from collections import Counter
    import os

    for_split = [',', '\n', '\t', '\'', '.', '\"', '!', '?', '-', '~']
    ignored = ['the', 'and', 'i', 'to', 'of', 'a', 'in', 'was', 'that', 'had',
               'he', 'you', 'his', 'my', 'it', 'as', 'with', 'her', 'for', 'on']
    result_list = []

    def worker(offset, length, filename):
        # Each worker opens the file itself and reads only its own chunk.
        origin = open(filename, 'r')
        origin.seek(offset)
        content = origin.read(length).lower()
        for ch in for_split:
            content = content.replace(ch, ' ')
        words = string.split(content)
        result = Counter(words)
        origin.close()
        return result

    def log_result(result):
        result_list.append(result)

    def main():
        processes = 5
        pool = multiprocessing.Pool(processes=processes)
        filename = "document.txt"
        file_size = os.stat(filename)[6]

        # Build chunks of roughly file_size/processes bytes whose boundaries
        # fall on line breaks; join with '' because readlines() keeps the
        # trailing newlines, so the chunk lengths line up with file offsets.
        chunks = []
        origin = open(filename, 'r')
        while True:
            lines = origin.readlines(file_size / processes)
            if not lines:
                break
            chunks.append("".join(lines))
        origin.close()

        # Hand each worker only an (offset, length) pair, not the text itself.
        lengths = [len(chunk) for chunk in chunks]
        offset = 0
        for length in lengths:
            pool.apply_async(worker, args=(offset, length, filename,),
                             callback=log_result)
            offset += length
        pool.close()
        pool.join()

        # Merge the per-chunk counters and print the ten most common words
        # that are not in the ignore list.
        result = Counter()
        for item in result_list:
            result = result + item
        result = result.most_common(40)
        i = 0
        for word, frequency in result:
            if word not in ignored and i < 10:
                print "%s : %d" % (word, frequency)
                i = i + 1

    if __name__ == "__main__":
        starttime = time.clock()
        main()
        print time.clock() - starttime
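For reference, the overhead measurement described in the first paragraph can be reproduced with a small self-contained sketch. This is only an illustration, not the code from the question: the count_words worker, the document.txt file name and the use of pool.map are assumptions made for the example. It times shipping the full document chunks to the workers versus shipping tiny placeholder strings, which isolates the inter-process communication cost; it should run under both Python 2 and 3.

    # Minimal sketch of the overhead experiment: the same number of tasks is
    # submitted twice, once with the real document chunks and once with tiny
    # placeholder strings, so the difference in wall-clock time is mostly the
    # cost of pickling the text and sending it to the workers.
    # count_words and "document.txt" are illustrative, not the question's code.
    import time
    import multiprocessing
    from collections import Counter

    def count_words(text):
        # toy worker: lowercase, split on whitespace, count frequencies
        return Counter(text.lower().split())

    def timed_map(pool, pieces):
        start = time.time()
        pool.map(count_words, pieces)
        return time.time() - start

    if __name__ == "__main__":
        with open("document.txt", "r") as f:
            text = f.read()
        n = 5
        size = len(text) // n + 1
        real_pieces = [text[i:i + size] for i in range(0, len(text), size)]
        tiny_pieces = ["test"] * len(real_pieces)  # same task count, ~no data

        pool = multiprocessing.Pool(processes=n)
        t_real = timed_map(pool, real_pieces)
        t_tiny = timed_map(pool, tiny_pieces)
        pool.close()
        pool.join()

        print("full chunks through IPC: %.3f s" % t_real)
        print("tiny strings through IPC: %.3f s" % t_tiny)

If the first number is much larger than the second, the bottleneck is the data transfer rather than the counting itself, which is exactly what passing offsets and lengths instead of strings avoids.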