I think it's the overhead associated with distributing the individual strings to the workers and receiving the results. If I run your parallel code as given above with an example document (Dostoevsky's "Crime and Punishment"), it takes about 0.32 s, whereas the single-process version takes just 0.09 s. If I modify the worker function to process just the string "test" instead of the real document (still passing in the real string as an argument), the runtime goes down to 0.22 s. However, if I pass "test" as the argument to map_async, the runtime decreases to 0.06 s. Hence I'd say that in your case the runtime of the program is limited by the inter-process communication overhead.

With the following code I get the runtime of the parallel version down to 0.08 s: first, I partition the file into a number of chunks of (almost) equal length, making sure that each chunk boundary coincides with a newline. Then I pass only the offset and length of each chunk to the worker processes, let each one open the file, read its chunk, process it and return the results. This causes significantly less overhead than distributing the strings directly through map_async. For larger files you should see an even bigger improvement with this code. Also, if you can tolerate small count errors, you can omit the step that determines correct chunk boundaries and just split the file into equally sized chunks. In my example this brings the runtime down to 0.04 s, making the multiprocessing code faster than the single-process version.

    #coding=utf-8
    # Python 2 code: word counting with per-worker file offsets instead of
    # shipping the document text through the multiprocessing pipe.
    import time
    import multiprocessing
    import string
    from collections import Counter
    import os

    for_split = [',', '\n', '\t', '\'', '.', '\"', '!', '?', '-', '~']
    ignored = ['the', 'and', 'i', 'to', 'of', 'a', 'in', 'was', 'that', 'had',
               'he', 'you', 'his', 'my', 'it', 'as', 'with', 'her', 'for', 'on']
    result_list = []

    def worker(offset, length, filename):
        # Each worker opens the file itself and reads only its own chunk.
        origin = open(filename, 'r')
        origin.seek(offset)
        content = origin.read(length).lower()
        for ch in for_split:
            content = content.replace(ch, ' ')
        words = string.split(content)
        result = Counter(words)
        origin.close()
        return result

    def log_result(result):
        result_list.append(result)

    def main():
        processes = 5
        pool = multiprocessing.Pool(processes=processes)
        filename = "document.txt"
        file_size = os.stat(filename)[6]

        # Build chunks of roughly file_size/processes bytes whose boundaries
        # fall on line breaks; join with '' because readlines() keeps the
        # trailing newlines, so the chunk lengths line up with file offsets.
        chunks = []
        origin = open(filename, 'r')
        while True:
            lines = origin.readlines(file_size / processes)
            if not lines:
                break
            chunks.append("".join(lines))
        origin.close()

        # Hand each worker only an (offset, length) pair, not the text itself.
        lengths = [len(chunk) for chunk in chunks]
        offset = 0
        for length in lengths:
            pool.apply_async(worker, args=(offset, length, filename,),
                             callback=log_result)
            offset += length
        pool.close()
        pool.join()

        # Merge the per-chunk counters and print the ten most common words
        # that are not in the ignore list.
        result = Counter()
        for item in result_list:
            result = result + item
        result = result.most_common(40)
        i = 0
        for word, frequency in result:
            if word not in ignored and i < 10:
                print "%s : %d" % (word, frequency)
                i = i + 1

    if __name__ == "__main__":
        starttime = time.clock()
        main()
        print time.clock() - starttime
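For reference, the overhead measurement described in the first paragraph can be reproduced with a small self-contained sketch. This is only an illustration, not the code from the question: the count_words worker, the document.txt file name and the use of pool.map are assumptions made for the example. It times shipping the full document chunks to the workers versus shipping tiny placeholder strings, which isolates the inter-process communication cost; it should run under both Python 2 and 3.

    # Minimal sketch of the overhead experiment: the same number of tasks is
    # submitted twice, once with the real document chunks and once with tiny
    # placeholder strings, so the difference in wall-clock time is mostly the
    # cost of pickling the text and sending it to the workers.
    # count_words and "document.txt" are illustrative, not the question's code.
    import time
    import multiprocessing
    from collections import Counter

    def count_words(text):
        # toy worker: lowercase, split on whitespace, count frequencies
        return Counter(text.lower().split())

    def timed_map(pool, pieces):
        start = time.time()
        pool.map(count_words, pieces)
        return time.time() - start

    if __name__ == "__main__":
        with open("document.txt", "r") as f:
            text = f.read()
        n = 5
        size = len(text) // n + 1
        real_pieces = [text[i:i + size] for i in range(0, len(text), size)]
        tiny_pieces = ["test"] * len(real_pieces)  # same task count, ~no data

        pool = multiprocessing.Pool(processes=n)
        t_real = timed_map(pool, real_pieces)
        t_tiny = timed_map(pool, tiny_pieces)
        pool.close()
        pool.join()

        print("full chunks through IPC: %.3f s" % t_real)
        print("tiny strings through IPC: %.3f s" % t_tiny)

If the first number is much larger than the second, the bottleneck is the data transfer rather than the counting itself, which is exactly what passing offsets and lengths instead of strings avoids.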