StackOverflow2013

Note that there are some explanatory texts on larger screens.

plurals

PO
primarykey
Id
8775633
data
AcceptedAnswerId
0
AnswerCount
0
ClosedDate
CommentCount
4
CommunityOwnedDate
CreationDate
2012-01-08T05:34:04.463
FavoriteCount
0
LastActivityDate
2012-01-08T10:16:10.740
LastEditDate
2012-01-08T10:16:10.740
LastEditorUserId
712603
OwnerUserId
712603
ParentId
8775475
PostTypeId
2
Score
11
ViewCount
0
LastEditorDisplayName
text
Body
ETA: Now that you've posted your code, I can tell you there is a simple way to do what you're doing MUCH faster (>100 times faster). I see that what you're doing is adding a frequency in parentheses to each item in a list of strings. Instead of counting all the elements each time (which, as you can confirm using cProfile, is by far the largest bottleneck in your code), you can just create a <a href="http://docs.python.org/tutorial/datastructures.html#dictionaries" rel="noreferrer">dictionary</a> that maps from each element to its frequency. That way, you only have to go through the list twice- once to create the frequency dictionary, once to use it to add frequency. Here I'll show my new method, time it, and compare it to the old method using a generated test case. The test case even shows the new result to be exactly identical to the old one. Note: All you really need to pay attention to below is the new_method. <pre><code>import random import time import collections import cProfile LIST_LEN = 14000 def timefunc(f): t = time.time() f() return time.time() - t def random_string(length=3): """Return a random string of given length""" return "".join([chr(random.randint(65, 90)) for i in range(length)]) class Profiler: def __init__(self): self.original = [[random_string() for i in range(LIST_LEN)] for j in range(4)] def old_method(self): self.ListVar = self.original[:] for b in range(len(self.ListVar)): self.list1 = [] self.temp = [] for n in range(len(self.ListVar[b])): if not self.ListVar[b][n] in self.temp: self.list1.insert(n, self.ListVar[b][n] + '(' + str(self.ListVar[b].count(self.ListVar[b][n])) + ')') self.temp.insert(0, self.ListVar[b][n]) self.ListVar[b] = list(self.list1) return self.ListVar def new_method(self): self.ListVar = self.original[:] for i, inner_lst in enumerate(self.ListVar): freq_dict = collections.defaultdict(int) # create frequency dictionary for e in inner_lst: freq_dict[e] += 1 temp = set() ret = [] for e in inner_lst: if e not in temp: ret.append(e + '(' + str(freq_dict[e]) + ')') temp.add(e) self.ListVar[i] = ret return self.ListVar def time_and_confirm(self): """ Time the old and new methods, and confirm they return the same value """ time_a = time.time() l1 = self.old_method() time_b = time.time() l2 = self.new_method() time_c = time.time() # confirm that the two are the same assert l1 == l2, "The old and new methods don't return the same value" return time_b - time_a, time_c - time_b p = Profiler() print p.time_and_confirm() </code></pre> When I run this, it gets times of (15.963812112808228, 0.05961179733276367), meaning it's about 250 times faster, though this advantage depends on both how long the lists are and the frequency distribution within each list. I'm sure you'll agree that with this speed advantage, you probably won't need to use multiprocessing :) (My original answer is left in below for posterity) ETA: By the way, it is worth noting that this algorithm is roughly linear in the length of the lists, while the code you used is quadratic. This means it performs with even more of an advantage the larger the number of elements. For example, if you increase the length of each list to 1000000, it takes only 5 seconds to run. Based on extrapolation, the old code would take over a day :) <hr> It depends on the operation you are performing. For example: <pre><code>import time NUM_RANGE = 100000000 from multiprocessing import Process def timefunc(f): t = time.time() f() return time.time() - t def multi(): class MultiProcess(Process): def __init__(self): Process.__init__(self) def run(self): # Alter string + test processing speed for i in xrange(NUM_RANGE): a = 20 * 20 thread1 = MultiProcess() thread2 = MultiProcess() thread1.start() thread2.start() thread1.join() thread2.join() def single(): for i in xrange(NUM_RANGE): a = 20 * 20 for i in xrange(NUM_RANGE): a = 20 * 20 print timefunc(multi) / timefunc(single) </code></pre> On my machine, the multiprocessed operation takes up only ~60% the time of the singlethreaded one.
Tags
Title
singulars
PostAcceptedAnswerId
1. This table or related slice is empty.
PostParentId
1. POPython, using multiprocess is slower than not using it
 singulars
 PostTypePostTypeId
 PTQuestion
PostTypePostTypeId
1. PTAnswer
UserLastEditorUserId
1. USDavid Robinson
UserOwnerUserId
1. USDavid Robinson
plurals
PostLinksPostIdRelatedPostId
1. This table or related slice is empty.
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. POPython, using multiprocess is slower than not using it
 singulars
 PostTypePostTypeId
 PTQuestion
PostsParentIdCreationDate
1. This table or related slice is empty.
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
2. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
3. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTAcceptedByOriginator
CommentsPostId

Querying!

Guidance

A row detail

Detail views are divided into sections. All the information in the data section comes from columns in the selected row. The other sections display data from other, related rows.

Related data can be related in a to-one or a to-many fashion. Captions of data related in a to-many fashion link to a list view showing a filtered view of the table.

Try moving around until you find a non-empty to-many entry and click on the label to get to one. You can move back to the root by clicking on the database name in the header.