It does seem tempting to think that using a processing pool will solve problems like this, but it's going to end up being a good bit more complicated than that, at least in pure Python.

Because the OP mentioned that the lists on each input line would be longer in practice than two elements, I made a slightly more realistic input file using:

```
paste <(seq 20000000) <(seq 2 20000001) <(seq 3 20000002) | head -1000000 > largefile.txt
```

After profiling the original code, I found the slowest portion of the process to be the line-splitting routine. (`.split()` took approximately 2x longer than `.append()` on my machine.)

```
1000000    0.333    0.000    0.333    0.000 {method 'split' of 'str' objects}
1000000    0.154    0.000    0.154    0.000 {method 'append' of 'list' objects}
```

So I factored the split into another function and used a pool to distribute the work of splitting the fields:

```python
import sys
import collections
import multiprocessing as mp

d = collections.defaultdict(list)

def split(l):
    return l.split()

pool = mp.Pool(processes=4)
for keys in pool.map(split, open(sys.argv[1])):
    d[keys[0]].append(keys[1:])
```

Unfortunately, adding the pool slowed things down by more than 2x. The original version looked like this:

```
$ time python process.py smallfile.txt
real    0m7.170s
user    0m6.884s
sys     0m0.260s
```

versus the parallel version:

```
$ time python process-mp.py smallfile.txt
real    0m16.655s
user    0m24.688s
sys     0m1.380s
```

Because the `.map()` call basically has to serialize (pickle) each input, send it to the remote process, and then deserialize (unpickle) the return value from the remote process, using a pool in this way is much slower. You do get some improvement by adding more cores to the pool, but I'd argue that this is fundamentally the wrong way to distribute this work.

To really speed this up across cores, my guess is that you'd need to read in large chunks of the input using some sort of fixed block size. Then you could send the entire block to a worker process and get serialized lists back (though it's still unknown how much the deserialization here will cost you). Reading the input in fixed-size blocks sounds like it might be tricky with the anticipated input, however, since my guess is that each line isn't necessarily the same length.
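As a rough illustration of that block-based idea, here is a minimal sketch that batches by line count rather than a fixed byte size, which sidesteps the variable-line-length problem. The names `CHUNK_SIZE`, `split_chunk`, and `read_chunks` are mine, not from the original question, and the chunk size is an arbitrary guess; you'd want to tune it and measure whether the reduced per-call pickling overhead actually wins on your data.

```python
# Sketch: send whole blocks of lines to each worker so the pickling
# overhead is amortized over many lines instead of paid per line.
import sys
import collections
import itertools
import multiprocessing as mp

CHUNK_SIZE = 100000  # lines per block handed to a worker (illustrative value)

def split_chunk(lines):
    # Worker: split every line in the block and return the list of field lists.
    return [l.split() for l in lines]

def read_chunks(f, n):
    # Yield successive blocks of up to n lines from the open file.
    while True:
        chunk = list(itertools.islice(f, n))
        if not chunk:
            return
        yield chunk

if __name__ == '__main__':
    d = collections.defaultdict(list)
    with open(sys.argv[1]) as f, mp.Pool(processes=4) as pool:
        # imap keeps results streaming back in order without holding
        # the whole output in memory at once.
        for rows in pool.imap(split_chunk, read_chunks(f, CHUNK_SIZE)):
            for keys in rows:
                d[keys[0]].append(keys[1:])
```

Even then, the parent process still has to unpickle every returned block and do the `dict` insertion serially, so whether this beats the plain single-process loop depends on how expensive the splitting is relative to that merge step.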