StackOverflow2013


plurals

PO
primarykey
Id
18994101
data
AcceptedAnswerId
0
AnswerCount
0
ClosedDate
CommentCount
0
CommunityOwnedDate
CreationDate
2013-09-25T00:05:11.843
FavoriteCount
0
LastActivityDate
2013-09-25T01:19:47.617
LastEditDate
2013-09-25T01:19:47.617
LastEditorUserId
908494
OwnerUserId
908494
ParentId
18993956
PostTypeId
2
Score
4
ViewCount
0
LastEditorDisplayName
text
Body
Obviously you need to run each line from Numbers.txt against each line from Ranges.txt. You could just iterate over Numbers.txt, and, for each line, iterate over Ranges.txt. But this will take forever, reading the whole Ranges.txt file millions of times. You could read both of them into memory, but that will take a lot of storage, and it means you won't be able to do any processing until you've finished reading and preprocessing both files. So, what you want to do is read Ranges.txt into memory once and store it as, say, a list of pairs of ints, but read Numbers.txt lazily, iterating over the list for each number. This kind of thing comes up all the time. In general, you want to make the bigger collection into the outer loop, and make it as lazy as possible, while the smaller collection goes into the inner loop, and is preprocessed to make it as fast as possible. But if the bigger collection can be preprocessed more efficiently (and you have enough memory to store it!), reverse that. <hr> And speaking of preprocessing, you can do a lot better than just reading into a list of pairs of ints. If you sorted Ranges.txt, you could find the closest range without going over by bisecting, then check just that one range (about 18 steps), instead of checking each range exhaustively (100000 steps). This is a bit of a pain with the stdlib, because it's easy to make off-by-one errors when using <a href="http://docs.python.org/3/library/bisect.html" rel="nofollow"><code>bisect</code></a>, but there are plenty of ActiveState recipes to make it easier (including one <a href="http://code.activestate.com/recipes/577197-sortedcollection/" rel="nofollow">linked from the official docs</a>), not to mention third-party modules like <a href="https://pypi.python.org/pypi/blist/" rel="nofollow"><code>blist</code></a> or <a href="https://pypi.python.org/pypi/bintrees/" rel="nofollow"><code>bintrees</code></a> that give you a sorted collection in a simple OO interface. 
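To make the bisecting step concrete, here is a minimal sketch of the lookup using only the stdlib. It assumes the ranges don't overlap and that both ends are inclusive; the helper name `find_range` and the sample data are made up for illustration.

```python
import bisect

def find_range(ranges, starts, number):
    """Return the (start, end) pair containing number, or None.

    Assumes `ranges` is sorted and non-overlapping with inclusive ends,
    and `starts` holds the start values in the same order.
    """
    # Index of the last range whose start is <= number (-1 if there is none).
    i = bisect.bisect_right(starts, number) - 1
    if i >= 0 and number <= ranges[i][1]:
        return ranges[i]
    return None

# Made-up sample data, only for illustration.
ranges = sorted([(50, 60), (1, 10), (20, 30)])
starts = [start for start, _ in ranges]

print(find_range(ranges, starts, 25))  # (20, 30)
print(find_range(ranges, starts, 15))  # None
```

Using `bisect_right` and subtracting one is what avoids the usual off-by-one: a number equal to a range's start still lands on that range rather than the one before it.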
<hr> So, something like this sketch (assuming the ranges don't overlap and both ends are inclusive): <pre><code>import bisect

# Read the ranges once and sort them by start value.
with open('ranges.txt') as f:
    ranges = sorted(tuple(map(int, line.split())) for line in f)
starts = [start for start, end in ranges]

range_values = {}
with open('numbers.txt') as f:
    # Read the numbers lazily, one line at a time.
    rows = (map(int, line.split()) for line in f)
    for number, value in rows:
        # Find the last range starting at or before number, then check its end.
        i = bisect.bisect_right(starts, number) - 1
        if i &gt;= 0 and number &lt;= ranges[i][1]:
            range_values.setdefault(ranges[i], []).append(value)

# Open the output for writing, one line per range that matched something.
with open('output.txt', 'w') as f:
    for r, values in range_values.items():
        mean = sum(values) / len(values)
        f.write('{} {} {}\n'.format(r[0], r[1], mean))
</code></pre> <hr> By the way, if the parsing turns out to be any more complicated than just calling <code>split</code> on each line, I'd suggest using the <code>csv</code> module… but it looks like that won't be a problem here. <hr> What if you can't fit Ranges.txt into memory, but can fit Numbers.txt? Well, you can sort that, then iterate over Ranges.txt, find all of the matches in the sorted numbers, and write the results out for that range. This is a bit more complicated, because you have to use <code>bisect_left</code> and <code>bisect_right</code> and iterate over everything in between. But that's the only way in which it's any harder. (And here, a third-party class will help even more. For example, with a <code>bintrees.FastRBTree</code> as your sorted collection, it's just <code>sorted_number_tree[low:high]</code>.) <hr> If the ranges can overlap, you need to be a bit smarter: you have to find the closest range without going over the start, and the closest range without going under the end, and check everything in between. But the main trick there is the exact same one used for the last version. The only other trick is to keep two copies of the ranges, one sorted by the start value and one by the end, and you'll need to have one of them be a map to indices in the other instead of just a plain list.
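For the variant where Numbers.txt fits in memory but Ranges.txt does not, the bisect_left/bisect_right step might look roughly like this. It is a sketch under the same inclusive-ends assumption; the helper name `values_in_range` and the sample rows are hypothetical, and in practice the `(low, high)` pairs would be read lazily from Ranges.txt.

```python
import bisect

def values_in_range(sorted_rows, keys, low, high):
    """Return the values whose number lies in [low, high] (inclusive).

    `sorted_rows` is a list of (number, value) pairs sorted by number,
    and `keys` is the list of numbers in the same order.
    """
    lo = bisect.bisect_left(keys, low)    # first index with number >= low
    hi = bisect.bisect_right(keys, high)  # first index with number > high
    return [value for _, value in sorted_rows[lo:hi]]

# Made-up sample data, only for illustration.
rows = sorted([(25, 4.0), (3, 1.0), (7, 3.0), (55, 10.0), (15, 8.0)])
keys = [number for number, _ in rows]

# Stand-in for iterating over Ranges.txt one line at a time.
for low, high in [(1, 10), (20, 30), (40, 45)]:
    values = values_in_range(rows, keys, low, high)
    if values:
        print(low, high, sum(values) / len(values))
```

Each range costs two bisections plus the length of the matching slice, so the result for a range can be written out immediately and nothing per-range has to be kept in memory.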
Tags
Title
singulars
PostAcceptedAnswerId
1. This table or related slice is empty.
PostParentId
1. PO Efficient Python way to process two huge files?
 singulars
 PostTypePostTypeId
 PTQuestion
PostTypePostTypeId
1. PTAnswer
UserLastEditorUserId
1. US abarnert
UserOwnerUserId
1. US abarnert
plurals
PostLinksPostIdRelatedPostId
1. This table or related slice is empty.
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. This table or related slice is empty.
PostsParentIdCreationDate
1. This table or related slice is empty.
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
2. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
3. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
CommentsPostId
1. This table or related slice is empty.
