I would say your test scheme is not really useful. To fulfill the db query, the db server goes through several steps:

1. parse the SQL
2. work up a query plan, i.e. decide which indices to use (if any), optimize, etc.
3. if an index is used, search it for the pointers to the actual data, then go to the appropriate location in the data; if no index is used, scan *the whole table* to determine which rows are needed
4. load the data from disk into a temporary location (hopefully, but not necessarily, memory)
5. perform the `count()` and `avg()` calculations

So, creating an array in Python and getting the average basically skips all these steps save the last one. As disk I/O is among the most expensive operations a program has to perform, this is a major flaw in the test (see also the answers to [this question](https://stackoverflow.com/questions/26021/how-is-data-compression-more-effective-than-indexing-for-search-performance) I asked here before). Even if you read the data from disk in your other test, the process is completely different and it's hard to tell how relevant the results are.

To obtain more information about where Postgres spends its time, I would suggest the following tests:

- Compare the execution time of your query to that of a SELECT without the aggregating functions (i.e. cut step 5).
- If you find that the aggregation leads to a significant slowdown, check whether Python does it faster, obtaining the raw data through the plain SELECT from the comparison. (There is a sketch of this comparison at the end of this answer.)

To speed up your query, reduce disk access first. I doubt very much that it's the aggregation that takes the time.

There are several ways to do that:

- Cache data (in memory!) for subsequent access, either via the db engine's own capabilities or with tools like memcached (see the caching sketch at the end of this answer).
- Reduce the size of your stored data.
- Optimize the use of indices. Sometimes this can mean skipping index use altogether (after all, it's disk access, too). For MySQL, I seem to remember that it's recommended to skip indices if you assume that the query fetches more than 10% of all the data in the table.
- If your query makes good use of indices, I know that for MySQL databases it helps to put indices and data on separate physical disks. However, I don't know whether that's applicable for Postgres.
- There might also be more sophisticated problems, such as swapping rows to disk if for some reason the result set can't be completely processed in memory. But I would leave that kind of research alone until I ran into serious performance problems that I couldn't find another way to fix, as it requires knowledge about a lot of little under-the-hood details of your process.

**Update:**

*I just realized that you seem to have no use for indices in the above query, and most likely aren't using any either, so my advice on indices probably wasn't helpful. Sorry. Still, I'd say that the aggregation is not the problem; disk access is. I'll leave the index stuff in anyway; it might still be of some use.*
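To make the first comparison concrete, here is a minimal sketch of the timing test, assuming a hypothetical table `measurements` with a numeric column `value` and psycopg2 as the driver; adjust the names and the connection string to your setup.

```python
# Minimal timing sketch: aggregate in Postgres vs. fetch raw rows and
# aggregate in Python. Table/column names and the DSN are hypothetical.
import time
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # hypothetical connection string
cur = conn.cursor()

# Variant A: let Postgres do the aggregation (steps 1-5 above).
t0 = time.perf_counter()
cur.execute("SELECT count(*), avg(value) FROM measurements")
db_count, db_avg = cur.fetchone()
t_db = time.perf_counter() - t0

# Variant B: plain SELECT (cuts step 5), then aggregate in Python.
t0 = time.perf_counter()
cur.execute("SELECT value FROM measurements")
rows = cur.fetchall()
t_fetch = time.perf_counter() - t0

t0 = time.perf_counter()
values = [r[0] for r in rows]
py_avg = sum(values) / len(values)
t_py = time.perf_counter() - t0

print(f"aggregate in db:       {t_db:.3f}s (avg = {db_avg})")
print(f"plain SELECT (fetch):  {t_fetch:.3f}s")
print(f"aggregate in Python:   {t_py:.3f}s (avg = {py_avg})")

cur.close()
conn.close()
```

If the plain SELECT already accounts for most of the total time, the bottleneck is disk access and data transfer rather than the aggregation, which is what I suspect.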
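And here is a sketch of the caching idea from the first bullet, using memcached as mentioned in the list. I'm assuming the pymemcache client and a memcached daemon on localhost:11211 purely for illustration; the table and column names are the same hypothetical ones as above.

```python
# Sketch: serve repeated requests for the average from memcached so the
# expensive disk-bound query runs only on a cache miss.
import json
import psycopg2
from pymemcache.client.base import Client  # assumed client library

cache = Client(("localhost", 11211))
CACHE_KEY = "measurements:avg"

def cached_average(conn):
    hit = cache.get(CACHE_KEY)
    if hit is not None:
        return json.loads(hit)  # served from memory, no disk access
    cur = conn.cursor()
    cur.execute("SELECT avg(value) FROM measurements")
    avg = float(cur.fetchone()[0])
    cur.close()
    cache.set(CACHE_KEY, json.dumps(avg), expire=300)  # keep for 5 minutes
    return avg

conn = psycopg2.connect("dbname=mydb")  # hypothetical connection string
print(cached_average(conn))  # first call hits the db; later calls don't
conn.close()
```

The expiry is a design choice: a short TTL keeps the cached figure roughly current without any invalidation logic; if the table changes rarely, you could instead delete the key whenever the data is written.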