
# Improving Postgres psycopg2 query performance for Python to the same level of Java's JDBC driver
## Overview

I'm attempting to improve the performance of our database queries for SQLAlchemy. We're using psycopg2. In our production system, we're choosing to go with Java because it is simply faster by at least 50%, if not closer to 100%. So I am hoping someone in the Stack Overflow community has a way to improve my performance.

I think my next step is going to be to end up patching the psycopg2 library to behave like the JDBC driver. If that's the case and someone has already done this, that would be fine, but I am hoping I've still got a settings or refactoring tweak I can do from Python.

## Details

I have a simple `SELECT * FROM someLargeDataSetTable` query running. The dataset is GBs in size. A quick performance chart is as follows:

### Timing Table

```
Records            | JDBC  | SQLAlchemy[1] | SQLAlchemy[2] | Psql
--------------------------------------------------------------------
1          (4kB)   | 200ms | 300ms         | 250ms         | 10ms
10         (8kB)   | 200ms | 300ms         | 250ms         | 10ms
100        (88kB)  | 200ms | 300ms         | 250ms         | 10ms
1,000      (600kB) | 300ms | 300ms         | 370ms         | 100ms
10,000     (6MB)   | 800ms | 830ms         | 730ms         | 850ms
100,000    (50MB)  | 4s    | 5s            | 4.6s          | 8s
1,000,000  (510MB) | 30s   | 50s           | 50s           | 1m32s
10,000,000 (5.1GB) | 4m44s | 7m55s         | 6m39s         | n/a
--------------------------------------------------------------------
5,000,000  (2.6GB) | 2m30s | 4m45s         | 3m52s         | 14m22s
--------------------------------------------------------------------
[1] - With the processrow function
[2] - Without the processrow function (direct dump)
```

I could add more (our data can be as much as terabytes), but the changing slope is evident from the data: JDBC just performs significantly better as the dataset size increases.

### Timing Table Notes

- The data sizes are approximate, but they should give you an idea of the amount of data.
- I'm using the `time` tool from a Linux bash command line.
- The times are wall-clock times (i.e. real).
- I'm using Python 2.6.6 and I'm running with `python -u`.
- The fetch size is 10,000.
- I'm not really worried about the Psql timing; it's there just as a reference point. I may not have properly set the fetch size for it.
- I'm also not really worried about the timings below the fetch size, as anything under 5 seconds is negligible for my application.
- Java and Psql appear to take about 1GB of memory; Python is more like 100MB (yay!!).
- I'm using the [cdecimal](http://www.bytereef.org/mpdecimal/doc/cdecimal/index.html) library.
- I noticed a [recent article](http://osdir.com/ml/pgsql-performance/2010-12/msg00197.html) discussing something similar to this. It appears that the JDBC driver design is totally different from the psycopg2 design (which I find rather annoying given the performance difference).
- My use case is basically that I have to run a daily process (with approximately 20,000 different steps... multiple queries) over very large datasets, and I have a very specific window of time in which to finish it. The Java we use is not simply JDBC; it's a "smart" wrapper on top of the JDBC engine. We don't want to use Java, and we'd like to stop using the "smart" part of it.
- I'm using one of our production system's boxes (database and backend process) to run the query, so this is our best-case timing. We have QA and Dev boxes that run much slower, and the extra query time can become significant.
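To help answers separate SQLAlchemy's overhead from psycopg2's, here is a stripped-down sketch of the same dump through a raw psycopg2 server-side (named) cursor, bypassing SQLAlchemy's row proxies entirely. The connection details are placeholders, the cursor name is arbitrary, and the fetch size matches the test script below; I haven't verified this closes the gap with JDBC:

```python
#!/usr/bin/env python
# rawPsycopg2.py -- sketch: the same dump through a raw psycopg2
# named (server-side) cursor, cutting SQLAlchemy out of the loop.
# Connection details are placeholders; fetchsize matches testSqlAlchemy.py.
import psycopg2

fetchsize = 10000
conn = psycopg2.connect(host="server", port=5432, user="usr",
                        password="pass", database="db")
# Passing a name creates a server-side cursor, so rows stream over in
# fetchmany-sized batches instead of being loaded all at once.
cur = conn.cursor(name="large_dump")
cur.execute("SELECT * FROM someLargeDataSetTable")

with open("/dev/null", "w") as nullDev:
    while True:
        rows = cur.fetchmany(fetchsize)
        if not rows:
            break
        for row in rows:
            print >> nullDev, row

cur.close()
conn.close()
```

Like the SQLAlchemy version, the named cursor should keep the memory profile flat, since rows only ever exist in fetchmany-sized batches.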
## testSqlAlchemy.py

```python
#!/usr/bin/env python
# testSqlAlchemy.py
import sys
try:
    import cdecimal
    sys.modules["decimal"] = cdecimal
except ImportError, e:
    print >> sys.stderr, "Error: cdecimal didn't load properly."
    raise SystemExit
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

def processrow(row, delimiter="|", null="\\N"):
    # Render one row as delimited text, substituting the null marker.
    newrow = []
    for x in row:
        if x is None:
            x = null
        newrow.append(str(x))
    return delimiter.join(newrow)

fetchsize = 10000
connectionString = "postgresql+psycopg2://usr:pass@server:port/db"
eng = create_engine(connectionString, server_side_cursors=True)
session = sessionmaker(bind=eng)()

with open("test.sql", "r") as queryFD:
    with open("/dev/null", "w") as nullDev:
        query = session.execute(queryFD.read())
        cur = query.cursor
        # Drain the server-side cursor in fetchsize-row batches.
        while cur.statusmessage not in ['FETCH 0', 'CLOSE CURSOR']:
            for row in query.fetchmany(fetchsize):
                print >> nullDev, processrow(row)
```

After timing, I also ran a cProfile, and this is the dump of worst offenders:

### Timing Profile (with processrow)

```
Fri Mar  4 13:49:45 2011    sqlAlchemy.prof

         415757706 function calls (415756424 primitive calls) in 563.923 CPU seconds

   Ordered by: cumulative time

   ncalls   tottime  percall  cumtime  percall filename:lineno(function)
        1     0.001    0.001  563.924  563.924 {execfile}
        1    25.151   25.151  563.924  563.924 testSqlAlchemy.py:2(<module>)
     1001     0.050    0.000  329.285    0.329 base.py:2679(fetchmany)
     1001     5.503    0.005  314.665    0.314 base.py:2804(_fetchmany_impl)
 10000003     4.328    0.000  307.843    0.000 base.py:2795(_fetchone_impl)
    10011     0.309    0.000  302.743    0.030 base.py:2790(__buffer_rows)
    10011   233.620    0.023  302.425    0.030 {method 'fetchmany' of 'psycopg2._psycopg.cursor' objects}
 10000000   145.459    0.000  209.147    0.000 testSqlAlchemy.py:13(processrow)
```

### Timing Profile (without processrow)

```
Fri Mar  4 14:03:06 2011    sqlAlchemy.prof

         305460312 function calls (305459030 primitive calls) in 536.368 CPU seconds

   Ordered by: cumulative time

   ncalls   tottime  percall  cumtime  percall filename:lineno(function)
        1     0.001    0.001  536.370  536.370 {execfile}
        1    29.503   29.503  536.369  536.369 testSqlAlchemy.py:2(<module>)
     1001     0.066    0.000  333.806    0.333 base.py:2679(fetchmany)
     1001     5.444    0.005  318.462    0.318 base.py:2804(_fetchmany_impl)
 10000003     4.389    0.000  311.647    0.000 base.py:2795(_fetchone_impl)
    10011     0.339    0.000  306.452    0.031 base.py:2790(__buffer_rows)
    10011   235.664    0.024  306.102    0.031 {method 'fetchmany' of 'psycopg2._psycopg.cursor' objects}
 10000000    32.904    0.000  172.802    0.000 base.py:2246(__repr__)
```

## Final Comments

Unfortunately, the processrow function needs to stay unless there is a way within SQLAlchemy to specify null = 'userDefinedValueOrString' and delimiter = 'userDefinedValueOrString' for the output. The Java we currently use already does this, so the comparison (with processrow) needed to be apples to apples. If there is a way to improve the performance of either processrow or SQLAlchemy with pure Python or a settings tweak, I'm very interested.
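**Edit:** one idea I'm toying with, as yet untested: since the delimiter/null formatting I need is exactly what Postgres's COPY produces, processrow could in principle move into the server via psycopg2's `copy_expert()` and a `COPY (SELECT ...) TO STDOUT` statement with `DELIMITER` and `NULL` options (the subselect form requires PostgreSQL 8.2+). A sketch, with placeholder connection details:

```python
#!/usr/bin/env python
# copyDump.py -- untested sketch: let Postgres do processrow's job
# (delimiter and NULL formatting) server-side via COPY.
# Requires PostgreSQL 8.2+ for COPY (SELECT ...) TO STDOUT.
import psycopg2

conn = psycopg2.connect(host="server", port=5432, user="usr",
                        password="pass", database="db")
cur = conn.cursor()

# E'\\N' in the SQL evaluates to the two characters \N, matching
# processrow's default null marker.
copySql = ("COPY (SELECT * FROM someLargeDataSetTable) TO STDOUT "
           r"WITH DELIMITER '|' NULL E'\\N'")

with open("/dev/null", "w") as nullDev:
    # copy_expert streams the server-formatted output straight into
    # the file object without building Python row objects at all.
    cur.copy_expert(copySql, nullDev)

cur.close()
conn.close()
```

If this works, it would sidestep both processrow and the per-row fetch overhead that dominates the profiles above, at the cost of losing per-row processing hooks in Python.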