
# Improving Postgres psycopg2 query performance for Python to the same level of Java's JDBC driver
## Overview

I'm attempting to improve the performance of our database queries for SQLAlchemy. We're using psycopg2. In our production system, we're choosing to go with Java because it is simply faster by at least 50%, if not closer to 100%. So I am hoping someone in the Stack Overflow community has a way to improve my performance.

I think my next step is going to be to end up patching the psycopg2 library to behave like the JDBC driver. If that's the case and someone has already done this, that would be fine, but I am hoping I've still got a settings or refactoring tweak I can do from Python.

## Details

I have a simple `SELECT * FROM someLargeDataSetTable` query running. The dataset is GBs in size. A quick performance chart is as follows:

### Timing Table

```
Records            | JDBC  | SQLAlchemy[1] | SQLAlchemy[2] | Psql
--------------------------------------------------------------------
1          (4kB)   | 200ms | 300ms         | 250ms         | 10ms
10         (8kB)   | 200ms | 300ms         | 250ms         | 10ms
100        (88kB)  | 200ms | 300ms         | 250ms         | 10ms
1,000      (600kB) | 300ms | 300ms         | 370ms         | 100ms
10,000     (6MB)   | 800ms | 830ms         | 730ms         | 850ms
100,000    (50MB)  | 4s    | 5s            | 4.6s          | 8s
1,000,000  (510MB) | 30s   | 50s           | 50s           | 1m32s
10,000,000 (5.1GB) | 4m44s | 7m55s         | 6m39s         | n/a
--------------------------------------------------------------------
5,000,000  (2.6GB) | 2m30s | 4m45s         | 3m52s         | 14m22s
--------------------------------------------------------------------
[1] - With the processrow function
[2] - Without the processrow function (direct dump)
```

I could add more (our data can be as much as terabytes), but the changing slope is evident from the data: JDBC just performs significantly better as the dataset size increases.

### Timing Table Notes

- The data sizes are approximate, but they should give you an idea of the amount of data.
- I'm using the `time` tool from a Linux bash command line.
- The times are wall-clock times (i.e. real).
- I'm using Python 2.6.6 and I'm running with `python -u`.
- The fetch size is 10,000.
- I'm not really worried about the Psql timing; it's there just as a reference point. I may not have properly set the fetch size for it.
- I'm also not really worried about the timings below the fetch size, as anything under 5 seconds is negligible for my application.
- Java and Psql appear to take about 1GB of memory; Python is more like 100MB (yay!!).
- I'm using the [cdecimal](http://www.bytereef.org/mpdecimal/doc/cdecimal/index.html) library.
- I noticed a [recent article](http://osdir.com/ml/pgsql-performance/2010-12/msg00197.html) discussing something similar to this. It appears that the JDBC driver design is totally different from the psycopg2 design (which I find rather annoying given the performance difference).
- My use case is basically that I have to run a daily process (with approximately 20,000 different steps... multiple queries) over very large datasets, and I have a very specific window of time in which to finish it. The Java we use is not simply JDBC; it's a "smart" wrapper on top of the JDBC engine. We don't want to use Java, and we'd like to stop using the "smart" part of it.
- I'm using one of our production system's boxes (database and backend process) to run the query, so this is our best-case timing. We have QA and Dev boxes that run much slower, and the extra query time can become significant.
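To help answers separate SQLAlchemy's overhead from psycopg2's, here is a stripped-down sketch of the same dump through a raw psycopg2 server-side (named) cursor, bypassing SQLAlchemy's row proxies entirely. The connection details are placeholders, the cursor name is arbitrary, and the fetch size matches the test script below; I haven't verified this closes the gap with JDBC:

```python
#!/usr/bin/env python
# rawPsycopg2.py -- sketch: the same dump through a raw psycopg2
# named (server-side) cursor, cutting SQLAlchemy out of the loop.
# Connection details are placeholders; fetchsize matches testSqlAlchemy.py.
import psycopg2

fetchsize = 10000
conn = psycopg2.connect(host="server", port=5432, user="usr",
                        password="pass", database="db")
# Passing a name creates a server-side cursor, so rows stream over in
# fetchmany-sized batches instead of being loaded all at once.
cur = conn.cursor(name="large_dump")
cur.execute("SELECT * FROM someLargeDataSetTable")

with open("/dev/null", "w") as nullDev:
    while True:
        rows = cur.fetchmany(fetchsize)
        if not rows:
            break
        for row in rows:
            print >> nullDev, row

cur.close()
conn.close()
```

Like the SQLAlchemy version, the named cursor should keep the memory profile flat, since rows only ever exist in fetchmany-sized batches.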
## testSqlAlchemy.py

```python
#!/usr/bin/env python
# testSqlAlchemy.py
import sys
try:
    import cdecimal
    sys.modules["decimal"] = cdecimal
except ImportError, e:
    print >> sys.stderr, "Error: cdecimal didn't load properly."
    raise SystemExit
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

def processrow(row, delimiter="|", null="\\N"):
    # Render one row as delimited text, substituting the null marker.
    newrow = []
    for x in row:
        if x is None:
            x = null
        newrow.append(str(x))
    return delimiter.join(newrow)

fetchsize = 10000
connectionString = "postgresql+psycopg2://usr:pass@server:port/db"
eng = create_engine(connectionString, server_side_cursors=True)
session = sessionmaker(bind=eng)()

with open("test.sql", "r") as queryFD:
    with open("/dev/null", "w") as nullDev:
        query = session.execute(queryFD.read())
        cur = query.cursor
        # Drain the server-side cursor in fetchsize-row batches.
        while cur.statusmessage not in ['FETCH 0', 'CLOSE CURSOR']:
            for row in query.fetchmany(fetchsize):
                print >> nullDev, processrow(row)
```

After timing, I also ran a cProfile, and this is the dump of worst offenders:

### Timing Profile (with processrow)

```
Fri Mar  4 13:49:45 2011    sqlAlchemy.prof

         415757706 function calls (415756424 primitive calls) in 563.923 CPU seconds

   Ordered by: cumulative time

   ncalls   tottime  percall  cumtime  percall filename:lineno(function)
        1     0.001    0.001  563.924  563.924 {execfile}
        1    25.151   25.151  563.924  563.924 testSqlAlchemy.py:2(<module>)
     1001     0.050    0.000  329.285    0.329 base.py:2679(fetchmany)
     1001     5.503    0.005  314.665    0.314 base.py:2804(_fetchmany_impl)
 10000003     4.328    0.000  307.843    0.000 base.py:2795(_fetchone_impl)
    10011     0.309    0.000  302.743    0.030 base.py:2790(__buffer_rows)
    10011   233.620    0.023  302.425    0.030 {method 'fetchmany' of 'psycopg2._psycopg.cursor' objects}
 10000000   145.459    0.000  209.147    0.000 testSqlAlchemy.py:13(processrow)
```

### Timing Profile (without processrow)

```
Fri Mar  4 14:03:06 2011    sqlAlchemy.prof

         305460312 function calls (305459030 primitive calls) in 536.368 CPU seconds

   Ordered by: cumulative time

   ncalls   tottime  percall  cumtime  percall filename:lineno(function)
        1     0.001    0.001  536.370  536.370 {execfile}
        1    29.503   29.503  536.369  536.369 testSqlAlchemy.py:2(<module>)
     1001     0.066    0.000  333.806    0.333 base.py:2679(fetchmany)
     1001     5.444    0.005  318.462    0.318 base.py:2804(_fetchmany_impl)
 10000003     4.389    0.000  311.647    0.000 base.py:2795(_fetchone_impl)
    10011     0.339    0.000  306.452    0.031 base.py:2790(__buffer_rows)
    10011   235.664    0.024  306.102    0.031 {method 'fetchmany' of 'psycopg2._psycopg.cursor' objects}
 10000000    32.904    0.000  172.802    0.000 base.py:2246(__repr__)
```

## Final Comments

Unfortunately, the processrow function needs to stay unless there is a way within SQLAlchemy to specify null = 'userDefinedValueOrString' and delimiter = 'userDefinedValueOrString' for the output. The Java we currently use already does this, so the comparison (with processrow) needed to be apples to apples. If there is a way to improve the performance of either processrow or SQLAlchemy with pure Python or a settings tweak, I'm very interested.
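**Edit:** one idea I'm toying with, as yet untested: since the delimiter/null formatting I need is exactly what Postgres's COPY produces, processrow could in principle move into the server via psycopg2's `copy_expert()` and a `COPY (SELECT ...) TO STDOUT` statement with `DELIMITER` and `NULL` options (the subselect form requires PostgreSQL 8.2+). A sketch, with placeholder connection details:

```python
#!/usr/bin/env python
# copyDump.py -- untested sketch: let Postgres do processrow's job
# (delimiter and NULL formatting) server-side via COPY.
# Requires PostgreSQL 8.2+ for COPY (SELECT ...) TO STDOUT.
import psycopg2

conn = psycopg2.connect(host="server", port=5432, user="usr",
                        password="pass", database="db")
cur = conn.cursor()

# E'\\N' in the SQL evaluates to the two characters \N, matching
# processrow's default null marker.
copySql = ("COPY (SELECT * FROM someLargeDataSetTable) TO STDOUT "
           r"WITH DELIMITER '|' NULL E'\\N'")

with open("/dev/null", "w") as nullDev:
    # copy_expert streams the server-formatted output straight into
    # the file object without building Python row objects at all.
    cur.copy_expert(copySql, nullDev)

cur.close()
conn.close()
```

If this works, it would sidestep both processrow and the per-row fetch overhead that dominates the profiles above, at the cost of losing per-row processing hooks in Python.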