
<h1>Memory efficient (constant) and speed optimized iteration over a large table in Django</h1>
<p>I have a very large table. It's currently in a MySQL database. I use Django.</p>

<p>I need to iterate over <strong>each</strong> element of the table to pre-compute some particular data (maybe if I were better I could do otherwise, but that's not the point).</p>

<p>I'd like to keep the iteration as fast as possible, with a constant usage of memory.</p>

<p>As is already clearly explained in <a href="https://stackoverflow.com/questions/4856882/limiting-memory-use-in-a-large-django-queryset">Limiting Memory Use in a *Large* Django QuerySet</a> and <a href="https://stackoverflow.com/questions/4222176/why-is-iterating-through-a-large-django-queryset-consuming-massive-amounts-of-me">Why is iterating through a large Django QuerySet consuming massive amounts of memory?</a>, a simple iteration over all objects in Django will kill the machine, as it will retrieve ALL objects from the database.</p>

<h2>Towards a solution</h2>

<p>First of all, to reduce your memory consumption you should be sure DEBUG is False (or monkey-patch the cursor: <a href="https://stackoverflow.com/questions/7768027/turn-off-sql-logging-while-keeping-settings-debug">turn off SQL logging while keeping settings.DEBUG?</a>) to be sure Django isn't storing queries in <code>connections</code> for debugging.</p>

<p>But even with that,</p>

<pre><code>for model in Model.objects.all():
</code></pre>

<p>is a no-go.</p>

<p>Not even with the slightly improved form:</p>

<pre><code>for model in Model.objects.all().iterator():
</code></pre>

<p>Using <a href="https://docs.djangoproject.com/en/dev/ref/models/querysets/#django.db.models.query.QuerySet.iterator" rel="nofollow noreferrer"><code>iterator()</code></a> will save you some memory by not storing the results in the internal queryset cache (though not necessarily on PostgreSQL!); but it will still, apparently, retrieve the whole set of objects from the database.</p>

<h2>A naive solution</h2>

<p>The <a href="https://stackoverflow.com/a/5188179/422670">solution in the first question</a> is to slice the results, based on a counter, by a <code>chunk_size</code>. There are several ways to write it, but basically they all come down to an <code>OFFSET + LIMIT</code> query in SQL.</p>

<p>Something like:</p>

<pre><code>qs = Model.objects.all()
counter = 0
count = qs.count()
while counter &lt; count:
    for model in qs[counter:counter + chunk_size].iterator():
        yield model
    counter += chunk_size
</code></pre>

<p>While this is memory efficient (memory usage is constant, bounded by <code>chunk_size</code>), it's really poor in terms of speed: as OFFSET grows, both MySQL and PostgreSQL (and likely most DBs) will start choking and slowing down.</p>

<h2>A better solution</h2>

<p>A better solution is available in <a href="http://www.mellowmorning.com/2010/03/03/django-query-set-iterator-for-really-large-querysets/" rel="nofollow noreferrer">this post</a> by Thierry Schellenbach. It filters on the PK, which is way faster than offsetting (how much faster probably depends on the DB).</p>

<pre><code>import gc

pk = 0
last_pk = qs.order_by('-pk')[0].pk
queryset = qs.order_by('pk')
while pk &lt; last_pk:
    for row in queryset.filter(pk__gt=pk)[:chunk_size]:
        pk = row.pk
        yield row
    gc.collect()
</code></pre>

<p>This is starting to get satisfactory. Now Memory = O(C), and Speed ~= O(N).</p>

<h2>Issues with the "better" solution</h2>

<p>The better solution only works when the PK is available in the QuerySet. Unfortunately, that's not always the case, in particular when the QuerySet contains combinations of distinct (group_by) and/or values (ValuesQuerySet).</p>

<p>For that situation the "better solution" cannot be used.</p>

<h2>Can we do better?</h2>

<p>Now I'm wondering if we can go faster and avoid the issue of QuerySets without a PK. Maybe by using something that I found in other answers, but only in pure SQL: using <strong>cursors</strong>.</p>

<p>Since I'm quite bad with raw SQL, in particular in Django, here comes the real question:</p>

<p><strong>How can we build a better Django QuerySet iterator for large tables?</strong></p>

<p>My takeaway from what I've read is that we should use server-side cursors (apparently (see references) using a standard Django cursor would not achieve the same result, because by default both the python-MySQL and psycopg connectors cache the results).</p>

<p>Would this really be a faster (and/or more efficient) solution?</p>

<p>Can this be done using raw SQL in Django? Or should we write specific Python code depending on the database connector?</p>

<p>Server-side cursors in <a href="http://initd.org/psycopg/docs/usage.html#server-side-cursors" rel="nofollow noreferrer">PostgreSQL</a> and in <a href="http://mysql-python.sourceforge.net/MySQLdb.html#using-and-extending" rel="nofollow noreferrer">MySQL</a>.</p>

<p>That's as far as I could get for the moment...</p>

<h2>A Django <code>chunked_iterator()</code></h2>

<p>Now, of course, the best outcome would be to have this method work as <code>queryset.iterator()</code>, rather than <code>iterate(queryset)</code>, and be part of Django core, or at least a pluggable app.</p>

<p><strong>Update</strong> Thanks to "T" in the comments for finding a <a href="https://code.djangoproject.com/ticket/16614" rel="nofollow noreferrer">django ticket</a> that carries some additional information. Differences in connector behaviors make it so that probably the best solution would be to create a specific <code>chunked</code> method rather than transparently extending <code>iterator</code> (sounds like a good approach to me). An implementation stub <a href="https://github.com/akaariai/django-old/commit/8990e20df50ce110fe6ddbbdfed7a98987bb5835" rel="nofollow noreferrer">exists</a>, but there hasn't been any work on it in a year, and it does not look like the author is ready to jump on that yet.</p>

<h2>Additional Refs:</h2>

<ol>
<li><a href="https://stackoverflow.com/questions/4481388/why-does-mysql-higher-limit-offset-slow-the-query-down">Why does MYSQL higher LIMIT offset slow the query down?</a></li>
<li><a href="https://stackoverflow.com/questions/1243952/how-can-i-speed-up-a-mysql-query-with-a-large-offset-in-the-limit-clause">How can I speed up a MySQL query with a large offset in the LIMIT clause?</a></li>
<li><a href="http://explainextended.com/2009/10/23/mysql-order-by-limit-performance-late-row-lookups/" rel="nofollow noreferrer">http://explainextended.com/2009/10/23/mysql-order-by-limit-performance-late-row-lookups/</a></li>
<li><a href="https://stackoverflow.com/questions/7974155/postgresql-offset-limit-gets-to-be-very-slow">postgresql: offset + limit gets to be very slow</a></li>
<li><a href="https://stackoverflow.com/questions/6618366/improving-offset-performance-in-postgresql">Improving OFFSET performance in PostgreSQL</a></li>
<li><a href="http://www.depesz.com/2011/05/20/pagination-with-fixed-order/" rel="nofollow noreferrer">http://www.depesz.com/2011/05/20/pagination-with-fixed-order/</a></li>
<li><a href="https://stackoverflow.com/questions/337479/how-to-get-a-row-by-row-mysql-resultset-in-python">How to get a row-by-row MySQL ResultSet in python</a> (server-side cursor in MySQL)</li>
</ol>

<h2>Edits:</h2>

<p><strong>Django 1.6 is adding persistent database connections</strong></p>

<p><a href="https://docs.djangoproject.com/en/dev/ref/databases/#persistent-connections" rel="nofollow noreferrer">Django Database Persistent Connections</a></p>

<p>This should facilitate, under some conditions, using cursors. Still, it's outside my current skills (and time to learn) how to implement such a solution...</p>

<p>Also, the "better solution" definitely does not work in all situations and cannot be used as a generic approach, only as a stub to be adapted case by case...</p>
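<p>To make the PK-chunking pattern from the "better solution" concrete, here is a minimal, self-contained sketch of it as a reusable generator. The in-memory <code>table</code> and the <code>fetch_chunk</code> helper are hypothetical stand-ins so the example runs without a database; in Django, <code>fetch_chunk</code> would wrap something like <code>list(qs.filter(pk__gt=last_pk).order_by('pk')[:limit])</code>.</p>

```python
def chunked_iterator(fetch_chunk, chunk_size=1000):
    """Yield rows in PK order while holding at most chunk_size rows in memory.

    fetch_chunk(last_pk, limit) must return up to `limit` rows with
    pk > last_pk, ordered by pk ascending.
    """
    last_pk = 0  # assumes positive integer PKs; use another sentinel otherwise
    while True:
        rows = fetch_chunk(last_pk, chunk_size)
        if not rows:
            return
        for row in rows:
            last_pk = row['pk']
            yield row

# Hypothetical in-memory stand-in for a large table, to show the access pattern:
table = [{'pk': i, 'data': 'row-%d' % i} for i in range(1, 26)]

def fetch_chunk(last_pk, limit):
    # In Django this would be:
    #   list(qs.filter(pk__gt=last_pk).order_by('pk')[:limit])
    return [r for r in table if r['pk'] > last_pk][:limit]

rows = list(chunked_iterator(fetch_chunk, chunk_size=10))
assert [r['pk'] for r in rows] == list(range(1, 26))
```

<p>Each iteration of the outer loop materializes one chunk at most, which is what keeps memory bounded by <code>chunk_size</code>; the <code>pk__gt</code> filter hits the PK index instead of scanning past OFFSET rows. Note this still shares the limitation discussed above: it needs a PK in the QuerySet.</p>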