
<h1>Memory efficient (constant) and speed optimized iteration over a large table in Django</h1>
<p>I have a very large table. It's currently in a MySQL database. I use Django.</p>

<p>I need to iterate over <strong>each</strong> element of the table to pre-compute some particular data (maybe if I were better I could do otherwise, but that's not the point).</p>

<p>I'd like to keep the iteration as fast as possible, with a constant usage of memory.</p>

<p>As is already clearly explained in <a href="https://stackoverflow.com/questions/4856882/limiting-memory-use-in-a-large-django-queryset">Limiting Memory Use in a *Large* Django QuerySet</a> and <a href="https://stackoverflow.com/questions/4222176/why-is-iterating-through-a-large-django-queryset-consuming-massive-amounts-of-me">Why is iterating through a large Django QuerySet consuming massive amounts of memory?</a>, a simple iteration over all objects in Django will kill the machine, as it will retrieve ALL objects from the database.</p>

<h2>Towards a solution</h2>

<p>First of all, to reduce your memory consumption you should be sure DEBUG is False (or monkey-patch the cursor: <a href="https://stackoverflow.com/questions/7768027/turn-off-sql-logging-while-keeping-settings-debug">turn off SQL logging while keeping settings.DEBUG?</a>) to be sure Django isn't storing queries in <code>connections</code> for debugging.</p>

<p>But even with that,</p>

<pre><code>for model in Model.objects.all():
</code></pre>

<p>is a no-go.</p>

<p>Not even with the slightly improved form:</p>

<pre><code>for model in Model.objects.all().iterator():
</code></pre>

<p>Using <a href="https://docs.djangoproject.com/en/dev/ref/models/querysets/#django.db.models.query.QuerySet.iterator" rel="nofollow noreferrer"><code>iterator()</code></a> will save you some memory by not storing the results in the internal queryset cache (though not necessarily on PostgreSQL!); but it will still, apparently, retrieve the whole set of objects from the database.</p>

<h2>A naive solution</h2>

<p>The <a href="https://stackoverflow.com/a/5188179/422670">solution in the first question</a> is to slice the results, based on a counter, by a <code>chunk_size</code>. There are several ways to write it, but basically they all come down to an <code>OFFSET + LIMIT</code> query in SQL.</p>

<p>Something like:</p>

<pre><code>qs = Model.objects.all()
counter = 0
count = qs.count()
while counter &lt; count:
    for model in qs[counter:counter + chunk_size].iterator():
        yield model
    counter += chunk_size
</code></pre>

<p>While this is memory efficient (memory usage is constant, bounded by <code>chunk_size</code>), it's really poor in terms of speed: as OFFSET grows, both MySQL and PostgreSQL (and likely most DBs) will start choking and slowing down.</p>

<h2>A better solution</h2>

<p>A better solution is available in <a href="http://www.mellowmorning.com/2010/03/03/django-query-set-iterator-for-really-large-querysets/" rel="nofollow noreferrer">this post</a> by Thierry Schellenbach. It filters on the PK, which is way faster than offsetting (how much faster probably depends on the DB).</p>

<pre><code>import gc

pk = 0
last_pk = qs.order_by('-pk')[0].pk
queryset = qs.order_by('pk')
while pk &lt; last_pk:
    for row in queryset.filter(pk__gt=pk)[:chunk_size]:
        pk = row.pk
        yield row
    gc.collect()
</code></pre>

<p>This is starting to get satisfactory. Now Memory = O(C), and Speed ~= O(N).</p>

<h2>Issues with the "better" solution</h2>

<p>The better solution only works when the PK is available in the QuerySet. Unfortunately, that's not always the case, in particular when the QuerySet contains combinations of distinct (group_by) and/or values (ValuesQuerySet).</p>

<p>For that situation the "better solution" cannot be used.</p>

<h2>Can we do better?</h2>

<p>Now I'm wondering if we can go faster and avoid the issue of QuerySets without a PK. Maybe by using something that I found in other answers, but only in pure SQL: using <strong>cursors</strong>.</p>

<p>Since I'm quite bad with raw SQL, in particular in Django, here comes the real question:</p>

<p><strong>How can we build a better Django QuerySet iterator for large tables?</strong></p>

<p>My takeaway from what I've read is that we should use server-side cursors (apparently (see references) using a standard Django cursor would not achieve the same result, because by default both the python-MySQL and psycopg connectors cache the results).</p>

<p>Would this really be a faster (and/or more efficient) solution?</p>

<p>Can this be done using raw SQL in Django? Or should we write specific Python code depending on the database connector?</p>

<p>Server-side cursors in <a href="http://initd.org/psycopg/docs/usage.html#server-side-cursors" rel="nofollow noreferrer">PostgreSQL</a> and in <a href="http://mysql-python.sourceforge.net/MySQLdb.html#using-and-extending" rel="nofollow noreferrer">MySQL</a>.</p>

<p>That's as far as I could get for the moment...</p>

<h2>A Django <code>chunked_iterator()</code></h2>

<p>Now, of course, the best outcome would be to have this method work as <code>queryset.iterator()</code>, rather than <code>iterate(queryset)</code>, and be part of Django core, or at least a pluggable app.</p>

<p><strong>Update</strong> Thanks to "T" in the comments for finding a <a href="https://code.djangoproject.com/ticket/16614" rel="nofollow noreferrer">django ticket</a> that carries some additional information. Differences in connector behaviors make it so that probably the best solution would be to create a specific <code>chunked</code> method rather than transparently extending <code>iterator</code> (sounds like a good approach to me). An implementation stub <a href="https://github.com/akaariai/django-old/commit/8990e20df50ce110fe6ddbbdfed7a98987bb5835" rel="nofollow noreferrer">exists</a>, but there hasn't been any work on it in a year, and it does not look like the author is ready to jump on that yet.</p>

<h2>Additional Refs:</h2>

<ol>
<li><a href="https://stackoverflow.com/questions/4481388/why-does-mysql-higher-limit-offset-slow-the-query-down">Why does MYSQL higher LIMIT offset slow the query down?</a></li>
<li><a href="https://stackoverflow.com/questions/1243952/how-can-i-speed-up-a-mysql-query-with-a-large-offset-in-the-limit-clause">How can I speed up a MySQL query with a large offset in the LIMIT clause?</a></li>
<li><a href="http://explainextended.com/2009/10/23/mysql-order-by-limit-performance-late-row-lookups/" rel="nofollow noreferrer">http://explainextended.com/2009/10/23/mysql-order-by-limit-performance-late-row-lookups/</a></li>
<li><a href="https://stackoverflow.com/questions/7974155/postgresql-offset-limit-gets-to-be-very-slow">postgresql: offset + limit gets to be very slow</a></li>
<li><a href="https://stackoverflow.com/questions/6618366/improving-offset-performance-in-postgresql">Improving OFFSET performance in PostgreSQL</a></li>
<li><a href="http://www.depesz.com/2011/05/20/pagination-with-fixed-order/" rel="nofollow noreferrer">http://www.depesz.com/2011/05/20/pagination-with-fixed-order/</a></li>
<li><a href="https://stackoverflow.com/questions/337479/how-to-get-a-row-by-row-mysql-resultset-in-python">How to get a row-by-row MySQL ResultSet in python</a> (server-side cursor in MySQL)</li>
</ol>

<h2>Edits:</h2>

<p><strong>Django 1.6 is adding persistent database connections</strong></p>

<p><a href="https://docs.djangoproject.com/en/dev/ref/databases/#persistent-connections" rel="nofollow noreferrer">Django Database Persistent Connections</a></p>

<p>This should facilitate, under some conditions, using cursors. Still, it's outside my current skills (and time to learn) how to implement such a solution...</p>

<p>Also, the "better solution" definitely does not work in all situations and cannot be used as a generic approach, only as a stub to be adapted case by case...</p>
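<p>To make the PK-chunking pattern from the "better solution" concrete, here is a minimal, self-contained sketch of it as a reusable generator. The in-memory <code>table</code> and the <code>fetch_chunk</code> helper are hypothetical stand-ins so the example runs without a database; in Django, <code>fetch_chunk</code> would wrap something like <code>list(qs.filter(pk__gt=last_pk).order_by('pk')[:limit])</code>.</p>

```python
def chunked_iterator(fetch_chunk, chunk_size=1000):
    """Yield rows in PK order while holding at most chunk_size rows in memory.

    fetch_chunk(last_pk, limit) must return up to `limit` rows with
    pk > last_pk, ordered by pk ascending.
    """
    last_pk = 0  # assumes positive integer PKs; use another sentinel otherwise
    while True:
        rows = fetch_chunk(last_pk, chunk_size)
        if not rows:
            return
        for row in rows:
            last_pk = row['pk']
            yield row

# Hypothetical in-memory stand-in for a large table, to show the access pattern:
table = [{'pk': i, 'data': 'row-%d' % i} for i in range(1, 26)]

def fetch_chunk(last_pk, limit):
    # In Django this would be:
    #   list(qs.filter(pk__gt=last_pk).order_by('pk')[:limit])
    return [r for r in table if r['pk'] > last_pk][:limit]

rows = list(chunked_iterator(fetch_chunk, chunk_size=10))
assert [r['pk'] for r in rows] == list(range(1, 26))
```

<p>Each iteration of the outer loop materializes one chunk at most, which is what keeps memory bounded by <code>chunk_size</code>; the <code>pk__gt</code> filter hits the PK index instead of scanning past OFFSET rows. Note this still shares the limitation discussed above: it needs a PK in the QuerySet.</p>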