What you're looking for is a combination of offline processing and caching. By "offline", I mean that the computation logic happens outside the request-response cycle. By "caching", I mean that the result of your expensive calculation is sufficiently valid for X time, during which you don't need to recalculate it for display. This is a very common pattern.

## Offline Processing

There are two widely-used approaches to work which needs to happen outside the request-response cycle:

- Cron jobs (often made easier via a custom management command)
- [Celery](http://celeryproject.org/)

In relative terms, cron is simpler to set up, and Celery is more powerful/flexible. That being said, Celery enjoys fantastic documentation and a comprehensive test suite. I've used it in production on almost every project, and while it does involve some requirements, it's not really a bear to set up.

### Cron

Cron jobs are the time-honored method. If all you need is to run some logic and store some result in the database, a cron job has zero dependencies. The only fiddly bit with cron jobs is getting your code to run in the context of your Django project -- that is, your code must correctly load your settings.py in order to know about your database and apps. For the uninitiated, this can lead to some aggravation in divining the proper `PYTHONPATH` and such.

If you're going the cron route, a good approach is to write a custom management command. You'll have an easy time testing your command from the terminal (and writing tests), and you won't need to do any special hoopla at the top of your management command to set up a proper Django environment. In production, you simply run `path/to/manage.py yourcommand`. I'm not sure if this approach works without the assistance of [virtualenv](http://pypi.python.org/pypi/virtualenv), which you really ought to be using regardless.

Another aspect to consider with cron jobs: if your logic takes a variable amount of time to run, cron is ignorant of the matter. A cute way to kill your server is to run a two-hour cron job like this every hour. You can roll your own locking mechanism to prevent this; just be aware that what starts out as a short cron job might not stay that way when your data grows, or when your RDBMS misbehaves, etc. etc.

In your case, it sounds like cron is less applicable because you'd need to calculate the graphs for every user every so often, without regard to who is actually using the system. This is where Celery can help.

### Celery

…is the bee's knees. Usually people are scared off by the "default" requirement of an AMQP broker. It's not terribly onerous setting up RabbitMQ, but it does require stepping outside the comfortable world of Python a bit. For many tasks, I just use redis as my task store for Celery. The settings are [straightforward](http://docs.celeryproject.org/en/latest/configuration.html#conf-redis-result-backend):

```
CELERY_RESULT_BACKEND = "redis"
REDIS_HOST = "localhost"
REDIS_PORT = 6379
REDIS_DB = 0
REDIS_CONNECT_RETRY = True
```

Voilà, no need for an AMQP broker.

Celery provides a wealth of advantages over simple cron jobs. Like cron, you can schedule [periodic tasks](http://docs.celeryproject.org/en/latest/userguide/periodic-tasks.html), but you can also fire off tasks in response to other stimuli without holding up the request/response cycle.
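To make the periodic side concrete, here's a rough sketch of an hourly task that queues a per-user recalculation for everyone with an unexpired session (the same idea I suggest further down). It uses the old-style `celery.task` decorators to match the rest of this answer; the task name and the session-scanning approach are just illustrative, and `calculate_stuff` is the per-user task defined in the caching example at the end:

```
# tasks.py -- a sketch, not a drop-in. Assumes django.contrib.sessions is your
# session backend and that calculate_stuff (the @task defined later in this
# answer) lives in the same module.
from datetime import datetime, timedelta

from celery.task import periodic_task
from django.contrib.sessions.models import Session

@periodic_task(run_every=timedelta(hours=1))
def refresh_active_user_charts():
    """Queue a chart recalculation for every user with an unexpired session."""
    user_ids = set()
    for session in Session.objects.filter(expire_date__gte=datetime.now()):
        uid = session.get_decoded().get('_auth_user_id')
        if uid is not None:
            user_ids.add(uid)
    for user_id in user_ids:
        calculate_stuff.delay(user_id)  # runs in the background, one task per user
```

With django-celery you'd run celerybeat (or a worker started with `-B`) alongside your workers so the schedule actually fires.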
If you don't want to compute the chart for every active user every so often, you will need to generate it on-demand. I'm assuming that querying for the latest *available* averages is cheap, computing new averages is expensive, and you're generating the actual charts client-side using something like [flot](http://code.google.com/p/flot/). Here's an example flow:

1. User requests a page which contains an averages chart.
2. Check the cache -- is there a stored, non-expired queryset containing averages for this user?
   - If yes, use that.
   - If not, fire off a celery task to recalculate it, requery, and cache the result. Since querying *existing* data is cheap, run the query if you want to show stale data to the user in the meantime.
3. If the chart is stale, optionally provide some indication that the chart is stale, or do some ajax fanciness to ping Django every so often and ask if the refreshed chart is ready.

You could combine this with a periodic task to recalculate the chart every hour for users that have an active session, to prevent *really* stale charts from being displayed. This isn't the only way to skin the cat, but it provides you with all the control you need to ensure freshness while throttling the CPU load of the calculation task. Best of all, the periodic task and the "on demand" task share the same logic -- you define the task once and call it from both places for added DRYness.

## Caching

The [Django cache framework](https://docs.djangoproject.com/en/1.3/topics/cache/) provides you with all the hooks you need to cache whatever you want for as long as you want. Most production sites rely on [memcached](http://memcached.org/) as their cache backend; I've lately started using redis with the [django-redis-cache](http://pypi.python.org/pypi/django-redis-cache/) backend instead, but I'm not sure I'd trust it for a major production site yet.
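For reference, wiring up either backend is just a `CACHES` entry in settings (Django 1.3+). A minimal sketch -- the hosts and ports are illustrative defaults, pick one:

```
# settings.py -- pick one of these; locations are illustrative.

# memcached
CACHES = {
    "default": {
        "BACKEND": "django.core.cache.backends.memcached.MemcachedCache",
        "LOCATION": "127.0.0.1:11211",
    }
}

# or redis, via the django-redis-cache package
CACHES = {
    "default": {
        "BACKEND": "redis_cache.RedisCache",
        "LOCATION": "localhost:6379",
    }
}
```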
Here's some code showing off usage of the [low-level caching API](https://docs.djangoproject.com/en/1.3/topics/cache/#the-low-level-cache-api) to accomplish the workflow laid out above:

```
import pickle

from celery.task import task
from django.core.cache import cache
from django.shortcuts import render

@task
def calculate_stuff(user_id):
    # ... do your work to update the averages ...
    # now pull the latest series
    averages = TransactionAverage.objects.filter(user=user_id, ...)
    # cache the pickled result for ten minutes
    cache.set("averages_%s" % user_id, pickle.dumps(averages), 60*10)

def myview(request, user_id):
    ctx = {}
    cached = cache.get("averages_%s" % user_id, None)
    if cached:
        averages = pickle.loads(cached)  # use the cached queryset
    else:
        # fetch the latest available data for now, same as in the task
        averages = TransactionAverage.objects.filter(user=user_id, ...)
        # fire off the celery task to update the information in the background
        calculate_stuff.delay(user_id)  # doesn't happen in-process
        ctx['stale_chart'] = True  # display a warning, if you like
    ctx['averages'] = averages
    # ... do your other work ...
    return render(request, 'my_template.html', ctx)
```

**Edit:** worth noting that pickling a queryset loads the entire queryset into memory. If you're pulling up a lot of data with your averages queryset this could be suboptimal. Testing with real-world data would be wise in any case.
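If that does become a problem, one option (just a sketch -- the field names are made up, use whatever your chart actually needs) is to cache the plain values for the chart instead of the pickled model instances inside `calculate_stuff`:

```
# inside calculate_stuff, instead of pickling the queryset itself:
averages = list(
    TransactionAverage.objects
        .filter(user=user_id)          # plus whatever other filters you need
        .values("date", "average")     # hypothetical field names for the chart
)
cache.set("averages_%s" % user_id, averages, 60 * 10)  # cache.set pickles this for you
```

A list of small dicts is cheap to pickle and unpickle, and it's already in a shape that's easy to hand off to flot on the client side. On the view side you'd then drop the explicit `pickle.loads`/`pickle.dumps` calls and just `cache.get` the list.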
 
