StackOverflow2013

Note that there are some explanatory texts on larger screens.

plurals

POGet random record set with Django, what is affecting the performance？
text
Body
copied!<p>It said that </p> <pre><code>Record.objects.order_by('?')[:n] </code></pre> <p>have performance issues, and recommend doing something like this: (<a href="https://stackoverflow.com/questions/1731346/how-to-get-two-random-records-with-django">here</a>) </p> <pre><code>sample = random.sample(xrange(Record.objects.count()),n) result = [Record.objects.all()[i] for i in sample] </code></pre> <p>Since that, why not do it directly like this: </p> <pre><code>result = random.sample(Record.objects.all(),n) </code></pre> <p>I have no idea about when these code running what is django actually doing in background. Please tell me the one-line-code at last is more efficient or not? why?</p> <p>================Edit 2013-5-12 23:21 UCT+8 ======================== </p> <p>I spent my whole afternoon to do this test.<br> My computer : CPU Intel i5-3210M RAM 8G System : Win8.1 pro x64 Wampserver2.4-x64 (with apache2.4.4 mysql5.6.12 php5.4.12) Python2.7.5 Django1.4.6</p> <p>What I did was: </p> <ol> <li>Create an app. </li> <li>build a simple model with a index and a <code>CharField</code> content, then <code>Syncdb</code>. </li> <li>Create 3 views can get a random set with 20 records in 3 different ways above, and output the time used. </li> <li>Modify <code>settings.py</code> that <code>Django</code> can output log into console. </li> <li>Insert rows into table, untill the number of the rows is what I want. </li> <li>Visit the 3 views, note the SQL Query statement, SQL time, and the total time</li> <li>repeat 5, 6 in different number of rows in the table.(10k, 200k, 1m, 5m) </li> </ol> <p>This is <code>views.py</code>:</p> <pre><code>def test1(request): start = datetime.datetime.now() result = Record.objects.order_by('?')[:20] l = list(result) # Queryset是惰性的，强制将Queryset转为list end = datetime.datetime.now() return HttpResponse("time: <br/> %s" % (end-start).microseconds/1000)) def test2(request): start = datetime.datetime.now() sample = random.sample(xrange(Record.objects.count()),20) result = [Record.objects.all()[i] for i in sample] l = list(result) end = datetime.datetime.now() return HttpResponse("time: <br/> %s" % (end-start) def test3(request): start = datetime.datetime.now() result = random.sample(Record.objects.all(),20) l = list(result) end = datetime.datetime.now() return HttpResponse("time: <br/> %s" % (end-start) </code></pre> <p>As @Yeo said,<code>result = random.sample(Record.objects.all(),n)</code> is crap. I won't talk about that.<br> But interestingly, <code>Record.objects.order_by('?')[:n]</code> always better then others, especially the table smaller then 1m rows. Here is the data: <img src="https://i.stack.imgur.com/Bl8dN.jpg" alt="test data"> </p> <p>and the charts: <img src="https://i.stack.imgur.com/kdSXZ.jpg" alt="SQL cost"> <img src="https://i.stack.imgur.com/SEwqY.jpg" alt="total cost"></p> <p>So, what's happened? In the last test, 5,195,536 rows in tatget table, <code>result = random.sample(Record.objects.all(),n)</code> actually did ths:</p> <pre><code>(22.275) SELECT `randomrecords_record`.`id`, `randomrecords_record`.`content` FROM `randomrecords_record` ORDER BY RAND() LIMIT 20; args=() </code></pre> <p>Every one is right. And it used 22 seconds. And</p> <pre><code>sample = random.sample(xrange(Record.objects.count()),n) result = [Record.objects.all()[i] for i in sample] </code></pre> <p>actually did ths:</p> <pre><code>(1.393) SELECT COUNT(*) FROM `randomrecords_record`; args=() (3.201) SELECT `randomrecords_record`.`id`, `randomrecords_record`.`content` FROM `randomrecords_record` LIMIT 1 OFFSET 4997880; args=() ...20 lines </code></pre> <p>As you see, get one row, cost 3 seconds. I find that the larger index, the more time needed.<br> But... why? </p> <p>My think is:<br> If there is some way can speed up the large index query, </p> <pre><code>sample = random.sample(xrange(Record.objects.count()),n) result = [Record.objects.all()[i] for i in sample] </code></pre> <p>should be the best. Except(!) the table is smaller then 1m rows.</p>

Querying!

Guidance

An individual column

Larger individual text columns get their own page to allow for proper reading.

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload