Note that there are some explanatory texts on larger screens.

plurals
  1. POlist comprehension filtering - "the set() trap"
    text
    copied!<p>A reasonably common operation is to filter one <code>list</code> based on another <code>list</code>. People quickly find that this:</p> <pre><code>[x for x in list_1 if x in list_2] </code></pre> <p>is slow for large inputs - it's O(n*m). Yuck. How do we speed this up? Use a <code>set</code> to make filtering lookups O(1):</p> <pre><code>s = set(list_2) [x for x in list_1 if x in s] </code></pre> <p>This gives nice overall O(n) behavior. I however often see even veteran coders fall into <em>The Trap</em>™:</p> <pre><code>[x for x in list_1 if x in set(list_2)] </code></pre> <p>Ack! This is again O(n*m) since python builds <code>set(list_2)</code> <em>every</em> time, not just once.</p> <hr> <p>I thought that was the end of the story - python can't optimize it away to only build the <code>set</code> once. Just be aware of the pitfall. Gotta live with it. Hmm.</p> <pre><code>#python 3.3.2+ list_2 = list(range(20)) #small for demonstration purposes s = set(list_2) list_1 = list(range(100000)) def f(): return [x for x in list_1 if x in s] def g(): return [x for x in list_1 if x in set(list_2)] def h(): return [x for x in list_1 if x in {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19}] %timeit f() 100 loops, best of 3: 7.31 ms per loop %timeit g() 10 loops, best of 3: 77.4 ms per loop %timeit h() 100 loops, best of 3: 6.66 ms per loop </code></pre> <p>Huh, python (3.3) <em>can</em> optimize away a set literal. It's even faster than <code>f()</code> in this case, presumably because it gets to replace a <code>LOAD_GLOBAL</code> with a <code>LOAD_FAST</code>.</p> <pre><code>#python 2.7.5+ %timeit h() 10 loops, best of 3: 72.5 ms per loop </code></pre> <p>Python 2 notably doesn't do this optimization. I've tried investigating further what python3 is doing but unfortunately <code>dis.dis</code> cannot probe the innards of comprehension expressions. Basically everything interesting turns into <code>MAKE_FUNCTION</code>.</p> <p>So now I'm wondering - why can python 3.x optimize away the set literal to only build once, but not <code>set(list_2)</code>?</p>
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload