Process large file in chunks
<p>I have a large file which has two numbers per line and is sorted by the second column. I make a dictionary of lists keyed on the first column.</p>
<p>My code looks like this:</p>
<pre><code>from collections import defaultdict

d = defaultdict(list)
# fin is an already-open file object; iterating over it yields one line at a time
for line in fin:
    vals = line.split()
    d[vals[0]].append(vals[1])
process(d)
</code></pre>
<p>However, the input file is too large, so <code>d</code> will not fit into memory.</p>
<p>To get round this I can in principle read in chunks of the file at a time, but I need the chunks to overlap so that <code>process(d)</code> won't miss anything.</p>
<p>In pseudocode I could do the following.</p>
<ol>
<li>Read 100 lines, creating the dictionary <code>d</code>.</li>
<li>Process the dictionary <code>d</code>.</li>
<li>Delete everything from <code>d</code> that is not within 10 of the max value seen so far.</li>
<li>Repeat, making sure we don't have more than 100 lines' worth of data in <code>d</code> at any time.</li>
</ol>
<p>Is there a nice way to do this in Python?</p>
<p><strong>Update.</strong> More details of the problem. I will use <code>d</code> when reading in a second file of pairs, where I will output a pair depending on how many of the values in the list associated with its first value in <code>d</code> are within 10. The second file is also sorted by the second column.</p>
<p><strong>Fake data.</strong> Let's say we can fit 5 lines of data into memory and we need the overlap in values to be 5 as well.</p>
<pre><code>1 1
2 1
1 6
7 6
1 16
</code></pre>
<p>So now d is {1:[1,6,16],2:[1],7:[6]}.</p>
<p>For the next chunk we only need to keep the last value (as 16-6 > 5). So we would set d to be {1:[16]} and continue reading the next <strong>4</strong> lines.</p>
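<p>The four pseudocode steps above could be sketched roughly as the generator below. This is only one possible reading of the question, not a definitive answer: the names <code>overlapping_chunks</code>, <code>max_lines</code> and <code>overlap</code> are hypothetical, both columns are assumed to be integers, and the input is assumed to be sorted by the second column so that the last value read is the maximum seen so far.</p>

```python
from collections import defaultdict
from itertools import islice


def overlapping_chunks(lines, max_lines, overlap):
    """Yield dicts of lists holding at most max_lines values at a time.

    Hypothetical helper: assumes two integers per line and input sorted
    by the second column.
    """
    d = defaultdict(list)
    kept = 0  # values carried over from the previous chunk
    lines = iter(lines)
    while True:
        room = max_lines - kept
        # Read at least one line so that leftover input is detected
        # even when the carried-over values already fill the budget.
        batch = list(islice(lines, max(room, 1)))
        if not batch:
            break
        if room < 1:
            raise ValueError("values within the overlap window exceed max_lines")
        for line in batch:
            key, value = map(int, line.split())
            d[key].append(value)
            max_seen = value  # input is sorted, so the latest value is the max
        yield d
        # Keep only the values within `overlap` of the max seen so far.
        trimmed = defaultdict(list)
        for key, values in d.items():
            recent = [v for v in values if max_seen - v <= overlap]
            if recent:
                trimmed[key] = recent
        d = trimmed
        kept = sum(len(values) for values in d.values())
```

<p>With an open file object <code>fin</code>, this might be used as <code>for d in overlapping_chunks(fin, 100, 10): process(d)</code>. Each yielded dictionary is replaced rather than trimmed in place, so a chunk that has already been handed to <code>process</code> is not modified afterwards.</p>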
 
