StackOverflow2013

<p>The standard <a href="http://docs.python.org/2/library/functions.html#open"><code>open()</code> function</a> already returns a buffered file by default (if available on your platform). For file objects that is <em>usually</em> fully buffered.</p> <p><em>Usually</em> here means that Python leaves this to the C stdlib implementation; it uses a <a href="http://linux.die.net/man/3/fopen"><code>fopen()</code> call</a> (<code>_wfopen()</code> on Windows to support UTF-16 filenames), which means that the C library picks the default buffer size for the file; on Linux I believe that would be 8 KB. For a pure-read operation like XML parsing this type of buffering is <em>exactly</em> what you want.</p> <p>The XML parsing done by <code>iterparse</code> reads the file in chunks of 16384 bytes (16 KB).</p> <p>If you want to control the buffer size, use the <code>buffering</code> keyword argument:</p> <pre><code>open('foo.xml', buffering=(2<<16) + 8) # buffer enough for 8 full parser reads </code></pre> <p>which will override the default buffer size (which I'd expect to match the file block size or a multiple thereof). According to <a href="http://seann.herdejurgen.com/resume/samag.com/html/v11/i04/a6.htm">this article</a> increasing the read buffer <em>should</em> help, and using a size of at least 4 times the expected read block size plus 8 bytes is going to improve read performance. In the above example I've set it to 8 times the ElementTree read size, plus those 8 bytes.</p> <p>The <a href="http://docs.python.org/2/library/io.html#io.open"><code>io.open()</code> function</a> represents the new Python 3 I/O structure of objects, where I/O has been split up into a new hierarchy of class types to give you more flexibility.
The price is more indirection: the data has to travel through more layers, and the Python C code does more work itself instead of leaving that to the OS.</p> <p>You <em>could</em> try and see if <code>io.open('foo.xml', 'rb', buffering=2<<16)</code> is going to perform any better. Opening in <code>rb</code> mode will give you an <a href="http://docs.python.org/2/library/io.html#io.BufferedReader"><code>io.BufferedReader</code> instance</a>.</p> <p>You do <em>not</em> want to use <code>io.TextIOWrapper</code>; the underlying expat parser wants raw data as it'll decode your XML file encoding itself. You get this type if you open in <code>r</code> (text mode) instead, and it would only add extra overhead.</p> <p>Using <code>io.open()</code> may give you more flexibility and a richer API, but the underlying file is opened using the C-level <code>open()</code> call instead of <code>fopen()</code>, and all buffering is handled by the Python <code>io.BufferedIOBase</code> implementation.</p> <p>Your problem will be processing this beast, not the file reads, I think. The disk cache will be pretty much shot anyway when reading an 800 GB file.</p>
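<p>If you want to measure rather than guess, a rough timing sketch along these lines may help. It assumes Python 3 (where the built-in <code>open()</code> is <code>io.open()</code>); the helper names <code>count_elements</code> and <code>compare_buffer_sizes</code> and the chosen multiples are made up for illustration, and the 16384-byte figure is simply the read size quoted above, not something the code verifies.</p>

```python
import io
import time
import xml.etree.ElementTree as ET

# Assumption: the 16384-byte parser read size mentioned in the answer.
PARSER_READ_SIZE = 16 * 1024


def count_elements(path, buffering):
    """Parse `path` with iterparse; return (element count, elapsed seconds)."""
    start = time.perf_counter()
    count = 0
    # Binary mode: expat decodes the XML encoding itself, so no TextIOWrapper.
    with io.open(path, "rb", buffering=buffering) as source:
        for _event, elem in ET.iterparse(source, events=("end",)):
            count += 1
            elem.clear()  # drop the element's text and children once handled
    return count, time.perf_counter() - start


def compare_buffer_sizes(path, multiples=(1, 4, 8)):
    """Time the same parse with buffers of 1x, 4x and 8x the parser read size."""
    results = {}
    for multiple in multiples:
        size = multiple * PARSER_READ_SIZE
        results[size] = count_elements(path, size)
    return results
```

<p>Call <code>compare_buffer_sizes('foo.xml')</code> on a representative slice of the data and compare the timings. Repeated runs over the same file are served from the OS disk cache, so treat the numbers as a hint only.</p>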
