StackOverflow2013

<p>The right thing to do here is to have the creator of the XML file make sure that: A.) the encoding of the file is declared, B.) the XML file is well formed (no invalid control characters, no characters that fall outside the declared encoding, all elements properly closed, etc.), and C.) a DTD or an XML schema is used if you want to ensure that certain attributes/elements exist, have certain values or correspond to a certain format (note: validation carries a performance cost).</p>
<p>So, now to your question. lxml supports a whole bunch of arguments when you use it to parse XML. <a href="http://lxml.de/parsing.html">Check out the documentation</a>. You will want to look at these two arguments:</p>
<p><code>recover</code>: try hard to parse through broken XML<br>
<code>huge_tree</code>: disable security restrictions and support very deep trees and very long text content (only affects libxml2 2.7+)</p>
<p>They will help you to some degree, but certain invalid characters simply cannot be recovered from, so again, ensuring that the file is written correctly is your best bet for clean, well-working code.</p>
<p>Ah yeah, and one more thing: 2 GB is huge. I assume you have a list of similar elements in this file (for example, a list of books). Try to split the file up with a regular expression at the OS level, then start multiple processes to parse the pieces. That way you will be able to use more of the cores on your box and the processing time will go down. Of course, you then have to deal with the complexity of merging the results back together. I cannot make this trade-off for you, but wanted to give it to you as "food for thought".</p>
<p><strong>Addition to post:</strong> If you have no control over the input file and it contains bad characters, I would try to replace/remove these bad characters by iterating over the text before parsing it as a file.
Here is a code sample that removes <a href="http://en.wikipedia.org/wiki/List_of_Unicode_characters">Unicode control characters that you won't need</a>:</p>
<pre><code>import fileinput

# XML 1.0 forbids the control characters below 0x20, except tab (0x09),
# line feed (0x0A) and carriage return (0x0D); map the forbidden ones to None
BAD_CHARS = {cp: None for cp in range(0x20) if cp not in (0x09, 0x0A, 0x0D)}

# inplace=True redirects print() into the input file, rewriting it line by line
for line in fileinput.input(xmlInputFileLocation, inplace=True):
    print(line.translate(BAD_CHARS), end="")
</code></pre>
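<p>To check whether such a cleanup actually makes a piece of XML parseable, something like the following could be used. This is only a minimal sketch: it works on an in-memory string, it uses the standard library's <code>xml.etree.ElementTree</code> instead of lxml so that it has no third-party dependency, and the helper names <code>strip_control_chars</code> and <code>is_well_formed</code> as well as the sample data are made up for illustration.</p>

```python
import xml.etree.ElementTree as ET

# Code points below 0x20 that XML 1.0 forbids (tab, LF and CR are allowed).
FORBIDDEN = {cp: None for cp in range(0x20) if cp not in (0x09, 0x0A, 0x0D)}


def strip_control_chars(text: str) -> str:
    """Hypothetical helper: drop the forbidden control characters from text."""
    return text.translate(FORBIDDEN)


def is_well_formed(xml_text: str) -> bool:
    """Hypothetical helper: True if the standard-library parser accepts xml_text."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Made-up sample: a stray 0x01 control character inside a text node.
sample = "<books><book>Dune\x01</book></books>"
cleaned = strip_control_chars(sample)

print(is_well_formed(sample), is_well_formed(cleaned))
```

<p>For a 2 GB file you would not load everything into one string; the same translation table can be applied line by line, as in the snippet above.</p>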
