How to make lxml's iterparse ignore invalid XML characters?
    <p>I have an XML file with invalid characters. lxml's XMLParser throws an exception on these invalid characters, but when I create the XMLParser with the <strong>recover=True</strong> option, it ignores the bad characters and works OK.</p>
    <p>My question is: how can I set a similar flag for lxml's iterparse function?</p>
    <p><strong>Reproduction:</strong></p>
    <p>The broken XML (/tmp/z.xml):</p>
    <pre><code>&lt;?xml version="1.0" encoding="utf-8"?&gt;
&lt;items&gt;
	&lt;item&gt;
		&lt;B&gt;Bad characters:&lt;/B&gt;
	&lt;/item&gt;
&lt;/items&gt;
</code></pre>
    <p><strong>NOTE:</strong> There are two ASCII characters #31 (0x1F) after the "Bad characters:" string, which I could not copy-paste here.</p>
    <p>The parsing error of XMLParser:</p>
    <pre><code>import lxml.etree as etree
fd = open('/tmp/z.xml')
parser = etree.XMLParser()
tree = etree.parse(fd, parser)

Traceback (most recent call last):
  File "&lt;stdin&gt;", line 1, in &lt;module&gt;
  File "lxml.etree.pyx", line 2576, in lxml.etree.parse (src/lxml/lxml.etree.c:22796)
  File "parser.pxi", line 1488, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:60390)
  File "parser.pxi", line 1518, in lxml.etree._parseFilelikeDocument (src/lxml/lxml.etree.c:60687)
  File "parser.pxi", line 1401, in lxml.etree._parseDocFromFilelike (src/lxml/lxml.etree.c:59658)
  File "parser.pxi", line 991, in lxml.etree._BaseParser._parseDocFromFilelike (src/lxml/lxml.etree.c:57303)
  File "parser.pxi", line 538, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:53512)
  File "parser.pxi", line 624, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:54372)
  File "parser.pxi", line 564, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:53770)
lxml.etree.XMLSyntaxError: PCDATA invalid Char value 31, line 4, column 21
</code></pre>
    <p>To ignore the bad characters I set <strong>recover=True</strong>, and it works OK:</p>
    <pre><code>import lxml.etree as etree
fd = open('/tmp/z.xml')
parser = etree.XMLParser(recover=True)
tree = etree.parse(fd, parser)
etree.tostring(tree)
# OUTPUT: '&lt;items&gt;\n\t&lt;item&gt;\n\t\t&lt;B&gt;Bad characters:&lt;/B&gt;\n\t&lt;/item&gt;\n&lt;/items&gt;'
</code></pre>
    <p>With iterparse I get the same error again. How can I make it ignore the bad characters?</p>
    <pre><code>fd = open('/tmp/z.xml')
it = etree.iterparse(fd, events=("start", "end"))
for e in it:
    print(e)
...
Traceback (most recent call last):
  File "&lt;stdin&gt;", line 1, in &lt;module&gt;
  File "iterparse.pxi", line 498, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:73245)
  File "parser.pxi", line 564, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:53770)
lxml.etree.XMLSyntaxError: PCDATA invalid Char value 31, line 4, column 21
</code></pre>
    <p>Thanks in advance!</p>
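One possible workaround, sketched here under assumptions rather than taken from the question, is to strip the forbidden control characters before the bytes ever reach iterparse. The wrapper class `CleanedXMLFile` and the helper `parse_cleaned` below are hypothetical names introduced for illustration. The byte-level filter assumes an ASCII-compatible encoding such as UTF-8 (it would corrupt UTF-16 input), and the in-memory `broken` document only stands in for /tmp/z.xml. Recent lxml releases also document a `recover` keyword on iterparse, but whether it is available and how it treats these characters depends on the installed lxml and libxml2 versions, so the sketch does not rely on it.

```python
import io
import re

try:
    from lxml import etree  # used if lxml is installed
except ImportError:
    # Assumption: the stdlib parser is an acceptable stand-in for this demo;
    # its iterparse also accepts a file-like object with a read() method.
    import xml.etree.ElementTree as etree

# Control bytes that XML 1.0 forbids: everything below 0x20 except tab, LF, CR.
_INVALID_XML_BYTES = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f]")


class CleanedXMLFile:
    """Hypothetical file-like wrapper that drops forbidden control bytes."""

    def __init__(self, raw):
        self._raw = raw  # any binary file object

    def read(self, size=-1):
        while True:
            chunk = self._raw.read(size)
            if not chunk:
                return chunk  # real end of input
            cleaned = _INVALID_XML_BYTES.sub(b"", chunk)
            if cleaned:
                return cleaned
            # The chunk held only invalid bytes; read again so that an
            # empty result is never mistaken for end of input.


def parse_cleaned(raw):
    """Return the (event, tag) pairs and the text of each closed element."""
    events, texts = [], {}
    source = CleanedXMLFile(raw)
    for event, elem in etree.iterparse(source, events=("start", "end")):
        events.append((event, elem.tag))
        if event == "end":
            texts[elem.tag] = elem.text  # text is complete only at "end"
    return events, texts


# In-memory stand-in for /tmp/z.xml, with two 0x1F bytes after the colon.
broken = (
    b'<?xml version="1.0" encoding="utf-8"?>\n'
    b"<items>\n\t<item>\n\t\t<B>Bad characters:\x1f\x1f</B>\n\t</item>\n</items>\n"
)

events, texts = parse_cleaned(io.BytesIO(broken))
print(events)
print(repr(texts["B"]))
```

For the file from the question, the same idea would read `parse_cleaned(open('/tmp/z.xml', 'rb'))`. The file must be opened in binary mode so that the wrapper sees raw bytes.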
 
