Strategy for parsing LOTS and LOTS of not-so-well-formed SGML / XML documents
<p>I have thousands of SGML documents, some well-formed, some not so well-formed. I need to get at certain elements in the documents, but every time I try to load and read them into an XDocument, an XmlDocument, or even just a StreamReader, I get various XmlException errors.</p>
<p>Things like "'[' is an unexpected token.". Why? Because I have a document with a DOCTYPE like</p>
<pre><code>&lt;!DOCTYPE RChapter PUBLIC "-//LSC//DTD R Chapter for Authoring//EN" [] &gt; </code></pre>
<p>and I have learned that the "[]" needs to have something valid inside. Again, I don't control the creation of the documents, but I DO HAVE to "crack" them and get at the data I want. Another example is an "unclosed" element, for example:</p>
<pre><code>&lt;Caption&gt;Plants, and facilities&lt;hardhyphen&gt;&lt;hyphen&gt;Inspection.&lt;/Caption&gt; </code></pre>
<p>This XmlException is "The 'hyphen' start tag on line 27 does not match the end tag of 'Caption'. Line 27, position 58." Obvious, right?</p>
<p>But then the question is: how can I actually get at certain elements in these documents without encountering XmlExceptions? Is a SAX parser the right way? I basically want to open the document, go right to the element I want (without worrying about what might or might not be well-formed nearby), pull the data, and move on. Should I just forget parsing with XmlDocument or XDocument and do simple string replacements like</p>
<pre><code>str.Replace("&lt;hardhyphen&gt;&lt;hyphen&gt;", "-") </code></pre>
<p>and then try to load the result into one of the XML parsers? Any tips on strategies?</p>
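
To make the string-replacement idea concrete, here is a minimal sketch of a "pre-clean, then parse" pass over the two examples above. It is an illustration under assumptions, not a general SGML solution: the `SgmlPreclean` class and its `Clean` method are hypothetical names, the `RChapter` root element is guessed from the DOCTYPE, and the replacement table covers only the single tag pair shown in the question.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class SgmlPreclean
{
    // Matches a DOCTYPE declaration with an optional internal subset such as "[]".
    // Assumption: the internal subset contains no ']' or '>' characters.
    private static readonly Regex Doctype = new Regex(
        @"<!DOCTYPE[^\[>]*(\[[^\]]*\])?[^>]*>",
        RegexOptions.IgnoreCase);

    // Hypothetical helper, not part of any library.
    public static string Clean(string raw)
    {
        // 1. Drop the DOCTYPE so the XML parser never sees the "[]" subset.
        string text = Doctype.Replace(raw, string.Empty);

        // 2. Replace known unclosed SGML-style tags with plain text.
        //    Only the pair from the question is listed; a real corpus
        //    would need its own, longer table.
        text = text.Replace("<hardhyphen><hyphen>", "-");

        return text;
    }

    public static void Main()
    {
        // Sample input assembled from the two fragments in the question;
        // the <RChapter> wrapper is an assumption.
        string raw =
            "<!DOCTYPE RChapter PUBLIC \"-//LSC//DTD R Chapter for Authoring//EN\" [] >\n" +
            "<RChapter><Caption>Plants, and facilities<hardhyphen><hyphen>Inspection.</Caption></RChapter>";

        XDocument doc = XDocument.Parse(Clean(raw));
        Console.WriteLine(doc.Root.Element("Caption").Value);
    }
}
```

Dropping the DOCTYPE also drops any entity declarations it would have pulled in, so documents that rely on DTD-defined entities would still fail to parse after this step.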