Parsing unstructured documents into XML
<p>I am parsing unstructured documents into a structured representation (XML) using a template to describe the intended result. A simple typical problem might be a list of strings:</p>
<pre><code>"Chapter 1"
"Section background"
"this is something"
"this is another"
"Section methods"
"take some xxx"
"do yyy"
"and some..."
"Chapter apparatus"
"we created..."
</code></pre>
<p>which I wish to transform to:</p>
<pre><code>&lt;div role="CHAPTER" title="1"&gt;
  &lt;div role="SECTION" title="background"&gt;
    &lt;p&gt;this is a paragraph...&lt;/p&gt;
    &lt;p&gt;this is another...&lt;/p&gt;
  &lt;/div&gt;
  &lt;div role="SECTION" title="methods"&gt;
    &lt;p&gt;take some xxx&lt;/p&gt;
    &lt;p&gt;do yyy&lt;/p&gt;
    &lt;p&gt;and some...&lt;/p&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;div role="CHAPTER" title="apparatus"&gt;
  &lt;div role="SECTION" title="???"&gt;
    &lt;p&gt;we created...&lt;/p&gt;
  &lt;/div&gt;
&lt;/div&gt;
</code></pre>
<p>The labels CHAPTER and SECTION are not present in the strings; they are generated by heuristic regexes (e.g. "<code>[Cc]hap(ter)?(\s\d+\.)?.*</code>") that are applied to all strings.</p>
<p>The intended result is described by a "template", which currently looks something like:</p>
<pre><code>&lt;template count="0," role="CHAPTER"&gt;
  &lt;regex&gt;[Cc]hap(ter)?(\s+.*)&lt;/regex&gt;
  &lt;template count="0," role="SECTION"&gt;
    &lt;regex&gt;[Ss]ec(tion)?(\s+.*)&lt;/regex&gt;
    &lt;template count="0," role="p"&gt;
      &lt;regex&gt;.*&lt;/regex&gt;
    &lt;/template&gt;
  &lt;/template&gt;
&lt;/template&gt;
</code></pre>
<p>(In some cases counts can be ranges, e.g. 2,4.)</p>
<p>I know this is a very hard problem (SGML attempted to tackle parts of it) and that real documents do not conform tidily to such templates, so I am prepared for partial parses and to lose some precision and recall.</p>
<p>For some years I have used my own code, which works for documents of up to a few megabytes across a range of types. Performance is not an issue. 
I have different templates for different document types (theses, logfiles, Fortran output, etc.). Some documents have a nested structure (e.g. as above) while others are flatter but have many more types of markup.</p>
<p>I am now refactoring this and wonder:</p>
<ul>
<li>Is there an open-source toolkit that addresses this problem? (preferably Java)</li>
<li>If not, can I use an XSLT2 grouping strategy combined with regular expressions?</li>
<li>Or should I use an automaton? If so, should I use a toolkit or write my own?</li>
</ul>
<p>EDIT: @naspinski and generally. It will always be possible to write specific scripting code to solve particular problems. I want a general solution, as I may be parsing many (even millions of) documents with considerable (but not infinite) variability in structure. I want the structure of the parsed documents to be expressed in XML, not script. I believe that it will be easier to add new solutions through templates (declarative) rather than scripts.</p>
<p><strong>EDIT: I am almost certain that my best approach now is to use ANTLR.</strong> It is a powerful tool which, from my initial explorations, can parse lines and groups of lines.</p>
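<p>For illustration, here is a minimal Java sketch of the template-driven nesting described above. It is a toy under stated assumptions, not an existing toolkit and not the working code mentioned in the question: the class name <code>TemplateSketch</code>, its <code>parse</code> method and the simplified regexes (which capture the title without leading whitespace) are hypothetical, the template is hard-coded rather than read from XML, and <code>count</code> constraints are ignored. A heading closes any open block at the same or a deeper level, and a missing ancestor is opened with the placeholder title <code>???</code>.</p>

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: label each line with a heuristic regex, then nest by level.
class TemplateSketch {
    // Heading levels, outermost first. Group 1 of each regex is taken as the title.
    // These stand in for the <template> elements; they are assumptions, not a real API.
    static final String[] ROLES = {"CHAPTER", "SECTION"};
    static final Pattern[] REGEXES = {
        Pattern.compile("[Cc]hap(?:ter)?\\s+(.*)"),
        Pattern.compile("[Ss]ec(?:tion)?\\s+(.*)")
    };

    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    static String indent(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) sb.append("  ");
        return sb.toString();
    }

    static void openDiv(StringBuilder out, int level, String title) {
        out.append(indent(level)).append("<div role=\"").append(ROLES[level])
           .append("\" title=\"").append(escape(title)).append("\">\n");
    }

    static String parse(List<String> lines) {
        StringBuilder out = new StringBuilder();
        int open = 0; // number of heading levels currently open
        for (String line : lines) {
            int depth = ROLES.length; // default: the line is a paragraph
            String title = null;
            for (int i = 0; i < ROLES.length; i++) {
                Matcher m = REGEXES[i].matcher(line);
                if (m.matches()) { depth = i; title = m.group(1); break; }
            }
            // A heading closes every open block at its own level or deeper.
            while (open > depth) {
                open--;
                out.append(indent(open)).append("</div>\n");
            }
            // Open any missing ancestors with a placeholder title.
            while (open < depth) {
                openDiv(out, open, "???");
                open++;
            }
            if (depth < ROLES.length) {
                openDiv(out, depth, title);
                open++;
            } else {
                out.append(indent(open)).append("<p>").append(escape(line)).append("</p>\n");
            }
        }
        while (open > 0) { // close whatever is still open at end of input
            open--;
            out.append(indent(open)).append("</div>\n");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(parse(List.of(
            "Chapter 1", "Section background", "this is something", "this is another",
            "Section methods", "take some xxx", "do yyy", "and some...",
            "Chapter apparatus", "we created...")));
    }
}
```

<p>Because the levels form a fixed hierarchy, a single counter is enough here. Templates with ranged counts or several sibling roles at one level would need a real automaton or a grammar, which is where a tool such as ANTLR comes in.</p>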
 
