StackOverflow2013

<p>I've done some of this recently, and here are my experiences.</p>
<p>There are three basic approaches:</p>
<ol>
<li><strong>Regular Expressions.</strong>
<ul>
<li>Most flexible, easiest to use with loosely-structured info and changing formats.</li>
<li>Harder to do structural/tag analysis, but easier to do text matching.</li>
<li>Built-in validation of data formatting.</li>
<li>Harder to maintain than the others, because you have to write a regular expression for each pattern you want to extract or transform.</li>
<li>Generally slower than options 2 and 3.</li>
<li>Works well for lists of similarly-formatted items.</li>
<li>A good regex development/testing tool and some sample pages will help. I've got good things to say about RegexBuddy here. Try their demo.</li>
<li>I've had the most success with this. The flexibility lets you work with nasty, brutish, in-the-wild HTML code.</li>
</ul></li>
<li><strong>Convert HTML to XHTML and use XML extraction tools.</strong> Clean up the HTML, convert it to legal XHTML, and use XPath/XQuery/X-whatever to query it as XML data.
<ul>
<li>Tools: TagSoup, HTMLTidy, etc.</li>
<li>Quality of the HTML-to-XHTML conversion is VERY important, and highly variable.</li>
<li>Best solution if the data you want is structured by the HTML layout and tags (data in HTML tables, lists, DIV/SPAN groups, etc.).</li>
<li>Most suitable for getting link structures, nested tables, images, lists, and so forth.</li>
<li>Should be faster than option 1, but slower than option 3.</li>
<li>Works well if the content formatting changes or is variable, but the document structure/layout does not.</li>
<li>If the data isn't structured by HTML tags, you're in trouble.</li>
<li>Can be used with option 1.</li>
</ul></li>
<li><strong>Parser generator (ANTLR, etc.)</strong> -- create a grammar for parsing &amp; analyzing the page.
<ul>
<li>I have not tried this because it was not suitable for my (messy) pages.</li>
<li>Most suitable if the HTML is highly structured, regular, and never changes.</li>
<li>Use this if there are easy-to-describe patterns in the document, but they don't involve HTML tags and do involve recursion or complex behaviors.</li>
<li>Does not require XHTML input.</li>
<li>FASTEST throughput, generally.</li>
<li>Steep learning curve, but easier to maintain.</li>
</ul></li>
</ol>
<p>I've tinkered with <a href="http://web-harvest.sourceforge.net/screenshots.php" rel="nofollow noreferrer">Web-Harvest</a> for option 2, but I find its syntax kind of weird: a mix of XML and some pseudo-Java scripting language. If you like Java, and like XML-style data extraction (XPath, XQuery), that might be the ticket for you.</p>
<hr>
<p><strong>Edit: if you use regular expressions, make sure you use a library with lazy quantifiers and capturing groups! PHP's older regex libraries lack these, and they're indispensable for matching data between open/close tags in HTML.</strong></p>
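<p>To make the point about lazy quantifiers and capturing groups concrete, here is a minimal sketch in Python, whose standard <code>re</code> module supports both. The sample markup and the names <code>ITEM_RE</code> and <code>extract_items</code> are made up for illustration; this is not a general-purpose HTML parser, only a demonstration of why the lazy form matters when matching between open/close tags.</p>

```python
import re

# Hypothetical sample markup, invented for this example.
html = '<ul><li class="item">Alpha</li><li class="item">Beta <b>bold</b></li></ul>'

# (.*?) is a lazy capturing group: it stops at the FIRST closing </li>.
# \b keeps the pattern from matching other tags that start with "li" (e.g. <link>).
ITEM_RE = re.compile(r'<li\b[^>]*>(.*?)</li>', re.IGNORECASE | re.DOTALL)


def extract_items(markup):
    """Return the inner content of each <li>...</li> pair, stripped of outer whitespace."""
    return [m.group(1).strip() for m in ITEM_RE.finditer(markup)]


# Lazy version: one result per list item.
lazy_items = extract_items(html)

# Greedy version for comparison: (.*) runs to the LAST </li>,
# so both items are swallowed into a single match.
greedy_items = re.findall(r'<li\b[^>]*>(.*)</li>', html, re.IGNORECASE | re.DOTALL)

print(lazy_items)
print(greedy_items)
```

<p>With the lazy pattern each list item comes back as its own match, while the greedy pattern returns a single match that spans both items. Nested tags of the same kind (an <code>&lt;li&gt;</code> inside an <code>&lt;li&gt;</code>) would still defeat this pattern, which is one reason option 2 is a better fit for tag-structured data.</p>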
