StackOverflow2013

<p>You should not use an XML parser to parse HTML. Use an HTML parser.</p>

<p>Note that the following is perfectly valid HTML (and an XML parser would choke on it):</p>

<pre><code>&lt;!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
    "http://www.w3.org/TR/html4/strict.dtd"&gt;
&lt;html&gt;
    &lt;head&gt;
        &lt;meta http-equiv="Content-Type" content="text/html; charset=UTF-8"&gt;
        &lt;title&gt;Is this valid?&lt;/title&gt;
    &lt;/head&gt;
    &lt;body&gt;
        &lt;p&gt;This is a paragraph
        &lt;table&gt;
            &lt;tr&gt;
                &lt;td&gt;cell 1
                &lt;td&gt;cell 2
            &lt;tr&gt;
                &lt;td&gt;cell 3
                &lt;td&gt;cell 4
        &lt;/table&gt;
    &lt;/body&gt;
&lt;/html&gt;
</code></pre>

<p>There are many task-specific (in addition to general-purpose) HTML parsers on CPAN. They have worked perfectly for me on an immense variety of extremely messy (and, most of the time, invalid) HTML.</p>

<p>I could give specific recommendations if you described the problem you are trying to solve.</p>

<p>There is also <a href="http://search.cpan.org/perldoc/HTML::TreeBuilder::XPath" rel="nofollow noreferrer">HTML::TreeBuilder::XPath</a>, which uses <a href="http://search.cpan.org/perldoc/HTML::Parser" rel="nofollow noreferrer">HTML::Parser</a> to parse the document into a tree and then lets you query that tree using XPath. I have never used it, but see Randal Schwartz's <a href="http://www.stonehenge.com/merlyn/LinuxMag/col92.html" rel="nofollow noreferrer">HTML Scraping with XPath</a>.</p>

<p>Given the HTML file above (saved as <code>valid.html</code>), the following short script:</p>

<pre><code>#!/usr/bin/perl

use strict;
use warnings;

use HTML::TreeBuilder::XPath;

# Parse the file with an HTML parser; missing end tags are tolerated.
my $tree = HTML::TreeBuilder::XPath-&gt;new;
$tree-&gt;parse_file("valid.html");

# Collect the text content of every table cell via XPath.
my @td = $tree-&gt;findnodes_as_strings('//td');

print $_, "\n" for @td;
</code></pre>

<p>outputs:</p>

<pre>
C:\Temp&gt; z
cell 1
cell 2
cell 3
cell 4
</pre>

<p>The key point here is that the document was parsed by an HTML parser as an HTML document, even though we were able to query it using XPath.</p>
