<p>Nokogiri fixes up malformed HTML to make it parseable. Once it has finished, the DOM is in a reasonable state, but the original document is no longer available from Nokogiri.</p>

<p>If you want the original to be untouched, you have to make it valid before passing it to Nokogiri; then you can manipulate it using Nokogiri's methods. Typically I'd do that with some regex to find the trouble spots and add or adjust tags or their associated closing tags, so Nokogiri can parse without needing to fix things.</p>

<p>It's not a case of HTML being smarter than XML. Nokogiri honors the spirit of the XML specification, which is rigid, and raises flags by populating the <code>errors</code> array when the file is invalid. HTML has a less rigid specification and, because browsers are (too) forgiving when parsing and displaying HTML, Nokogiri follows along somewhat, does fixups, and then populates the <code>errors</code> array. (In either case, you can check that array to see what's wrong.)</p>

<pre><code>require 'nokogiri'

orig_html = '
&lt;html&gt;
&lt;meta name="Generator" content="Microsoft Word 97 O.o"&gt;
&lt;body&gt;
1
&lt;b&gt;&lt;p&gt;2&lt;/p&gt;&lt;/b&gt;
3
&lt;/body&gt;
&lt;/html&gt;'

doc = Nokogiri::HTML(orig_html)
doc.errors
</code></pre>

<p><code>doc.errors</code> contains:</p>

<pre><code>[
    [0] #&lt;Nokogiri::XML::SyntaxError: Unexpected end tag : b&gt;
]
</code></pre>

<p>Here's how I'd use Nokogiri to fix your sample HTML:</p>

<pre><code>doc = Nokogiri::HTML(orig_html)

# After the fixup, the empty &lt;b&gt; sits immediately before the &lt;p&gt;.
p = doc.at('b+p')
p.previous_sibling.remove
</code></pre>

<p>This is the HTML at this point:</p>

<pre><code>&lt;!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"&gt;
&lt;html&gt;
&lt;head&gt;
&lt;meta http-equiv="Content-Type" content="text/html; charset=UTF-8"&gt;
&lt;meta name="Generator" content="Microsoft Word 97 O.o"&gt;
&lt;/head&gt;
&lt;body&gt;
1
&lt;p&gt;2&lt;/p&gt;
3
&lt;/body&gt;
&lt;/html&gt;
</code></pre>

<p>Continuing:</p>

<pre><code># Wrap the paragraph's text in &lt;b&gt; inside the &lt;p&gt;, where it is valid.
p.inner_html = "&lt;b&gt;#{p.content}&lt;/b&gt;"
puts doc.to_html
</code></pre>

<p>This is the resulting HTML:</p>

<pre><code>&lt;!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"&gt;
&lt;html&gt;
&lt;head&gt;
&lt;meta http-equiv="Content-Type" content="text/html; charset=UTF-8"&gt;
&lt;meta name="Generator" content="Microsoft Word 97 O.o"&gt;
&lt;/head&gt;
&lt;body&gt;
1
&lt;p&gt;&lt;b&gt;2&lt;/b&gt;&lt;/p&gt;
3
&lt;/body&gt;
&lt;/html&gt;
</code></pre>

<p>It's pretty obvious the sample HTML isn't what you're really working with, so you'll have to change the accessors to locate the tags that need to be changed, but that should get you going.</p>
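<p>For the other route mentioned above, cleaning the markup with a regex before Nokogiri ever sees it, a minimal sketch might look like the following. The helper name <code>hoist_paragraphs</code> and the pattern are hypothetical, not part of Nokogiri, and the pattern only handles a single text-only, attribute-free <code>&lt;p&gt;</code> wrapped directly in <code>&lt;b&gt;</code>; real documents will almost certainly need more cases.</p>

```ruby
# Hypothetical pre-cleaning helper (plain Ruby, not a Nokogiri API).
# Matches <b><p>text</p></b> where the paragraph holds only text, so
# nested tags and multi-paragraph runs are left alone, not mangled.
BOLD_AROUND_PARAGRAPH = %r{<b>\s*<p>([^<]*)</p>\s*</b>}i

# Rewrites <b><p>text</p></b> as <p><b>text</b></p>, which is valid nesting.
def hoist_paragraphs(html)
  html.gsub(BOLD_AROUND_PARAGRAPH) { "<p><b>#{Regexp.last_match(1)}</b></p>" }
end

orig_html = <<~HTML
  <html>
  <body>
  1
  <b><p>2</p></b>
  3
  </body>
  </html>
HTML

fixed_html = hoist_paragraphs(orig_html)
puts fixed_html
```

<p>The cleaned string can then be handed to <code>Nokogiri::HTML</code>, and <code>doc.errors</code> checked again to see whether anything is still being fixed up.</p>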
