StackOverflow2013


How can I get Nokogiri to parse and return an XML document?
<p>Here's a sample of some oddness:</p>
<pre><code>#!/usr/bin/ruby
require 'rubygems'
require 'open-uri'
require 'nokogiri'

print "without read: ", Nokogiri(open('http://weblog.rubyonrails.org/')).class, "\n"
print "with read: ", Nokogiri(open('http://weblog.rubyonrails.org/').read).class, "\n"
</code></pre>
<p>Running this returns:</p>
<pre><code>without read: Nokogiri::XML::Document
with read: Nokogiri::HTML::Document
</code></pre>
<p>Without the <code>read</code> it returns XML, and with it, HTML? The web page is defined as "XHTML transitional", so at first I thought Nokogiri must have been reading OpenURI's "content-type" from the stream, but that returns <code>'text/html'</code>:</p>
<pre><code>(rdb:1) doc = open(('http://weblog.rubyonrails.org/'))
(rdb:1) doc.content_type
"text/html"
</code></pre>
<p>which is what the server is returning. So, now I'm trying to figure out why Nokogiri is returning two different values. It doesn't appear to be parsing the text and using heuristics to determine whether the content is HTML or XML.</p>
<p>The same thing is happening with the Atom feed pointed to by that page:</p>
<pre><code>(rdb:1) doc = Nokogiri.parse(open('http://feeds.feedburner.com/RidingRails'))
(rdb:1) doc.class
Nokogiri::XML::Document
(rdb:1) doc = Nokogiri.parse(open('http://feeds.feedburner.com/RidingRails').read)
(rdb:1) doc.class
Nokogiri::HTML::Document
</code></pre>
<p>I need to be able to parse a page without knowing in advance whether it is HTML or a feed (RSS or Atom), and reliably determine which it is. I asked Nokogiri to parse the body of either an HTML or XML feed file, but I'm seeing those inconsistent results.</p>
<p>I thought I could write some tests to determine the type, but then I ran into XPaths not finding elements while regular searches worked:</p>
<pre><code>(rdb:1) doc = Nokogiri.parse(open('http://feeds.feedburner.com/RidingRails'))
(rdb:1) doc.class
Nokogiri::XML::Document
(rdb:1) doc.xpath('/feed/entry').length
0
(rdb:1) doc.search('feed entry').length
15
</code></pre>
<p>I figured XPaths would work with XML, but the results don't look trustworthy either.</p>
<p>These tests were all done on my Ubuntu box, but I've seen the same behavior on my MacBook Pro. I'd love to find out I'm doing something wrong, but I haven't seen an example for parsing and searching that gave me consistent results. Can anyone show me the error of my ways?</p>
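One way to sidestep the guessing described above might be to decide the document type yourself before handing the body to a parser. The following is only a rough sketch under stated assumptions: `sniff_markup` is a hypothetical helper name, not part of Nokogiri or OpenURI, and the root-element heuristic is an illustration rather than anything Nokogiri is known to do internally.

```ruby
# Hypothetical helper (not a Nokogiri API): guess whether a fetched body is
# HTML, an Atom feed, or an RSS feed by looking at its first element name.
# Assumption: the first 2 KB of the body is enough to reach the root element.
def sniff_markup(body)
  head = body.to_s[0, 2048].to_s

  # Drop the XML declaration / processing instructions, comments, and the
  # DOCTYPE so that the first remaining tag is the root element.
  head = head.gsub(/<\?.*?\?>/m, '')
             .gsub(/<!--.*?-->/m, '')
             .gsub(/<!DOCTYPE[^>]*>/im, '')

  # Capture the first element name, allowing an optional namespace prefix.
  root = head[/<\s*([A-Za-z_][\w.\-]*(?::[\w.\-]+)?)/, 1]

  case root && root.downcase
  when 'feed'           then :atom
  when 'rss', 'rdf:rdf' then :rss
  when 'html'           then :html
  else                       :unknown
  end
end
```

With the kind known, the body could then be passed explicitly to `Nokogiri::XML(body)` or `Nokogiri::HTML(body)` instead of relying on `Nokogiri.parse` to choose. For the Atom case, the feed's elements live in the default namespace `http://www.w3.org/2005/Atom`, which would explain an un-namespaced XPath such as `/feed/entry` matching nothing; binding the namespace, for example `doc.xpath('/xmlns:feed/xmlns:entry')`, is the usual way around that.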
