<p>You want the <a href="http://nokogiri.org/Nokogiri/XML/Node.html#method-i-inner_text" rel="noreferrer"><code>Nokogiri::XML::Node#inner_text</code></a> method:</p>
<pre><code>require 'nokogiri'
require 'open-uri'

# URI.open is needed on Ruby 3.0+, where a bare `open` no longer accepts URLs
html = Nokogiri::HTML(URI.open('http://stackoverflow.com/questions/6129357'))

# Alternatively, parse a local file:
# html = Nokogiri::HTML(IO.read('myfile.html'))

text = html.at('body').inner_text

# Pretend that all words we care about contain only a-z, 0-9, or underscores
words = text.scan(/\w+/)
p words.length, words.uniq.length, words.uniq.sort[0..8]
#=&gt; 907
#=&gt; 428
#=&gt; ["0", "1", "100", "15px", "2", "20", "2011", "220px", "24158nokogiri"]

# How about words that are only letters?
words = text.scan(/[a-z]+/i)
p words.length, words.uniq.length, words.uniq.sort[0..5]
#=&gt; 872
#=&gt; 406
#=&gt; ["Answer", "Ask", "Badges", "Browse", "DocumentFragment", "Email"]
</code></pre>
<pre class="lang-rb prettyprint-override"><code># Find the most frequent words
require 'pp'

def frequencies(words)
  Hash[
    words.group_by(&amp;:downcase).map{ |word,instances|
      [word,instances.length]
    }.sort_by(&amp;:last).reverse
  ]
end

pp frequencies(words)
#=&gt; {"nokogiri"=&gt;34,
#=&gt;  "a"=&gt;27,
#=&gt;  "html"=&gt;18,
#=&gt;  "function"=&gt;17,
#=&gt;  "s"=&gt;13,
#=&gt;  "var"=&gt;13,
#=&gt;  "b"=&gt;12,
#=&gt;  "c"=&gt;11,
#=&gt;  ...

# Hrm...let's drop the JavaScript code out of our words
html.css('script').remove
words = html.at('body').inner_text.scan(/\w+/)
pp frequencies(words)
#=&gt; {"nokogiri"=&gt;36,
#=&gt;  "words"=&gt;18,
#=&gt;  "html"=&gt;17,
#=&gt;  "text"=&gt;13,
#=&gt;  "with"=&gt;12,
#=&gt;  "a"=&gt;12,
#=&gt;  "the"=&gt;11,
#=&gt;  "and"=&gt;11,
#=&gt;  ...
</code></pre>
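To try the counting step without a network connection or Nokogiri, here is a minimal sketch that works on a plain string. It assumes Ruby 2.7+ for `Enumerable#tally`; the `word_frequencies` name and the sample sentence are made up for illustration and are not part of the answer above.

```ruby
# Hypothetical helper: count case-insensitive word frequencies in a string.
# Assumes Ruby 2.7+ (Enumerable#tally).
def word_frequencies(text)
  text.scan(/\w+/)                         # same "word" definition as above
      .map(&:downcase)                     # treat "HTML" and "html" as one word
      .tally                               # => { "word" => count, ... }
      .sort_by { |word, count| [-count, word] } # most frequent first, ties A-Z
      .to_h
end

# Made-up sample text standing in for html.at('body').inner_text
sample = "Nokogiri parses HTML. HTML in, text out: nokogiri makes text easy."
freqs = word_frequencies(sample)
p freqs.first(3)
```

Sorting on `[-count, word]` rather than `sort_by(&:last).reverse` makes the order of equally frequent words deterministic, which can help when comparing runs.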
 
