Stripping HTML from text containing < and > characters with Loofah and Nokogiri
<p>I imagine this is common enough that it's a solved problem, but being a bit of a newbie with Loofah and Nokogiri I haven't found the solution yet.</p>

<p>I'm using Loofah, an HTML scrubber library that wraps Nokogiri, to scrub some HTML text for display. However, that text sometimes happens to contain things like e-mail addresses between <code>&lt;</code> and <code>&gt;</code> characters, for example, <code>&lt; foo@domain.com &gt;</code>. Loofah treats that as an HTML or XML tag and strips it from the text.</p>

<p>Is there a way to prevent this from happening while still doing a good job of scrubbing away the actual tags?</p>

<p>Edit: Here's a failing test case:</p>

<pre><code>require 'test/unit'
require 'test/unit/ui/console/testrunner'
require 'nokogiri'

MAGICAL_REGEXP = /&lt;([^(?:\/|!\-\-)].*)&gt;/

def filter_html(content)
  # Current approach in a gist: We capture content enclosed in angle brackets.
  # Then, we check if the excerpt right after the opening bracket is a valid HTML
  # tag. If it's not, we substitute the matched content (which is the captured
  # content enclosed in angle brackets) for the captured content enclosed in
  # the HTML entities for the angle brackets. This does not work with nested
  # HTML tags, since regular expressions are not meant for this.
  content.to_s.gsub(MAGICAL_REGEXP) do |excerpt|
    capture = $1
    Nokogiri::HTML::ElementDescription[capture.split(/[&lt;&gt; ]/).first] ? excerpt : "&amp;lt;#{capture}&amp;gt;"
  end
end

class HTMLTest &lt; Test::Unit::TestCase
  def setup
    @raw_html = &lt;&lt;-EOS
      &lt;html&gt;
      &lt;foo@bar.baz&gt;
      &lt;p&gt;&lt;foo@&lt;b class="highlight"&gt;bar&lt;/b&gt;.baz&gt;&lt;/p&gt;
      &lt;p&gt;
      &lt;foo@&lt;b class="highlight"&gt;bar&lt;/b&gt;.baz&gt;
      &lt;/p&gt;
      &lt; don't erase this &gt;
      &lt;/html&gt;
    EOS

    @filtered_html = &lt;&lt;-EOS
      &lt;html&gt;
      &amp;lt;foo@bar.baz&amp;gt;
      &lt;p&gt;&amp;lt;foo@&lt;b class="highlight"&gt;bar&lt;/b&gt;.baz&amp;gt;&lt;/p&gt;
      &lt;p&gt;
      &amp;lt;foo@&lt;b class="highlight"&gt;bar&lt;/b&gt;.baz&amp;gt;
      &lt;/p&gt;
      &amp;lt; don't erase this &amp;gt;
      &lt;/html&gt;
    EOS
  end

  def test_filter_html
    assert_equal(@filtered_html, filter_html(@raw_html))
  end
end

# Can you make this test pass?
Test::Unit::UI::Console::TestRunner.run(HTMLTest)
</code></pre>

<p>We're currently using some pretty evil regex hackery to try to accomplish this, but as the comment above states, it doesn't work for tags "nested" inside non-tags. And we actually want to preserve the <code>&lt;b class="highlight"&gt;</code> elements as well.</p>

<p>The sample above isn't using Loofah, but the application itself does in other places, so it wouldn't be hard to add it here. We're just not sure what configuration options we should use, if any.</p>
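One possible direction for the "nested" case described above, offered only as a rough sketch and not as something Loofah or Nokogiri provides: instead of capturing everything between a `<` and a `>`, scan for either a complete tag whose name is on an allow-list or a single stray bracket, and escape only the stray brackets. The names `KNOWN_TAGS`, `TAG_OR_BRACKET`, and `escape_stray_brackets` are hypothetical, and the hard-coded tag list is an assumption standing in for the `Nokogiri::HTML::ElementDescription` lookup used in the question.

```ruby
# Hypothetical allow-list; an assumption standing in for a real tag lookup.
KNOWN_TAGS = %w[html p b].freeze

# Matches either a complete opening/closing tag with an allowed name,
# or a single angle bracket that is not part of such a tag.
TAG_OR_BRACKET = /<\/?(?:#{KNOWN_TAGS.join('|')})\b[^<>]*>|[<>]/i

def escape_stray_brackets(content)
  content.to_s.gsub(TAG_OR_BRACKET) do |match|
    case match
    when '<' then '&lt;'   # bracket that does not start an allowed tag
    when '>' then '&gt;'   # bracket that does not end an allowed tag
    else match             # allowed tag, kept verbatim
    end
  end
end

puts escape_stray_brackets('<p><foo@<b class="highlight">bar</b>.baz></p>')
# prints: <p>&lt;foo@<b class="highlight">bar</b>.baz&gt;</p>
```

This is still regex-based, so it has limits: an address whose local part is itself an allowed tag name (for example `<b@example.com>`) would be mistaken for a tag, and attribute values containing `>` are not handled. The output would still need to go through a real scrubber afterwards.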