How do I filter all HTML tags except a certain whitelist?
<p>This is for .NET. IgnoreCase is set and MultiLine is NOT set.</p>
<p>Usually I'm decent at regex, maybe I'm running low on caffeine...</p>
<p>Users are allowed to enter HTML-encoded entities (&amp;lt;, &amp;amp;, etc.), and to use the following HTML tags:</p>
<pre><code>u, i, b, h3, h4, br, a, img
</code></pre>
<p>Self-closing &lt;br/&gt; and &lt;img/&gt; are allowed, with or without the extra space, but are not required.</p>
<p>I want to:</p>
<ol>
<li>Strip all starting and ending HTML tags other than those listed above.</li>
<li>Remove attributes from the remaining tags, <em>except</em> anchors can have an href.</li>
</ol>
<p>My search pattern (replaced with an empty string) so far:</p>
<pre><code>&lt;(?!i|b|h3|h4|a|img|/i|/b|/h3|/h4|/a|/img)[^&gt;]+&gt;
</code></pre>
<p>This <em>seems</em> to be stripping all but the start and end tags I want, but there are three problems:</p>
<ol>
<li>Having to include the end tag version of each allowed tag is ugly.</li>
<li>The attributes survive. Can this happen in a single replacement?</li>
<li>Tags <em>starting with</em> the allowed tag names slip through. E.g., "&lt;abbrev&gt;" and "&lt;iframe&gt;".</li>
</ol>
<p>The following suggested pattern does not strip out tags that have no attributes.</p>
<pre><code>&lt;/?(?!i|b|h3|h4|a|img)\b[^&gt;]*&gt;
</code></pre>
<p>As mentioned below, "&gt;" is legal in an attribute value, but it's safe to say I won't support that. Also, there will be no CDATA blocks, etc. to worry about. Just a little HTML.</p>
<p>Loophole's answer is the best one so far, thanks! Here's his pattern (hoping the PRE works better for me):</p>
<pre><code>static string SanitizeHtml(string html)
{
    string acceptable = "script|link|title";
    string stringPattern = @"&lt;/?(?(?=" + acceptable + @")notag|[a-zA-Z0-9]+)(?:\s[a-zA-Z0-9\-]+=?(?:([""']?).*?\1?)?)*\s*/?&gt;";
    return Regex.Replace(html, stringPattern, "sausage");
}
</code></pre>
<p>Some small tweaks I think could still be made to this answer:</p>
<ol>
<li><p>I think this could be modified to capture simple HTML comments (those that do not themselves contain tags) by adding "!--" to the "acceptable" variable and making a small change to the end of the expression to allow for an optional trailing "\s--".</p></li>
<li><p>I think this would break if there are multiple whitespace characters between attributes (example: heavily-formatted HTML with line breaks and tabs between attributes).</p></li>
</ol>
<p><strong>Edit 2009-07-23:</strong> Here's the final solution I went with (in VB.NET):</p>
<pre><code>Dim AcceptableTags As String = "i|b|u|sup|sub|ol|ul|li|br|h2|h3|h4|h5|span|div|p|a|img|blockquote"
Dim WhiteListPattern As String = "&lt;/?(?(?=" &amp; AcceptableTags &amp; _
    ")notag|[a-zA-Z0-9]+)(?:\s[a-zA-Z0-9\-]+=?(?:([""']?).*?\1?)?)*\s*/?&gt;"
html = Regex.Replace(html, WhiteListPattern, "", RegexOptions.Compiled)
</code></pre>
<p>The caveat is that the HREF attribute of A tags still gets scrubbed, which is not ideal.</p>
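<p>For the two remaining problems (tags that merely <em>start with</em> an allowed name, and the scrubbed HREF), one possible approach is two passes instead of a single replacement. The sketch below is only an illustration under stated assumptions, not the pattern from any of the answers: the class name <code>HtmlWhitelist</code> and the method <code>Sanitize</code> are hypothetical, the tag list is the one from the top of the question, "&gt;" is assumed never to appear inside an attribute value, and only quoted href values are kept. Putting <code>/?</code> and <code>\b</code> inside the negative lookahead keeps the regex from backtracking past the slash and from treating "iframe" as "i".</p>

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not from the original post: a two-pass whitelist filter.
static class HtmlWhitelist
{
    // Allowed tag names, taken from the list in the question.
    const string Allowed = "u|i|b|h3|h4|br|a|img";

    // Pass 1: any tag whose name is not exactly one of the allowed names.
    // The optional slash and the \b live inside the lookahead, so "</i>" is
    // kept while "<iframe>" and "<abbrev>" are matched and removed.
    static readonly Regex Disallowed = new Regex(
        @"<(?!/?(?:" + Allowed + @")\b)[^>]+>",
        RegexOptions.IgnoreCase);

    // Pass 2: allowed start tags, capturing name, raw attributes, optional "/".
    static readonly Regex AllowedStart = new Regex(
        @"<(" + Allowed + @")\b([^>]*?)\s*(/?)>",
        RegexOptions.IgnoreCase);

    // Assumption: only quoted href values are preserved.
    static readonly Regex Href = new Regex(
        @"\bhref\s*=\s*(""[^""]*""|'[^']*')",
        RegexOptions.IgnoreCase);

    public static string Sanitize(string html)
    {
        // Remove every tag that is not on the whitelist.
        string stripped = Disallowed.Replace(html, "");

        // Rebuild each remaining start tag without its attributes,
        // re-adding href only when the tag is an anchor.
        return AllowedStart.Replace(stripped, m =>
        {
            string name = m.Groups[1].Value;
            string keep = "";
            if (name.Equals("a", StringComparison.OrdinalIgnoreCase))
            {
                Match h = Href.Match(m.Groups[2].Value);
                if (h.Success) keep = " href=" + h.Groups[1].Value;
            }
            return "<" + name + keep + m.Groups[3].Value + ">";
        });
    }
}
```

<p>With this sketch, <code>HtmlWhitelist.Sanitize("&lt;iframe src=\"x\"&gt;&lt;/iframe&gt;&lt;b class=\"c\"&gt;bold&lt;/b&gt;")</code> should return <code>&lt;b&gt;bold&lt;/b&gt;</code>. It does not validate what the href points to (for example a <code>javascript:</code> URL), so it should not be read as a complete security filter.</p>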
 
