<p>The Beautiful Soup solutions above will not work. You might be able to hack something together on top of them, because Beautiful Soup gives you access to the parse tree. At some point I'll try to solve the problem properly, but that's a week-long project or so, and I don't have a free week coming up.</p>
<p>To be specific: Beautiful Soup will throw exceptions on some parsing errors that the code above doesn't catch, and there are also plenty of very real XSS vulnerabilities that it doesn't catch, like:</p>
<pre><code>&lt;&lt;script&gt;script&gt; alert("Haha, I hacked your page."); &lt;/&lt;/script&gt;script&gt;
</code></pre>
<p>Probably the best thing you can do instead is to escape every <code>&lt;</code> character as <code>&amp;lt;</code>, which prohibits <em>all</em> HTML, and then use a restricted subset like Markdown to render formatting properly. You can then go back and re-introduce common bits of HTML with a regex. Here's roughly what the process looks like:</p>
<pre><code>import re
from markdown import markdown  # third-party Python-Markdown package

_lt_ = re.compile('&lt;')
_tc_ = '~(lt)~'  # or whatever, so long as markdown doesn't mangle it.
_tce_ = re.escape(_tc_)  # the parentheses in _tc_ must be escaped inside a regex
_gt_ = '(?:&gt;|&amp;gt;)'  # markdown may emit '&gt;' as '&amp;gt;', so accept both
_ok_ = re.compile(_tce_ + '(/?(?:u|b|i|em|strong|sup|sub|p|br|q|blockquote|code))' + _gt_, re.I)
_sqrt_ = re.compile(_tce_ + 'sqrt' + _gt_, re.I)      # just to give an example of extending
_endsqrt_ = re.compile(_tce_ + '/sqrt' + _gt_, re.I)  # html syntax with your own elements.
_tcre_ = re.compile(_tce_)

def sanitize(text):
    text = _lt_.sub(_tc_, text)           # hide every '&lt;' behind the placeholder
    text = markdown(text)                 # render the restricted markup
    text = _ok_.sub(r'&lt;\1&gt;', text)        # restore bare whitelisted tags only
    text = _sqrt_.sub(r'&amp;radic;&lt;span style="text-decoration:overline;"&gt;', text)
    text = _endsqrt_.sub(r'&lt;/span&gt;', text)
    return _tcre_.sub('&amp;lt;', text)       # whatever is left stays escaped
</code></pre>
<p>I haven't tested that code yet, so there may be bugs. But you see the general idea: you have to blacklist all HTML in general before you whitelist the OK stuff.</p>
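<p>The escape-first, whitelist-second idea can also be sketched without the Markdown step, using only the standard library. This is a minimal illustration and not a vetted sanitizer: the names <code>sanitize_html</code> and <code>ALLOWED_TAGS</code> are hypothetical, and it relies on <code>html.escape</code> instead of a placeholder string, which is only safe here because nothing runs between escaping and re-allowing tags.</p>

```python
import html
import re

# Hypothetical whitelist for illustration; extend or shrink it to taste.
ALLOWED_TAGS = ("u", "b", "i", "em", "strong", "sup", "sub",
                "p", "br", "q", "blockquote", "code")

# Matches an escaped *bare* tag such as "&lt;b&gt;" or "&lt;/em&gt;".
# Attributes are deliberately not allowed, so "<b onclick=...>" stays escaped.
_ALLOWED_RE = re.compile(
    r"&lt;(/?(?:%s))&gt;" % "|".join(ALLOWED_TAGS),
    re.IGNORECASE,
)


def sanitize_html(text):
    """Escape all markup, then restore only bare whitelisted tags."""
    # Step 1: blacklist everything. <, >, & and quotes all become entities.
    escaped = html.escape(text, quote=True)
    # Step 2: whitelist. Turn the escaped form of approved tags back into tags.
    return _ALLOWED_RE.sub(r"<\1>", escaped)


if __name__ == "__main__":
    print(sanitize_html("<b>bold</b> and <script>alert(1)</script>"))
    print(sanitize_html('<<script>script> alert("x"); </</script>script>'))
```

<p>Because the user's own <code>&amp;</code> characters are escaped in step 1, input that already contains <code>&amp;lt;b&amp;gt;</code> cannot be mistaken for an approved tag in step 2. The sketch does not balance tags, so a stray closing tag can still get through.</p>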
 
