<p>Here is what I use in my own project. The acceptable_elements/attributes come from <a href="http://pythonhosted.org/feedparser/html-sanitization.html" rel="noreferrer">feedparser</a> and BeautifulSoup does the work.</p>
<pre><code>from BeautifulSoup import BeautifulSoup

acceptable_elements = ['a', 'abbr', 'acronym', 'address', 'area', 'b', 'big',
      'blockquote', 'br', 'button', 'caption', 'center', 'cite', 'code', 'col',
      'colgroup', 'dd', 'del', 'dfn', 'dir', 'div', 'dl', 'dt', 'em',
      'font', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'hr', 'i', 'img',
      'ins', 'kbd', 'label', 'legend', 'li', 'map', 'menu', 'ol',
      'p', 'pre', 'q', 's', 'samp', 'small', 'span', 'strike',
      'strong', 'sub', 'sup', 'table', 'tbody', 'td', 'tfoot', 'th',
      'thead', 'tr', 'tt', 'u', 'ul', 'var']

acceptable_attributes = ['abbr', 'accept', 'accept-charset', 'accesskey',
  'action', 'align', 'alt', 'axis', 'border', 'cellpadding', 'cellspacing',
  'char', 'charoff', 'charset', 'checked', 'cite', 'clear', 'cols',
  'colspan', 'color', 'compact', 'coords', 'datetime', 'dir',
  'enctype', 'for', 'headers', 'height', 'href', 'hreflang', 'hspace',
  'id', 'ismap', 'label', 'lang', 'longdesc', 'maxlength', 'method',
  'multiple', 'name', 'nohref', 'noshade', 'nowrap', 'prompt',
  'rel', 'rev', 'rows', 'rowspan', 'rules', 'scope', 'shape', 'size',
  'span', 'src', 'start', 'summary', 'tabindex', 'target', 'title', 'type',
  'usemap', 'valign', 'value', 'vspace', 'width']

def clean_html( fragment ):
    while True:
        soup = BeautifulSoup( fragment )
        removed = False
        for tag in soup.findAll(True): # find all tags
            if tag.name not in acceptable_elements:
                tag.extract() # remove the bad ones
                removed = True
            else: # it might have bad attributes
                # a better way to get all attributes?
                for attr in tag._getAttrMap().keys():
                    if attr not in acceptable_attributes:
                        del tag[attr]

        # turn it back to html
        fragment = unicode(soup)

        if removed:
            # we removed tags, and tricky input could exploit that!
            # we need to reparse the html until it stops changing
            continue # next round

        return fragment
</code></pre>
<p>Some small tests to make sure this behaves correctly:</p>
<pre><code>tests = [
    # text should work
    ('&lt;p&gt;this is text&lt;/p&gt;but this too', '&lt;p&gt;this is text&lt;/p&gt;but this too'),
    # make sure we can't exploit removal of tags
    ('&lt;&lt;script&gt;&lt;/script&gt;script&gt; alert("Haha, I hacked your page."); &lt;&lt;script&gt;&lt;/script&gt;/script&gt;', ''),
    # try the same trick with attributes, gives an Exception
    ('&lt;div on&lt;script&gt;&lt;/script&gt;load="alert("Haha, I hacked your page.");"&gt;1&lt;/div&gt;', Exception),
    # no tags should be skipped
    ('&lt;script&gt;bad&lt;/script&gt;&lt;script&gt;bad&lt;/script&gt;&lt;script&gt;bad&lt;/script&gt;', ''),
    # leave valid tags but remove bad attributes
    ('&lt;a href="good" onload="bad" onclick="bad" alt="good"&gt;1&lt;/div&gt;', '&lt;a href="good" alt="good"&gt;1&lt;/a&gt;'),
]

for text, out in tests:
    try:
        res = clean_html(text)
        assert res == out, "%s =&gt; %s != %s" % (text, res, out)
    except out, e:
        assert isinstance(e, out), "Wrong exception %r" % e
</code></pre>
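<p>The code above targets Python 2 and BeautifulSoup 3. For readers on Python 3, here is a minimal sketch of the same allowlist idea using only the standard library's <code>html.parser</code>. It is an illustration under stated assumptions, not a port of the answer: the class name <code>AllowlistSanitizer</code>, the constants, and the shortened allowlists are all hypothetical. Instead of removing tags and reparsing, it rebuilds the output from parse events and escapes all text, so only allowlisted tags can appear in the result.</p>

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical, shortened allowlists for illustration only;
# substitute the full feedparser-derived lists in real use.
ALLOWED_ELEMENTS = {"a", "b", "br", "div", "em", "i", "p", "span", "strong"}
ALLOWED_ATTRIBUTES = {"alt", "href", "title"}
# Elements that never have a closing tag (partial list, an assumption).
VOID_ELEMENTS = {"area", "br", "col", "hr", "img", "input", "link", "meta"}


class AllowlistSanitizer(HTMLParser):
    """Re-emit allowlisted tags/attributes; drop other elements and their content."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []          # pieces of sanitized output
        self.skip_tag = None   # name of the disallowed element being skipped
        self.skip_depth = 0    # nesting depth of that element

    def handle_starttag(self, tag, attrs):
        if self.skip_tag is not None:
            if tag == self.skip_tag:
                self.skip_depth += 1
            return
        if tag not in ALLOWED_ELEMENTS:
            # Start skipping until the matching end tag (void tags have none).
            if tag not in VOID_ELEMENTS:
                self.skip_tag, self.skip_depth = tag, 1
            return
        kept = "".join(
            ' %s="%s"' % (name, escape(value or "", quote=True))
            for name, value in attrs
            if name in ALLOWED_ATTRIBUTES
        )
        self.out.append("<%s%s>" % (tag, kept))

    def handle_endtag(self, tag):
        if self.skip_tag is not None:
            if tag == self.skip_tag:
                self.skip_depth -= 1
                if self.skip_depth == 0:
                    self.skip_tag = None
            return
        if tag in ALLOWED_ELEMENTS and tag not in VOID_ELEMENTS:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        # Text is always escaped, so it can never form a new tag.
        if self.skip_tag is None:
            self.out.append(escape(data, quote=False))


def clean_html(fragment):
    parser = AllowlistSanitizer()
    parser.feed(fragment)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)


print(clean_html('<a href="good" onclick="bad">1</a>'))  # <a href="good">1</a>
```

<p>An unclosed disallowed element swallows everything after it, which errs on the side of dropping content. Like the original, this sketch does not inspect URL schemes inside <code>href</code>.</p>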
 
