<p>I was able to reproduce this behavior after installing lxml 3.1.0. Here is a solution based on monkey patching: it replaces the lookup regex in the <code>lxml.html.clean</code> module so that links of the form <code>data:image/.*;base64</code> are excluded from removal.</p>
<pre><code>import re
import lxml
from lxml.html.clean import Cleaner

# Same schemes as before, but "data:" is only flagged when it is NOT
# followed by a base64-encoded image (negative lookahead).
new_pattern = r'\s*(?:javascript:|jscript:|livescript:|vbscript:|data:(?!image/.+;base64)|about:|mocha:)'
print(new_pattern)

# Monkey patch: replace the module-level regex that Cleaner consults.
lxml.html.clean._javascript_scheme_re = re.compile(new_pattern, re.I)

cleaner = Cleaner()
dochtml = """
&lt;img src="http://test.com/img.png"/&gt;
&lt;img src="data:image/png;base64,aGVsbG8="/&gt;
&lt;img src="data:unsafe/contents;base64,aGVsbG8="/&gt;
&lt;img src="data:text/html;base64,PGh0bWw+PHNjcmlwdCB0eXBlPSJ0ZXh0L2phdmFzY3JpcHQiPmFsZXJ0KCdoaScpPC9zY3JpcHQ+PC9odG1sPg=="/&gt;
"""
r = cleaner.clean_html(dochtml)
print(r)
</code></pre>
<p>Result</p>
<pre><code>&lt;span&gt;&lt;img src="http://test.com/img.png"&gt;
&lt;img src="data:image/png;base64,aGVsbG8="&gt;
&lt;img src=""&gt;
&lt;img src=""&gt;
&lt;/span&gt;
</code></pre>
<p>The downside is that it relies on an internal variable name that is not part of the public interface of Cleaner, so the module developers could rename the variable or change the regex in a later version.</p>
<p>To be on the safe side, I would create a URL handler on the web server that returns image contents from the database by id. In your HTML document it would then look something like this: <code>&lt;img src="http://myserver/showimg?id=123213"&gt;</code>. But this would involve adding lots of moving parts, such as a web server. It also won't work if the images must not be accessible to the whole world.</p>
<p><strong><em>Old answer:</em></strong></p>
<p>It should be possible to configure Cleaner to keep these tags, but I cannot reproduce your case; it just works for me. I'm using Python 2.7.2 and lxml 2.2.8 win-32. Which Python and lxml versions do you have?</p>
<p>I tried to run your example and got back the second image tag with its contents intact.</p>
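<p>The effect of a pattern that spares base64 images can be checked without lxml. The following is only a minimal sketch: it assumes the cleaner blanks a link whenever the pattern is found anywhere in it, which is an assumption about lxml internals, and <code>SCHEME_RE</code> and <code>is_blocked</code> are hypothetical names used for illustration. The pattern uses a negative lookahead so that every <code>data:</code> URL is flagged unless it is a base64-encoded image.</p>

```python
import re

# Flag script-like schemes, plus any "data:" URL that is not followed by
# "image/<something>;base64" (negative lookahead).
SCHEME_RE = re.compile(
    r'\s*(?:javascript:|jscript:|livescript:|vbscript:'
    r'|data:(?!image/.+;base64)|about:|mocha:)',
    re.I,
)


def is_blocked(url):
    """Hypothetical helper: True if the pattern occurs anywhere in url."""
    return SCHEME_RE.search(url) is not None


samples = [
    "http://test.com/img.png",
    "data:image/png;base64,aGVsbG8=",
    "data:unsafe/contents;base64,aGVsbG8=",
    "data:application/octet-stream;base64,aGVsbG8=",
]

for url in samples:
    # Report which URLs this pattern would flag for removal.
    print(url, "->", "removed" if is_blocked(url) else "kept")
```

<p>Only the plain HTTP URL and the base64 image are kept; the other <code>data:</code> URLs are flagged regardless of which letter their media type starts with.</p>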