Python multiple regular expression replace
<p>I'm a Python newbie. I've been searching for days, but have only found bits and pieces of what I need. I'm using Python 2.7 on Windows (I chose Python because it's multiplatform and the result can be portable on Windows).</p>
<p>I'd like to write a script that searches a folder for *.txt UTF-8 text files, loads their content (one file after another), converts non-ASCII characters to HTML entities, and then adds HTML tags at the start and end of each line. There are two variations of the tags: one for the head of the file and one for the tail of the file; head and tail are separated by an empty line. After that, the whole result has to be written out to other text file(s), such as *.htm. To illustrate:</p>
<p>unicode1.txt:</p>
<pre><code>űnícődé text line1
űnícődé text line2
[empty line]
űnícődé text line3
űnícődé text line4
</code></pre>
<p>The result in unicode1.htm has to be:</p>
<pre><code>&lt;p class='aaa'&gt;&amp;#369;n&amp;iacute;c&amp;#337;d&amp;eacute; text line1&lt;/p&gt;
&lt;p class='aaa'&gt;&amp;#369;n&amp;iacute;c&amp;#337;d&amp;eacute; text line2&lt;/p&gt;
[empty line]
&lt;p class='bbb'&gt;&amp;#369;n&amp;iacute;c&amp;#337;d&amp;eacute; text line3&lt;/p&gt;
&lt;p class='bbb'&gt;&amp;#369;n&amp;iacute;c&amp;#337;d&amp;eacute; text line4&lt;/p&gt;
</code></pre>
<p>I started to develop the core of my solution, but I got stuck. 
See the script versions below (for simplicity I chose to encode with xmlcharrefreplace).</p>
<p>V1:</p>
<pre><code>import re, cgi, fileinput
file="_utf8.txt"
text=""
for line in fileinput.input(file, inplace=0):
    line=cgi.escape(line.decode('utf8'),1).encode('ascii', 'xmlcharrefreplace')
    line=re.sub(r"^", "&lt;p&gt;", line, 1)
    text=text+re.sub(r"$", "&lt;/p&gt;", line, 1)
print text
</code></pre>
<p>This worked and gave a good result, but I don't think fileinput is a usable approach for this task.</p>
<p>V2:</p>
<pre><code>import re, cgi, codecs
file="_utf8.txt"
text=""
f=codecs.open(file, encoding='utf-8')
for line in f:
    line=cgi.escape(line,1).encode('ascii', 'xmlcharrefreplace')
    line=re.sub(r"^", "&lt;p&gt;", line, 1)
    text=text+re.sub(r"$", "&lt;/p&gt;", line, 1)
f.close()
print text
</code></pre>
<p>This messed up the result: the closing tag appears at the start of the line, replacing the first letter, etc.</p>
<p>V3 (tried the multiline flag):</p>
<pre><code>import re, cgi, codecs
file="_utf8.txt"
text=""
f=codecs.open(file, encoding='utf-8')
for line in f:
    line=cgi.escape(line,1).encode('ascii', 'xmlcharrefreplace')
    line=re.sub(r"^", "&lt;p&gt;", line, 1, flags=re.M)
    text=text+re.sub(r"$", "&lt;/p&gt;", line, 1, flags=re.M)
f.close()
print text
</code></pre>
<p>Same result.</p>
<p>V4 (tried one regex instead of two):</p>
<pre><code>import re, cgi, codecs
file="_utf8.txt"
text=""
f=codecs.open(file, encoding='utf-8')
for line in f:
    line=cgi.escape(line,1).encode('ascii', 'xmlcharrefreplace')
    text=text+re.sub(r"^(.*)$", r"&lt;p&gt;\1&lt;/p&gt;", line, 1)
f.close()
print text
</code></pre>
<p>Same result. Please help.</p>
<p>Edit: I just checked the result file with a hex editor, and there is a 0x0D byte <em>before</em> each closing tag! Why?</p>
<p>Edit2: changed to a more logical approach:</p>
<pre><code>text+=re.sub(r"^(.*)$", r"&lt;p&gt;\1&lt;/p&gt;", line, 1)
</code></pre>
<p>Edit3: with a hex editor I saw the reason for the messed-up result: an extra CR (0x0D) byte before each CRLF. 
I tracked down what causes the CR problem: the concatenation with +</p>
<pre><code># -*- coding: utf-8 -*-
text=""
f=u"unicode text line1\r\n unicode text line2"
for line in f:
    text+=line
print text
</code></pre>
<p>This results in:</p>
<pre><code>unicode text line1\r\r\n unicode text line2
</code></pre>
<p>Any idea how to fix this?</p>
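<p>A possible direction, offered as a sketch rather than a definitive answer: codecs.open reads the file in binary mode, so each line still ends in \r\n, and the regex $ matches just before the final \n, which would place &lt;/p&gt; between the \r and the \n. Stripping the line ending before wrapping avoids this. The function name wrap_lines and its parameters are hypothetical, and the code assumes Python 3, where html.escape stands in for the cgi.escape call used above.</p>

```python
import html


def wrap_lines(text, head_class="aaa", tail_class="bbb"):
    """Wrap every non-empty line in a <p> tag.

    Lines before the first empty line get head_class; lines after it
    get tail_class. Both class names are illustrative defaults.
    """
    out = []
    css = head_class
    # splitlines() splits on \r\n, \n and \r (among other line
    # boundaries) and drops them, so no stray CR survives.
    for line in text.splitlines():
        if not line.strip():
            # The empty line separates head from tail.
            css = tail_class
            out.append("")
            continue
        # Escape &, <, > and quotes, then turn non-ASCII characters
        # into numeric character references such as &#369;.
        escaped = (
            html.escape(line, quote=True)
            .encode("ascii", "xmlcharrefreplace")
            .decode("ascii")
        )
        out.append("<p class='%s'>%s</p>" % (css, escaped))
    return "\n".join(out)
```

<p>Because the line endings are removed before the tags are added, no regex anchors are needed at all. When reading from a file, opening it in text mode (for example with io.open(path, encoding="utf-8")) should also translate \r\n to \n on input.</p>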