<p>I'm the author of the "<em>mygod, he has written a python interpreter using regex...</em>" (i.e. pyminifier) mentioned <a href="https://stackoverflow.com/questions/1769332/script-to-remove-python-comments-docstrings#comment1653880_1769362">at that link below</a> =).<br> I just wanted to chime in and say that I've improved the code quite a bit using the tokenizer module (which I discovered thanks to this question =) ).</p> <p>You'll be happy to note that the code no longer relies so much on regular expressions and uses tokenize to great effect. Anyway, here's the <code>remove_comments_and_docstrings()</code> function from pyminifier<br> (Note: It works properly with the edge cases that previously-posted code breaks on):</p> <pre><code>import cStringIO, tokenize

def remove_comments_and_docstrings(source):
    """
    Returns 'source' minus comments and docstrings.
    """
    io_obj = cStringIO.StringIO(source)
    out = ""
    prev_toktype = tokenize.INDENT
    last_lineno = -1
    last_col = 0
    for tok in tokenize.generate_tokens(io_obj.readline):
        token_type = tok[0]
        token_string = tok[1]
        start_line, start_col = tok[2]
        end_line, end_col = tok[3]
        ltext = tok[4]
        # The following two conditionals preserve indentation.
        # This is necessary because we're not using tokenize.untokenize()
        # (because it spits out code with copious amounts of oddly-placed
        # whitespace).
        if start_line &gt; last_lineno:
            last_col = 0
        if start_col &gt; last_col:
            out += (" " * (start_col - last_col))
        # Remove comments:
        if token_type == tokenize.COMMENT:
            pass
        # This series of conditionals removes docstrings:
        elif token_type == tokenize.STRING:
            if prev_toktype != tokenize.INDENT:
                # This is likely a docstring; double-check we're not inside an operator:
                if prev_toktype != tokenize.NEWLINE:
                    # Note regarding NEWLINE vs NL: The tokenize module
                    # differentiates between newlines that start a new statement
                    # and newlines inside of operators such as parens, brackets,
                    # and curly braces.  Newlines inside of operators are
                    # NEWLINE and newlines that start new code are NL.
                    # Catch whole-module docstrings:
                    if start_col &gt; 0:
                        # Unlabelled indentation means we're inside an operator
                        out += token_string
                    # Note regarding the INDENT token: The tokenize module does
                    # not label indentation inside of an operator (parens,
                    # brackets, and curly braces) as actual indentation.
                    # For example:
                    # def foo():
                    #     "The spaces before this docstring are tokenize.INDENT"
                    #     test = [
                    #         "The spaces before this string do not get a token"
                    #     ]
        else:
            out += token_string
        prev_toktype = token_type
        last_col = end_col
        last_lineno = end_line
    return out
</code></pre>
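<p>The function above targets Python 2 (it imports <code>cStringIO</code>, which no longer exists in Python 3). As a reader's note, here is a minimal Python 3 sketch of the same tokenizer-based approach, using <code>io.StringIO</code> in place of <code>cStringIO</code>; the name <code>strip_comments_and_docstrings</code>, the flattened conditional, and the sample snippet are this sketch's own illustration, not part of pyminifier:</p> <pre><code>import io
import tokenize

def strip_comments_and_docstrings(source):
    """Return 'source' with comments and docstrings stripped (Python 3 sketch)."""
    out = ""
    prev_toktype = tokenize.INDENT
    last_lineno = -1
    last_col = 0
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        token_type, token_string, (start_line, start_col), (end_line, end_col), _ = tok
        # Preserve the original indentation without tokenize.untokenize(),
        # exactly as in the Python 2 version above.
        if start_line &gt; last_lineno:
            last_col = 0
        if start_col &gt; last_col:
            out += " " * (start_col - last_col)
        if token_type == tokenize.COMMENT:
            pass  # drop comments entirely
        elif token_type == tokenize.STRING:
            # A string directly after INDENT or NEWLINE is a docstring and is
            # dropped; strings that begin past column 0 after some other token
            # sit inside an expression and are kept.
            if prev_toktype not in (tokenize.INDENT, tokenize.NEWLINE) and start_col &gt; 0:
                out += token_string
        else:
            out += token_string
        prev_toktype = token_type
        last_col = end_col
        last_lineno = end_line
    return out

if __name__ == "__main__":
    sample = 'def f(x):\n    "docstring"\n    return x  # comment\n'
    print(strip_comments_and_docstrings(sample))
</code></pre> <p>Like the original, this leaves the surrounding whitespace of a removed comment or docstring in place, so the result stays valid, compilable Python.</p>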