Differences in re.findall and re.finditer -- bug in Python 2.7 re module?
<p>While demonstrating Python's regex functionality, I wrote a small program to compare the return values of <code>re.search()</code>, <code>re.findall()</code> and <code>re.finditer()</code>. I'm aware that <code>re.search()</code> returns only the first match in each line and that <code>re.findall()</code> returns only the matched substring(s), without any location information. However, I was surprised to see that the matched substring can differ between the three functions.</p>

<p>Code (<a href="https://gist.github.com/palday/7805267" rel="nofollow">available on GitHub</a>):</p>

<pre><code>#! /usr/bin/env python
# -*- coding: utf-8 -*-
# License: CC-BY-NC-SA 3.0

import re
import codecs

# download kate_chopin_the_awakening_and_other_short_stories.txt
# from Project Gutenberg:
# http://www.gutenberg.org/ebooks/160.txt.utf-8
# with wget:
# wget http://www.gutenberg.org/ebooks/160.txt.utf-8 -O kate_chopin_the_awakening_and_other_short_stories.txt

# match for something o'clock, with valid numerical time or
# any English word with proper capitalization
oclock = re.compile(r"""
                    (
                      [A-Z]?[a-z]+  # word with at most 1 capital letter
                    | 1[012]        # 10, 11, 12
                    | [1-9]         # 1 through 9
                    )
                    \s
                    o'clock""",
                    re.VERBOSE)

path = "kate_chopin_the_awakening_and_other_short_stories.txt"

print
print "re.search()"
print
print u"{:&gt;6} {:&gt;6} {:&gt;6}\t{}".format("Line","Start","End","Match")
print u"{:=&gt;6} {:=&gt;6} {:=&gt;6}\t{}".format('','','','=====')
with codecs.open(path,mode='r',encoding='utf-8') as f:
    for lineno, line in enumerate(f):
        atime = oclock.search(line)
        if atime:
            print u"{:&gt;6} {:&gt;6} {:&gt;6}\t{}".format(lineno, atime.start(), atime.end(), atime.group())

print
print "re.findall()"
print
print u"{:&gt;6} {:&gt;6} {:&gt;6}\t{}".format("Line","Start","End","Match")
print u"{:=&gt;6} {:=&gt;6} {:=&gt;6}\t{}".format('','','','=====')
with codecs.open(path,mode='r',encoding='utf-8') as f:
    for lineno, line in enumerate(f):
        times = oclock.findall(line)
        if times:
            print u"{:&gt;6} {:&gt;6} {:&gt;6}\t{}".format(lineno, '', '', ' '.join(times))

print
print "re.finditer()"
print
print u"{:&gt;6} {:&gt;6} {:&gt;6}\t{}".format("Line","Start","End","Match")
print u"{:=&gt;6} {:=&gt;6} {:=&gt;6}\t{}".format('','','','=====')
with codecs.open(path,mode='r',encoding='utf-8') as f:
    for lineno, line in enumerate(f):
        times = oclock.finditer(line)
        for m in times:
            print u"{:&gt;6} {:&gt;6} {:&gt;6}\t{}".format(lineno, m.start(), m.end(), m.group())
</code></pre>

<p>and output (tested on Python 2.7.3 and 2.7.5):</p>

<pre><code>re.search()

  Line  Start    End    Match
====== ====== ======    =====
   248      7     21    eleven o'clock
  1520     24     35    one o'clock
  1975     21     33    nine o'clock
  2106      4     16    four o'clock
  4443     19     30    ten o'clock

re.findall()

  Line  Start    End    Match
====== ====== ======    =====
   248                  eleven
  1520                  one
  1975                  nine
  2106                  four
  4443                  ten

re.finditer()

  Line  Start    End    Match
====== ====== ======    =====
   248      7     21    eleven o'clock
  1520     24     35    one o'clock
  1975     21     33    nine o'clock
  2106      4     16    four o'clock
  4443     19     30    ten o'clock
</code></pre>

<p>What am I missing here? Why doesn't <code>re.findall()</code> return the <code>o'clock</code> bit?</p>
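
A likely explanation, sketched here rather than taken from the original post: when a pattern contains exactly one capturing group, `re.findall()` returns the text of that group instead of the whole match, whereas `group()` on a match object returns the whole match. The sample line below is a made-up stand-in for the Gutenberg text, and the pattern is a compact, non-verbose version of the one in the question.

```python
import re

# Hypothetical sample line; the question itself reads a Project Gutenberg file.
line = "They met at eleven o'clock and left at 2 o'clock."

# Same shape as the question's pattern: one capturing group, then " o'clock".
capturing = re.compile(r"([A-Z]?[a-z]+|1[012]|[1-9])\so'clock")

# Identical pattern, but the group is non-capturing: (?:...).
non_capturing = re.compile(r"(?:[A-Z]?[a-z]+|1[012]|[1-9])\so'clock")

# With exactly one capturing group, findall() returns that group's text only.
group_only = capturing.findall(line)

# With no capturing groups, findall() returns each whole match.
whole_match = non_capturing.findall(line)

# finditer() yields match objects: group() is the whole match,
# group(1) is the captured part that findall() reported above.
from_iter = [(m.group(), m.group(1)) for m in capturing.finditer(line)]

print(group_only)
print(whole_match)
print(from_iter)
```

If that is what is happening, switching the group to `(?:...)`, or wrapping the whole pattern in a single outer group, should make `re.findall()` agree with `m.group()` from the other two functions.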