Python-Regex-Distinguish and use repeated pattern in a string
<p><strong>Edit:</strong> I came to realize, with the kind help of the answers below, that parsing html with regex is generally a bad idea. For what it's worth, if someone else comes across this post someday with the same question, here are two similar questions on this topic, with far more debate and explanation that you might find useful: <a href="https://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not">Using regular expressions to parse HTML: why not?</a> and <a href="https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454">RegEx match open tags except XHTML self-contained tags</a>.</p> <p><strong>Specs:</strong> Python 3.3.1</p> <p><strong>What I was trying to do:</strong> I was writing a web page extractor to grab weather data from a website, which for my project has 3 meaningful sections: the temperature "Right Now", "Earlier Today" and "Tonight". I intend to grab only these 3 numbers and leave out all other text. In the code below I used the specific html elements that precede each temperature as the pattern to locate the number itself. 
</p> <p>All the data I need is in this excerpt of the page's html (namely <code>89</code>, <code>96</code> and <code>80</code>):</p>
<pre><code>&lt;div class="wx-timepart-title"&gt; Earlier Today &lt;/div&gt;
&lt;div class="wx-timepart-title"&gt;Tonight&lt;/div&gt;
&lt;div class="wx-data-part wx-first"&gt;
&lt;img src="http://s.imwx.com/v.20120328.084208/img/wxicon/120/29.png" height="120" width="120" alt="Partly Cloudy" class="wx-weather-icon"&gt;
&lt;/div&gt;
&lt;div class="wx-data-part"&gt;
&lt;img src="http://s.imwx.com/v.20120328.084208/img/wxicon/120/30.png" height="120" width="120" alt="Partly Cloudy" class="wx-weather-icon"&gt;
&lt;/div&gt;
&lt;div class="wx-data-part"&gt;
&lt;img src="http://s.imwx.com/v.20120328.084208/img/wxicon/120/29.png" height="120" width="120" alt="Partly Cloudy" class="wx-weather-icon"&gt;
&lt;/div&gt;
&lt;div class="wx-data-part wx-first"&gt;
&lt;div class="wx-temperature"&gt;&lt;span itemprop="temperature-fahrenheit"&gt;89&lt;/span&gt;&lt;span class="wx-degrees"&gt;&amp;deg;&lt;span class="wx-unit"&gt;F&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div class="wx-temperature-label"&gt;FEELS LIKE &lt;span itemprop="feels-like-temperature-fahrenheit"&gt;94&lt;/span&gt;&amp;deg;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="wx-data-part"&gt;
&lt;div class="wx-temperature"&gt;96&lt;span class="wx-degrees"&gt;&amp;deg;&lt;/span&gt;&lt;/div&gt;
&lt;div class="wx-temperature-label"&gt;HIGH AT 4:45 PM&lt;/div&gt;
&lt;/div&gt;
&lt;div class="wx-data-part"&gt;
&lt;div class="wx-temperature"&gt;80&lt;span class="wx-degrees"&gt;&amp;deg;&lt;/span&gt;&lt;/div&gt;
&lt;div class="wx-temperature-label"&gt;LOW&lt;/div&gt;
&lt;/div&gt;
</code></pre>
<p><strong>The solution I came up with:</strong></p>
<pre><code>import urllib.request
import re

# open the webpage and read the html code into a string
base = urllib.request.urlopen('http://www.weather.com/weather/today/Washington+DC+USDC0001:1:US')
f = base.readlines()
f = str(f)

# temperature "Right Now"
match1 = re.search(r'&lt;div class="wx-temperature"&gt;&lt;span itemprop="temperature-fahrenheit"&gt;\w\w',f)
if match1:
    result1 = match1.group()
    right_now = result1[68:]
    print(right_now)

# temperature "Earlier Today"
match2 = re.search(r'&lt;div class="wx-temperature"&gt;\w\w',f)
if match2:
    result2 = match2.group()
    earlier_today = result2[28:]
    print(earlier_today)

# temperature "Tonight"
match3 = re.search(r'&lt;div class="wx-temperature"&gt;\w\w',f)
if match3:
    result3 = match3.group()
    tonight = result3[28:]
    print(tonight)
</code></pre>
<p>The three print statements are only there to test whether the data was grabbed correctly.</p>
<p><strong>My question:</strong> the problem occurs with the third regex (<code>match3</code>), which displays the same temperature as <code>match2</code>. I figure that's because it uses the same regex pattern as the second one? So my question is: how do you search for multiple results with the same regex pattern? Or can you only grab the first occurrence of a pattern? I'm quite new to Python and only a few days into regular expressions. I'd appreciate any general pointers about my solution, or about my general line of thinking for this project. Thank you!</p>
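For readers with the same question: `re.search` stops at the first match, so calling it again with the same pattern on the same string returns the same result. A minimal sketch of one way to collect every occurrence is below, using `re.findall` with a capture group so that no slice offsets are needed. The `html` string is a hand-copied stand-in for the excerpt above (an assumption, not the live page), and the names `TEMPERATURE` and `extract_temperatures` are hypothetical, introduced only for illustration.

```python
import re

# Stand-in markup modeled on the excerpt in the question (assumed structure,
# not fetched from the live site).
html = (
    '<div class="wx-temperature"><span itemprop="temperature-fahrenheit">89</span>'
    '<span class="wx-degrees">&deg;<span class="wx-unit">F</span></span></div>'
    '<div class="wx-temperature-label">FEELS LIKE '
    '<span itemprop="feels-like-temperature-fahrenheit">94</span>&deg;</div>'
    '<div class="wx-temperature">96<span class="wx-degrees">&deg;</span></div>'
    '<div class="wx-temperature-label">HIGH AT 4:45 PM</div>'
    '<div class="wx-temperature">80<span class="wx-degrees">&deg;</span></div>'
    '<div class="wx-temperature-label">LOW</div>'
)

# Match the temperature div, skip an optional <span ...> wrapper,
# and capture only the digits that follow.
TEMPERATURE = re.compile(r'<div class="wx-temperature">(?:<span[^>]*>)?(-?\d+)')


def extract_temperatures(markup):
    """Return every captured temperature, in document order, as integers."""
    # With a single capture group, findall returns a list of the group's text.
    return [int(value) for value in TEMPERATURE.findall(markup)]


right_now, earlier_today, tonight = extract_temperatures(html)
print(right_now, earlier_today, tonight)
```

The pattern does not match the `wx-temperature-label` divs because it requires the closing quote directly after `wx-temperature`. As the edit at the top of the question notes, an HTML parser is generally the more robust choice once the markup gets less regular than this.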
 
