BeautifulSoup drops text when fixing up broken markup
<p>I'm pretty new to Python, but what the heck... This is kind of a weird question, so I will do my best to explain it as thoroughly as I can:</p>
<p>I'm trying to write a script in Python that checks a webpage for a specific change (basically a number flipping from 0 to 1). When that change occurs, the script will go on to do something else. Unfortunately, I have not gotten to that point yet, because I'm having trouble even parsing the HTML: a lot of it is missing by the time <code>BeautifulSoup</code> gets hold of it! (At least, this is what I claim.)</p>
<p>Let's step through this: I'm using <code>BeautifulSoup</code> and <code>Mechanize</code>. First, I find a form on the webpage and select it, changing controls in the form as I need. (I have verified that all of the controls change as I expect.) After this, I submit the form and then call a helper function I wrote called <code>process_results()</code>:</p>
<pre><code>...
form = list(client.forms())[1]
client.select_form('ttform')
...
# Modify controls
...
client.submit()
process_results(client)
</code></pre>
<p><code>process_results()</code> just checks what the client got back. Depending on what was put into the form, the search can come back invalid, so I would like to look for the error message that the webpage displays in that case and see if it exists. I use <code>BeautifulSoup</code> to do this:</p>
<pre><code># Processes search results.
def process_results(cli):
    html = cli.response().read()
    soup = BeautifulSoup(html)
    ...
</code></pre>
<p>The check for whether that error markup appears on the page looks like:</p>
<pre><code>...
if soup.find('td', attrs={'class': 'msgarea'}) is not None:
    # Do something...
...
</code></pre>
<p>This will never evaluate to true because it cannot find the tag I'm describing. 
I decided to print out both the response directly from <code>Mechanize</code> and the one from <code>BeautifulSoup</code>, and this is what I got (shortened):</p>
<p><code>Mechanize</code> prints the markup I'm looking for, which means that the response is coming back correctly:</p>
<pre><code>...
&lt;TD class=msgarea&gt;
&lt;B class=important_msg&gt;There was a problem with your request:&lt;/B&gt;
&lt;BR&gt;
&lt;BR&gt;
&lt;li class=red_msg&gt;...&lt;/li&gt;
...
&lt;/TD&gt;&lt;/TR&gt;&lt;/TABLE&gt;&lt;P&gt;&lt;/DIV&gt;
...
</code></pre>
<p>This is the last piece of HTML that shows up from <code>BeautifulSoup</code>:</p>
<pre><code>...
&lt;span class="pageheaderlinks"&gt;
&lt;a ... &gt; MENU &lt;/a&gt; |
&lt;a ... &gt; SITE MAP &lt;/a&gt; |
&lt;/span&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;
</code></pre>
<p>For comparison, here is that same section as <code>Mechanize</code> reports it:</p>
<pre><code>...
&lt;SPAN class="pageheaderlinks"&gt;
&lt;A ... &gt;MENU&lt;/A&gt; |
&lt;A ... &gt;SITE MAP&lt;/A&gt; | &lt;!-- Notice how this continues --&gt;
&lt;A ... &gt;HELP&lt;/A&gt; |
&lt;A ... &gt;EXIT&lt;/A&gt;
&lt;/span&gt;
...
</code></pre>
<p>So it seems that <code>BeautifulSoup</code> is omitting a large piece of HTML from the end of what <code>Mechanize</code>'s Browser is reporting. This could be a problem with how I'm going about things, but at this point, I'm incredibly lost.</p>
<p>Does anyone know what could be causing this? Thanks! :)</p>
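<p>A minimal diagnostic sketch, not a fix: one way to confirm that the tag really is present in the raw response, independently of any tree repair done by <code>BeautifulSoup</code>, is to scan the raw HTML with the standard library's event-based <code>html.parser.HTMLParser</code>. This assumes Python 3; the names <code>ClassFinder</code> and <code>has_tag_with_class</code> and the sample markup in <code>raw</code> (including its <code>href="#"</code> values) are hypothetical stand-ins, not taken from the real page.</p>

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Record whether any start tag with the wanted name and class appears."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.found = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <TD class=msgarea>
        # arrives here as tag == "td" with attrs == [("class", "msgarea")].
        if tag == self.tag:
            classes = (dict(attrs).get("class") or "").split()
            if self.css_class in classes:
                self.found = True


def has_tag_with_class(html, tag, css_class):
    """Return True if the raw HTML contains <tag class="... css_class ...">."""
    finder = ClassFinder(tag, css_class)
    finder.feed(html)
    finder.close()
    return finder.found


# Made-up sample loosely modelled on the shortened output in the question.
raw = """
<SPAN class="pageheaderlinks"><A href="#">MENU</A> | <A href="#">HELP</A></span>
<TABLE><TR><TD class=msgarea>
<B class=important_msg>There was a problem with your request:</B>
</TD></TR></TABLE><P></DIV>
"""

print(has_tag_with_class(raw, "td", "msgarea"))
```

<p>If a scan like this finds the tag in the raw response while <code>soup.find()</code> does not, that would point at the parsing step rather than at the form submission.</p>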
 
