
  1. BeautifulSoup not Correctly Taking all the HTML
<p>I am trying to write a simple scraping program for an academic project using BeautifulSoup and Mechanize in Python. I am trying to get the prices of some products from Amazon because I would like to test various theories about their pricing models. The problem I am running into is that BeautifulSoup intermittently fails to take the whole page of HTML from Mechanize. I have logged both pages to a text file each time there is an error, and every time the Mechanize page is fully formed while the BeautifulSoup page is only half there. Here is my code:</p>
<pre><code>def process_product_url(product_url):
    """Scrapes and returns all the data in the given product url"""
    #Download product_page given product_url
    product_page_mech, product_page_bs = get_product_page_mech_bs(product_url)

    #Extract Price
    price = extract_price(product_page_bs)

    return price

def get_product_page_mech_bs(url):
    """Takes a product page url in str and returns the mech page and bs page"""
    while True:
        mech_page = get_mech_page(url)
        bs_page = BeautifulSoup(unicode(mech_page.response().read(), 'latin-1'))
        if not test_product_page(bs_page):
            log(unicode(bs_page))
            log(unicode(mech_page.response().read(), 'latin-1'))
            continue
        return mech_page, bs_page

def test_product_page(product_page_bs):
    """Takes a BS product page and tests to see if proper"""
    if product_page_bs.findAll('span', attrs={'id' : 'actualPriceValue'}) == []:
        return False
    else:
        return True

def get_mech_page(url):
    """Given a URL, returns Mechanize page object"""
    while True:
        try:
            br = initialize_browser()
            br.open(url)
            return br
        except Exception, e:
            print e
            print traceback.print_exc()
            continue

def initialize_browser():
    """Returns a fully setup mechanize browser instance"""
    br = mechanize.Browser()
    br.addheaders = [("User-agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:9.0.1) Gecko/20100101 Firefox/9.0.1")]
    return br
</code></pre>
<p>I have uploaded the <a href="http://pastebin.com/h3uizytY" rel="nofollow">BeautifulSoup output</a> and <a href="http://pastebin.com/i3dsuVYY" rel="nofollow">Mechanize output</a> of this page: http://www.amazon.com/Fujifilm-X-Pro-Digital-Camera-Body/dp/B006UV6YMQ/ref=sr_1_2?s=electronics&amp;ie=UTF8&amp;qid=1328359488&amp;sr=1-2 (I can't paste more than two links)</p>
<p>EDIT: Clarified &amp; expanded</p>
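<p>To narrow down where the page gets cut off, a check along the following lines might help. It is only a minimal sketch: <code>diagnose_truncation</code> is a hypothetical helper name, not part of BeautifulSoup or Mechanize, and it assumes the raw HTML has been read into a string once and that the parsed page has been converted back to text (for example with <code>unicode(bs_page)</code>). It simply reports whether the <code>actualPriceValue</code> marker is already missing from the download or only disappears after parsing.</p>

```python
def diagnose_truncation(raw_html, parsed_text, marker="actualPriceValue"):
    """Report where the marker was lost.

    raw_html    -- the page body as one string, read from the response once
    parsed_text -- the parsed page serialized back to text
    marker      -- a substring expected on every good product page
                   (assumed here to be the price span's id)

    Returns 'download' if the raw HTML never contained the marker,
    'parser' if the raw HTML has it but the parsed text does not,
    and 'ok' if both contain it.
    """
    if marker not in raw_html:
        # The server response itself lacks the element.
        return "download"
    if marker not in parsed_text:
        # The element was downloaded but dropped during parsing.
        return "parser"
    return "ok"
```

<p>If this returns <code>'parser'</code> for the failing pages, the download is fine and the problem lies in how the markup is being parsed; if it returns <code>'download'</code>, the response itself is missing the element.</p>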
 
