
cron job fails in gae python
I have a script in Google Appengine that is started every 20 minutes by cron.yaml. This works locally, on my own machine. When I go (manually) to the url which starts the script online, it also works. However, the script always fails to complete online, on Google's instances, when cron.yaml is in charge of starting it.

The log shows no errors, only 2 debug messages:

```
D 2013-07-23 06:00:08.449 type(soup): <class 'bs4.BeautifulSoup'> END type(soup)
D 2013-07-23 06:00:11.246 type(soup): <class 'bs4.BeautifulSoup'> END type(soup)
```

Here's my script:

```python
# coding: utf-8
import jinja2, webapp2, urllib2, re
from bs4 import BeautifulSoup as bs
from google.appengine.api import memcache
from google.appengine.ext import db

class Article(db.Model):
    content = db.TextProperty()
    datetime = db.DateTimeProperty(auto_now_add=True)
    companies = db.ListProperty(db.Key)
    url = db.StringProperty()

class Company(db.Model):
    name = db.StringProperty()
    ticker = db.StringProperty()

    @property
    def articles(self):
        return Article.gql("WHERE companies = :1", self.key())

def companies_key(companies_name=None):
    return db.Key.from_path('Companies', companies_name or 'default_companies')

def articles_key(articles_name=None):
    return db.Key.from_path('Articles', articles_name or 'default_articles')

def scrape():
    companies = memcache.get("companies")
    if not companies:
        companies = Company.all()
        memcache.add("companies",companies,30)
    for company in companies:
        links = links(company.ticker)
        links = set(links)
        for link in links:
            if link is not "None":
                article_object = Article()
                text = fetch(link)
                article_object.content = text
                article_object.url = link
                article_object.companies.append(company.key()) #doesn't work.
                article_object.put()

def fetch(link):
    try:
        html = urllib2.urlopen(url).read()
        soup = bs(html)
    except:
        return "None"
    text = soup.get_text()
    text = text.encode('utf-8')
    text = text.decode('utf-8')
    text = unicode(text)
    if text is not "None":
        return text
    else:
        return "None"

def links(ticker):
    url = "https://www.google.com/finance/company_news?q=NASDAQ:" + ticker + "&start=10&num=10"
    html = urllib2.urlopen(url).read()
    soup = bs(html)
    div_class = re.compile("^g-section.*")
    divs = soup.find_all("div", {"class" : div_class})
    links = []
    for div in divs:
        a = unicode(div.find('a', attrs={'href': re.compile("^http://")}))
        link_regex = re.search("(http://.*?)\"",a)
        try:
            link = link_regex.group(1)
            soup = bs(link)
            link = soup.get_text()
        except:
            link = "None"
        links.append(link)
    return links
```

...and the script's handler in main:

```python
class ScrapeHandler(webapp2.RequestHandler):
    def get(self):
        scrape.scrape()
        self.redirect("/")
```

My guess is that the problem might be the double for loop in the scrape script, but I don't understand exactly why.

**Update:** Articles are indeed being scraped (as many as there should be), and now there are no log errors, or even debug messages at all. Looking at the log, the cron job seemed to execute perfectly. Even so, Appengine's cron job panel says the cron job failed.
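For reference, the cron entry that triggers the handler is along these lines; the `/scrape` path below is just a placeholder for whatever URL the job actually hits:

```yaml
cron:
- description: scrape company news every 20 minutes
  url: /scrape          # placeholder; the real entry points at ScrapeHandler's route
  schedule: every 20 minutes
```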
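And the handler above is registered in main roughly like this (again, `/scrape` is a placeholder for the actual path, which matches the `url` in cron.yaml):

```python
app = webapp2.WSGIApplication([
    ('/scrape', ScrapeHandler),  # placeholder path; hit both manually and by cron
], debug=True)
```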
 
