  1. POWeb scraper returns NoneType object for certain articles
    <p>Logical flow of the scraper: article links extracted from an XML feed are put into a list called <code>self.raw_html</code>. The following [simplified] method is then called to isolate the container each article sits in and extract the text from it:</p>
<pre><code>def fetch_article_contents(self):
    for article in self.raw_html:
        if self.css_selector_type == 'class':
            soup = article.find(self.html_element, self.css_selector)
            soup = soup.get_text()
            self.article_html.append(soup)
    return self.article_html
</code></pre>
    <p>This works well on most feeds, but on two notable exceptions (Forbes and the Official Google Blog) it fails with the following message when <code>get_text()</code> is called:</p>
<pre><code>AttributeError: 'NoneType' object has no attribute 'get_text'
</code></pre>
    <p>My first step in debugging was to see what was returning a NoneType object, so I put a <code>print type(soup)</code> right before <code>soup = soup.get_text()</code>. For Forbes I found:</p>
<pre><code>&lt;class 'bs4.element.Tag'&gt; (25 times, condensed to save space)
&lt;type 'NoneType'&gt;
</code></pre>
    <p>This also strikes me as strange, because there are currently 29 articles in <code>self.raw_html</code> when fetching the Forbes XML feed, as verified by <code>len(self.raw_html)</code> when the class is initialized.</p>
    <p>The Official Google Blog returns:</p>
<pre><code>&lt;class 'bs4.element.Tag'&gt; (just once this time)
&lt;type 'NoneType'&gt;
</code></pre>
    <p>and in reality has 25 fetched articles.</p>
    <p>What is the problem I'm encountering? Thanks!</p>
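The output above is consistent with how Beautiful Soup's `find()` behaves: it returns `None` when no element matches, and the resulting `AttributeError` aborts the loop, so the remaining articles are never reached (which would explain seeing 25 tags out of 29 articles). A minimal sketch of a guarded version follows. The function name `extract_article_texts` and the `skipped` list are hypothetical, and the sketch assumes every item in `articles` is a parsed page exposing `find()` the way a `BeautifulSoup` object does.

```python
def extract_article_texts(articles, html_element, css_class):
    """Return (texts, skipped_indexes) for an iterable of parsed pages.

    Assumption: each item behaves like a BeautifulSoup object, i.e.
    find() returns a Tag when the container exists and None otherwise.
    """
    texts = []
    skipped = []
    for index, article in enumerate(articles):
        container = article.find(html_element, class_=css_class)
        if container is None:
            # No matching container on this page: record its position
            # and move on instead of calling get_text() on None.
            skipped.append(index)
            continue
        texts.append(container.get_text())
    return texts, skipped
```

The indexes in `skipped` point at the pages that lack the expected container, which is where to look next: those pages may use a different template, or the fetched HTML may not be the article itself.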