  1. Scraping with BeautifulSoup and multiple paragraphs
    <p>I'm trying to scrape a speech from a website using BeautifulSoup. I'm running into problems, however, because the speech is divided into many different paragraphs. I'm extremely new to programming and am having trouble figuring out how to deal with this. The HTML of the page looks like this:</p>
    <pre><code>&lt;span class="displaytext"&gt;Thank you very much. Mr. Speaker, Vice President Cheney, Members of Congress, distinguished guests, fellow citizens: As we gather tonight, our Nation is at war; our economy is in recession; and the civilized world faces unprecedented dangers. Yet, the state of our Union has never been stronger.
&lt;p&gt;We last met in an hour of shock and suffering. In 4 short months, our Nation has comforted the victims, begun to rebuild New York and the Pentagon, rallied a great coalition, captured, arrested, and rid the world of thousands of terrorists, destroyed Afghanistan's terrorist training camps, saved a people from starvation, and freed a country from brutal oppression.
&lt;p&gt;The American flag flies again over our Embassy in Kabul. Terrorists who once occupied Afghanistan now occupy cells at Guantanamo Bay. And terrorist leaders who urged followers to sacrifice their lives are running for their own.
</code></pre>
    <p>It continues on like that for a while, with multiple paragraph tags. I'm trying to extract all of the text within the span.</p>
    <p>I've tried a couple of different ways to get the text, but both have failed to get the text that I want.</p>
    <p>The first I tried is:</p>
    <pre><code>import urllib2,sys
from BeautifulSoup import BeautifulSoup, NavigableString

address = 'http://www.presidency.ucsb.edu/ws/index.php?pid=29644&amp;st=&amp;st1=#axzz1fD98kGZW'
html = urllib2.urlopen(address).read()

soup = BeautifulSoup(html)
thespan = soup.find('span', attrs={'class': 'displaytext'})
print thespan.string
</code></pre>
    <p>which gives me:</p>
    <blockquote> <p>Mr. Speaker, Vice President Cheney, Members of Congress, distinguished guests, fellow citizens: As we gather tonight, our Nation is at war; our economy is in recession; and the civilized world faces unprecedented dangers. Yet, the state of our Union has never been stronger. </p> </blockquote>
    <p>That is the portion of the text up until the first paragraph tag. I then tried:</p>
    <pre><code>import urllib2,sys
from BeautifulSoup import BeautifulSoup, NavigableString

address = 'http://www.presidency.ucsb.edu/ws/index.php?pid=29644&amp;st=&amp;st1=#axzz1fD98kGZW'
html = urllib2.urlopen(address).read()

soup = BeautifulSoup(html)
thespan = soup.find('span', attrs={'class': 'displaytext'})
for section in thespan:
    paragraph = section.findNext('p')
    if paragraph and paragraph.string:
        print '&gt;', paragraph.string
    else:
        print '&gt;', section.parent.next.next.strip()
</code></pre>
    <p>This gave me the text between the first paragraph tag and the second paragraph tag. So, I'm looking for a way to get the entire text, instead of just sections.</p>
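One way to picture the goal (all of the text inside the span, across every unclosed paragraph tag) is the sketch below. It is not the asker's code and it does not use BeautifulSoup: it swaps in Python 3's standard-library `html.parser` so that it runs without extra packages. The names `SpanTextExtractor`, `extract_span_text`, and `sample` are hypothetical, and `sample` is a shortened stand-in for the real page, which is assumed to follow the structure shown in the question.

```python
from html.parser import HTMLParser


class SpanTextExtractor(HTMLParser):
    """Collect all text inside the first <span> carrying a given class."""

    def __init__(self, target_class="displaytext"):
        super().__init__(convert_charrefs=True)
        self.target_class = target_class
        self.depth = 0           # > 0 while inside the target span
        self.done = False        # True once the target span has closed
        self.paragraphs = [[]]   # one list of text fragments per paragraph

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth == 0:
            classes = (dict(attrs).get("class") or "").split()
            if tag == "span" and self.target_class in classes:
                self.depth = 1
            return
        if tag == "span":
            self.depth += 1              # nested span: track nesting level
        elif tag == "p":
            self.paragraphs.append([])   # an unclosed <p> starts a new paragraph

    def handle_endtag(self, tag):
        if self.depth > 0 and tag == "span":
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth > 0:
            self.paragraphs[-1].append(data)

    def text(self):
        # Normalise whitespace inside each paragraph and drop empty ones.
        joined = (" ".join("".join(p).split()) for p in self.paragraphs)
        return "\n\n".join(p for p in joined if p)


def extract_span_text(html, target_class="displaytext"):
    parser = SpanTextExtractor(target_class)
    parser.feed(html)
    parser.close()
    return parser.text()


# Shortened, hypothetical stand-in for the page described in the question.
sample = (
    '<div><span class="displaytext">Thank you very much. '
    '<p>We last met in an hour of shock and suffering. '
    '<p>The American flag flies again over our Embassy in Kabul. '
    '</span><p>Footer text outside the span.</div>'
)

print(extract_span_text(sample))
```

The sketch walks every text node under the span instead of reading a single string from it. With BeautifulSoup itself, the analogous step would be to gather all text nodes under `thespan`, for example with `thespan.findAll(text=True)` in version 3 or `get_text()` in the newer bs4 package.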