
Python text parsing between two words
<p>I'm using BeautifulSoup and want to extract all text between two words on a webpage.</p>
<p>For example, imagine the following website text:</p>
<pre><code>This is the text of the webpage. It is just a string of a bunch of stuff and maybe some tags in between.
</code></pre>
<p>I want to pull out everything on the page that starts with <code>text</code> and ends with <code>bunch</code>.</p>
<p>In this case I'd want only:</p>
<pre><code>text of the webpage. It is just a string of a bunch
</code></pre>
<p>However, there could be multiple instances of this on a page.</p>
<p>What is the best way to do this?</p>
<p>This is my current setup:</p>
<pre><code>#!/usr/bin/env python
import re

from mechanize import Browser
from BeautifulSoup import BeautifulSoup

mech = Browser()
urls = [
    "http://ca.news.yahoo.com/forget-phoning-business-app-sends-text-instead-100143774--sector.html"
]


def visible(element):
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        # If the parent of the element is any of those, ignore it
        return False
    elif re.match('&lt;!--.*--&gt;', str(element)):
        # If the element is an HTML comment, ignore it
        return False
    else:
        # Otherwise, return True as these are the elements we need
        return True


for url in urls:
    page = mech.open(url)
    html = page.read()
    soup = BeautifulSoup(html)
    text = soup.prettify()
    texts = soup.findAll(text=True)

    # filter() keeps only the items of texts for which visible() returns True.
    # We use those to build our final list.
    visible_texts = filter(visible, texts)

    for line in visible_texts:
        print line
</code></pre>
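One possible approach, offered as a minimal sketch rather than a definitive answer: once the visible text has been joined into a single string (for instance with `" ".join(visible_texts)`), a non-greedy regular expression can collect every span that starts at one word and ends at the next occurrence of the other. The helper name `between_words` and the `sample` string are hypothetical, introduced only for illustration. The sketch assumes the two words should match as whole words and that matches should not overlap.

```python
import re


def between_words(text, start, end):
    """Return each shortest span that begins with `start` and ends with `end`.

    Hypothetical helper: matches whole words only and never overlaps matches.
    """
    pattern = re.compile(
        r"\b" + re.escape(start) + r"\b.*?\b" + re.escape(end) + r"\b",
        re.DOTALL,  # let '.' cross line breaks, since page text spans many lines
    )
    return pattern.findall(text)


# Sample text taken from the question above.
sample = ("This is the text of the webpage. It is just a string of "
          "a bunch of stuff and maybe some tags in between.")

matches = between_words(sample, "text", "bunch")
print(matches)
```

Because `findall` returns every non-overlapping match, a page with several such spans would yield one list entry per span. The `\b` anchors keep `text` from matching inside a longer word such as `context`; drop them if substring matches are wanted.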