Extracting parts of a webpage with python
<p>So I have a data retrieval/entry project and I want to extract a certain part of a webpage and store it in a text file. I have a text file of URLs and the program is supposed to extract the same part of the page for each URL.</p>
<p>Specifically, the program copies the legal statute following "Legal Authority:" on pages such as <a href="http://www.reginfo.gov/public/do/eAgendaViewRule?pubId=200904&amp;RIN=0648-AW10" rel="nofollow">this</a>. As you can see, there is only one statute listed. However, some of the URLs also look like <a href="http://www.reginfo.gov/public/do/eAgendaViewRule?pubId=200210&amp;RIN=1205-AB16" rel="nofollow">this</a>, meaning that there are multiple separated statutes.</p>
<p>My code works for pages of the first kind:</p>
<pre><code>from sys import argv
from urllib2 import urlopen

script, urlfile, legalfile = argv
input = open(urlfile, "r")
output = open(legalfile, "w")

def get_legal(page):
    # this is where Legal Authority: starts in the code
    start_link = page.find('Legal Authority:')
    start_legal = page.find('"&gt;', start_link+1)
    end_link = page.find('&lt;', start_legal+1)
    legal = page[start_legal+2: end_link]
    return legal

for line in input:
    pg = urlopen(line).read()
    statute = get_legal(pg)
    output.write(get_legal(pg))
</code></pre>
<p>This gives me the desired statute name in the "legalfile" output .txt. However, it cannot copy multiple statute names. I've tried something like this:</p>
<pre><code>def get_legal(page):
    # this is where Legal Authority: starts in the code
    end_link = ""
    legal = ""
    start_link = page.find('Legal Authority:')
    while (end_link != '&lt;/a&gt;&amp;nbsp;'):
        start_legal = page.find('"&gt;', start_link+1)
        end_link = page.find('&lt;', start_legal+1)
        end2 = page.find('&lt;/a&gt;&amp;nbsp;', end_link+1)
        legal += page[start_legal+2: end_link]
        if break
    return legal
</code></pre>
<p>Since every list of statutes ends with <code>'&lt;/a&gt;&amp;nbsp;'</code> (inspect the source of either of the two links) I thought I could use that fact (having it as the end of the index) to loop through and collect all the statutes in one string. Any ideas?</p>
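One possible way to act on the end-marker idea, as a rough sketch rather than a tested solution for reginfo.gov: slice the page from "Legal Authority:" up to the first `</a>&nbsp;`, then pull the text of every link inside that slice. The name `get_legal_all` and the `sample` markup are hypothetical, made up for illustration; the sketch assumes each statute is the text of an `<a>` element and that only the last one is followed directly by `&nbsp;`, which should be checked against the real page source. It is written for Python 3, whereas the question's code uses Python 2's `urllib2`.

```python
import re

# Marker that, per the question, closes every list of statutes.
END_MARKER = '</a>&nbsp;'


def get_legal_all(page):
    """Return the text of every link between 'Legal Authority:' and END_MARKER."""
    start = page.find('Legal Authority:')
    if start == -1:
        return []  # label not present on this page
    end = page.find(END_MARKER, start)
    if end == -1:
        return []  # end marker not found; assumption about the markup failed
    # Keep the final '</a>' in the slice so the last link still matches.
    section = page[start:end + len('</a>')]
    # Capture the text of each <a ...>text</a> element in the slice.
    return [text.strip() for text in re.findall(r'<a\b[^>]*>([^<]*)</a>', section)]


# Hypothetical markup, invented to mimic the structure described in the question.
sample = ('Legal Authority:</td><td>'
          '<a href="x">29 USC 49k</a>; <a href="y">29 USC 2864</a>&nbsp;</td>')
print(get_legal_all(sample))  # ['29 USC 49k', '29 USC 2864']
```

Because the result is a list, the statutes for one URL can be joined with a separator of your choice (for example `"; ".join(...)`) before being written to the output file. An HTML parser would be more robust than string searching if the markup varies between pages.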