StackOverflow2013

plurals

POExtracting parts of a webpage with python
primarykey
Id
11480284
data
AcceptedAnswerId
0
AnswerCount
2
ClosedDate
CommentCount
3
CommunityOwnedDate
CreationDate
2012-07-14T01:31:58.943
FavoriteCount
3
LastActivityDate
2012-07-20T21:15:34.733
LastEditDate
LastEditorUserId
0
OwnerUserId
1278030
ParentId
0
PostTypeId
1
Score
0
ViewCount
821
LastEditorDisplayName
text
Body
I have a data retrieval/entry project in which I want to extract a certain part of a webpage and store it in a text file. I have a text file of urls, and the program is supposed to extract the same part of the page for each url. Specifically, the program copies the legal statute following "Legal Authority:" on pages such as <a href="http://www.reginfo.gov/public/do/eAgendaViewRule?pubId=200904&RIN=0648-AW10" rel="nofollow">this</a>. As you can see, there is only one statute listed. However, some of the urls look like <a href="http://www.reginfo.gov/public/do/eAgendaViewRule?pubId=200210&RIN=1205-AB16" rel="nofollow">this</a>, meaning that there are multiple, separate statutes. My code works for pages of the first kind: <pre><code>from sys import argv
from urllib2 import urlopen

script, urlfile, legalfile = argv
input = open(urlfile, "r")
output = open(legalfile, "w")

def get_legal(page):
    # this is where Legal Authority: starts in the code
    start_link = page.find('Legal Authority:')
    start_legal = page.find('">', start_link+1)
    end_link = page.find('<', start_legal+1)
    legal = page[start_legal+2: end_link]
    return legal

for line in input:
    pg = urlopen(line).read()
    statute = get_legal(pg)
    output.write(get_legal(pg))
</code></pre> This gives me the desired statute name in the "legalfile" output .txt. However, it cannot copy multiple statute names.
I've tried something like this: <pre><code>def get_legal(page):
    # this is where Legal Authority: starts in the code
    end_link = ""
    legal = ""
    start_link = page.find('Legal Authority:')
    while (end_link != '</a>&nbsp;'):
        start_legal = page.find('">', start_link+1)
        end_link = page.find('<', start_legal+1)
        end2 = page.find('</a>&nbsp;', end_link+1)
        legal += page[start_legal+2: end_link]
        if break
    return legal
</code></pre> Since every list of statutes ends with <code>'</a>&nbsp;'</code> (inspect the source of either of the two links), I thought I could use that fact (having it as the end of the index) to loop through and collect all the statutes in one string. Any ideas?
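For illustration, here is a minimal sketch of the idea described above: cut the page down to the stretch between "Legal Authority:" and the first `</a>&nbsp;` after it, then collect the text of every link in that stretch. It is written for Python 3 and works on a page string that has already been downloaded. The assumptions that each statute is the text of an `<a>` element and that `</a>&nbsp;` appears only after the last statute are taken from the description above and have not been checked against the live pages. The name `get_all_legal` is made up for this example.

```python
import re

END_MARKER = '</a>&nbsp;'  # assumed to follow only the last statute in the list

def get_all_legal(page):
    """Return the text of every link between 'Legal Authority:' and END_MARKER."""
    start = page.find('Legal Authority:')
    if start == -1:
        return []
    end = page.find(END_MARKER, start)
    if end == -1:
        return []
    # Keep the final closing tag so the last link is still a complete element.
    section = page[start:end + len('</a>')]
    # Assumption: each statute is the text content of one <a ...>...</a> element.
    return [text.strip() for text in re.findall(r'<a[^>]*>(.*?)</a>', section, re.S)]
```

The function returns a list, so a page with one statute and a page with several are handled the same way, and the caller can write the result out with something like `output.write('; '.join(get_all_legal(pg)) + '\n')`.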
Tags
<python>
Title
Extracting parts of a webpage with python
singulars
PostAcceptedAnswerId
1. This table or related slice is empty.
PostParentId
1. This table or related slice is empty.
PostTypePostTypeId
1. PTQuestion
UserLastEditorUserId
1. This table or related slice is empty.
UserOwnerUserId
1. USEmir
plurals
PostLinksPostIdRelatedPostId
1. PL
 singulars
 LinkTypeLinkTypeId
 LTLinked
PostLinksRelatedPostIdPostId
1. PL
 singulars
 LinkTypeLinkTypeId
 LTLinked
PostsAcceptedAnswerId
1. This table or related slice is empty.
PostsParentIdCreationDate
1. PO
 singulars
 PostTypePostTypeId
 PTAnswer
2. PO
 singulars
 PostTypePostTypeId
 PTAnswer
VotesPostIdCreationDate
CommentsPostId
