StackOverflow2013


plurals

POScraping Multiple html files to CSV
primarykey
Id
923318
data
AcceptedAnswerId
923373
AnswerCount
2
ClosedDate
CommentCount
0
CommunityOwnedDate
CreationDate
2009-05-28T21:34:57.490
FavoriteCount
3
LastActivityDate
2009-05-29T01:05:29.237
LastEditDate
LastEditorUserId
0
OwnerUserId
113989
ParentId
0
PostTypeId
1
Score
2
ViewCount
2282
LastEditorDisplayName
text
Body
I am trying to scrape rows from over 1200 .htm files that are on my hard drive. On my computer they live at paths like 'file:///home/phi/Data/NHL/pl07-08/PL020001.HTM'. The files are numbered sequentially from *20001.htm to *21230.htm. My plan is to eventually load the data into MySQL or SQLite, either via a spreadsheet app or directly, if I can get a clean .csv file out of this process.
This is my first attempt at code (Python) and at scraping, and I just installed Ubuntu 9.04 on my crappy Pentium IV. Needless to say, I am a newbie and have hit some roadblocks.
How do I get mechanize to go through all the files in the directory in order? Can mechanize even do this? Can mechanize/Python/BeautifulSoup read a 'file:///' style URL, or is there another way to point it to /home/phi/Data/NHL/pl07-08/PL020001.HTM? Is it smart to do this in 100 or 250 file increments, or should I just send all 1230?
I just need rows that start with "<code><tr class="evenColor"></code>" and end with "<code></tr></code>". Ideally I only want the rows that contain "SHOT"|"MISS"|"GOAL", but I want the whole row (every column). Note that "GOAL" is in bold, so do I have to specify this? There are 3 tables per .htm file.
I would also like the name of the parent file (pl020001.htm) to be included in the rows I scrape, so I can identify them in their own column in the final database. I don't even know where to begin with that. This is what I have so far:
<pre><code>#/usr/bin/python
from BeautifulSoup import BeautifulSoup
import re
from mechanize import Browser

mech = Browser()
url = "file:///home/phi/Data/NHL/pl07-08/PL020001.HTM"
##but how do I do multiple urls/files? PL02*.HTM?
page = mech.open(url)

html = page.read()
soup = BeautifulSoup(html)
##this confuses me and seems redundant
pl = open("input_file.html","r")
chances = open("chancesforsql.csv,"w")

table = soup.find("table", border=0)
for row in table.findAll 'tr class="evenColor"' #should I do this instead of before?
outfile = open("shooting.csv", "w")

##how do I end it?
</code></pre>
Should I be using IDLE or something like it, or just Terminal in Ubuntu 9.04?
Tags
<python><sqlite><screen-scraping><beautifulsoup><mechanize>
Title
Scraping Multiple html files to CSV
singulars
PostAcceptedAnswerId
1. PO
 singulars
 PostTypePostTypeId
 PTAnswer
PostParentId
1. This table or related slice is empty.
PostTypePostTypeId
1. PTQuestion
UserLastEditorUserId
1. This table or related slice is empty.
UserOwnerUserId
1. USnorthnodewolf
plurals
PostLinksPostIdRelatedPostId
1. This table or related slice is empty.
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. This table or related slice is empty.
PostsParentIdCreationDate
1. PO
 singulars
 PostTypePostTypeId
 PTAnswer
2. PO
 singulars
 PostTypePostTypeId
 PTAnswer
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 POScraping Multiple html files to CSV
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
2. VO
 singulars
 PostPostId
 POScraping Multiple html files to CSV
 UserUserId
 USIan Vaughan
 VoteTypeVoteTypeId
 VTFavorite
3. VO
 singulars
 PostPostId
 POScraping Multiple html files to CSV
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
CommentsPostId
1. This table or related slice is empty.
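
The Body field above asks how to walk a directory of numbered .HTM files in order, keep only the <tr class="evenColor"> rows that mention SHOT, MISS, or GOAL, and tag each row with its source file name. One possible approach is sketched below; it is not the accepted answer. It is written for Python 3 and swaps in the standard library's glob, html.parser, and csv modules in place of mechanize and BeautifulSoup, so no browser emulation is needed for local files. The names KEYWORDS, EvenColorRows, scrape_file, and scrape_directory are hypothetical. Three things are assumptions rather than facts from the post: that the event type sits alone in its own cell, that matching rows contain no nested tables, and that the files can be read as Latin-1.

```python
import csv
import glob
import os
from html.parser import HTMLParser

# Event types named in the question (assumed to fill a cell on their own).
KEYWORDS = {"SHOT", "MISS", "GOAL"}


class EvenColorRows(HTMLParser):
    """Collect the cell text of every <tr class="evenColor"> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            classes = (dict(attrs).get("class") or "").split()
            self._row = [] if "evenColor" in classes else None
            self._cell = None
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text inside <b>...</b> arrives here too, so bold needs no special case.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace inside the cell to single spaces.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_file(path):
    """Return the matching rows of one file, each prefixed with the file name."""
    parser = EvenColorRows()
    # Latin-1 is an assumption; adjust if the files use another encoding.
    with open(path, encoding="latin-1") as handle:
        parser.feed(handle.read())
    parser.close()
    name = os.path.basename(path)
    return [[name] + row for row in parser.rows if KEYWORDS.intersection(row)]


def scrape_directory(directory, out_path, pattern="PL02*.HTM"):
    """Write the matching rows of every file in name order to one CSV file."""
    # Zero-padded names sort into the same order as their numbers.
    paths = sorted(glob.glob(os.path.join(directory, pattern)))
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        for path in paths:
            writer.writerows(scrape_file(path))
    return len(paths)
```

With the paths from the question, a call might look like scrape_directory("/home/phi/Data/NHL/pl07-08", "shooting.csv"). Processing all files in one pass keeps only one file's rows in memory at a time, so batching into groups of 100 or 250 should not be necessary under these assumptions.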
