Scraping Multiple html files to CSV
    <p>I am trying to scrape rows from over 1200 .htm files that are on my hard drive. On my computer they are here: 'file:///home/phi/Data/NHL/pl07-08/PL020001.HTM'. These .htm files are numbered sequentially from *20001.htm to *21230.htm. My plan is to eventually load the data into MySQL or SQLite, either via a spreadsheet app or directly, if I can get a clean .csv file out of this process.</p>
    <p>This is my first attempt at code (Python) and at scraping, and I just installed Ubuntu 9.04 on my crappy Pentium IV. Needless to say, I am a newb and have hit some roadblocks.</p>
    <p>How do I get mechanize to go through all the files in the directory in order? Can mechanize even do this? Can mechanize/Python/BeautifulSoup read a 'file:///' style URL, or is there another way to point it to /home/phi/Data/NHL/pl07-08/PL020001.HTM? Is it smart to do this in 100 or 250 file increments, or should I just send all 1230?</p>
    <p>I just need rows that start with "<code>&lt;tr class="evenColor"&gt;</code>" and end with "<code>&lt;/tr&gt;</code>". Ideally I only want the rows that contain "SHOT"|"MISS"|"GOAL", but I want the whole row (every column). Note that "<strong>GOAL</strong>" is in bold, so do I have to specify this? There are 3 tables per .htm file.</p>
    <p>I would also like the name of the parent file (pl020001.htm) <strong>to be included in the rows I scrape so I can identify them in their own column in the final database</strong>. I don't even know where to begin for that. This is what I have so far:</p>
    <pre><code>#!/usr/bin/python
from BeautifulSoup import BeautifulSoup
import re
from mechanize import Browser

mech = Browser()
url = "file:///home/phi/Data/NHL/pl07-08/PL020001.HTM" ##but how do I do multiple urls/files? PL02*.HTM?
page = mech.open(url)
html = page.read()
soup = BeautifulSoup(html) ##this confuses me and seems redundant
pl = open("input_file.html","r")
chances = open("chancesforsql.csv","w")
table = soup.find("table", border=0)
for row in table.findAll 'tr class="evenColor"' #should I do this instead of before?
outfile = open("shooting.csv", "w") ##how do I end it?
</code></pre>
    <p>Should I be using IDLE or something like it, or just the Terminal in Ubuntu 9.04?</p>
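    <p>One possible way to approach this, as a minimal sketch rather than a definitive solution: because the files are already on disk, they can be opened directly with no browser layer, so the sketch below swaps mechanize and BeautifulSoup for the Python 3 standard library (<code>glob</code>, <code>html.parser</code>, <code>csv</code>). The names <code>EvenRowParser</code>, <code>shot_rows</code> and <code>scrape_directory</code> are made up for illustration. It assumes that the wanted rows contain no nested tables, that the event keyword (SHOT, MISS or GOAL) fills a cell of its own, and that the files decode as latin-1. Bold markup such as <code>&lt;b&gt;GOAL&lt;/b&gt;</code> needs no special handling because only the text inside each cell is kept.</p>

```python
import csv
import glob
import os
from html.parser import HTMLParser

# Assumption: the event type sits alone in one cell of the row.
KEYWORDS = {"SHOT", "MISS", "GOAL"}


class EvenRowParser(HTMLParser):
    """Collect the cell text of every <tr class="evenColor"> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            classes = (dict(attrs).get("class") or "").split()
            self._row = [] if "evenColor" in classes else None
            self._cell = None
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def shot_rows(html_text):
    """Return the evenColor rows that have a SHOT, MISS or GOAL cell."""
    parser = EvenRowParser()
    parser.feed(html_text)
    parser.close()
    return [row for row in parser.rows if KEYWORDS.intersection(row)]


def scrape_directory(pattern, out_path):
    """Write matching rows from every file matching pattern to one CSV.

    The source file name is written as the first column of each row.
    """
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        # sorted() visits the sequentially numbered files in order.
        for path in sorted(glob.glob(pattern)):
            # Assumption: latin-1, which never raises a decode error.
            with open(path, encoding="latin-1") as f:
                rows = shot_rows(f.read())
            name = os.path.basename(path)
            for row in rows:
                writer.writerow([name] + row)
```

    <p>Under those assumptions, a call such as <code>scrape_directory("/home/phi/Data/NHL/pl07-08/PL02*.HTM", "shooting.csv")</code> would process all the files in one pass. Each file is read and released before the next one is opened, so splitting the run into batches of 100 or 250 should not be necessary. The script can be run from the Terminal or from IDLE.</p>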