Why won't my data save to xls?

I have written a pretty simple web scraper using Scrapy. I would like to save the scraped data to an .xls file, as I have an existing module to read an .xls and sort the scraped data. But I've hit what feels like a silly stumbling block: actually saving the .xls.

- The spider itself works (it crawls and scrapes the required data).
- The .xls is being created and initialised correctly.
- The scraped data is written to the .xls after scraping each item.

However, wherever I put the save statement, it seems to get saved **before** the actual web scraping begins, leaving me with an initialised (first row filled out with titles) but otherwise empty spreadsheet. Here is what I have (website removed to save an innocent server):

```python
# encoding=utf-8
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field
from xlwt import Workbook

# Working row on new spreadsheet
row_number = 0

# Create new spreadsheet
newDb = Workbook(encoding='utf-8')
newFile = newDb.add_sheet('Sheet1')
values = ['product', 'description', 'image']


class TestSpider(CrawlSpider):
    # Initiate new spreadsheet
    global newFile
    global values
    global row_number
    for cell in range(len(values)):
        newFile.write(row_number, cell, values[cell])
    row_number = row_number + 1

    # Initiate Spider
    name = "Test"
    allowed_domains = []
    start_urls = ["http://www.website.to/scrape",]
    rules = (Rule(SgmlLinkExtractor(restrict_xpaths="//div[@class='content']/h3"),
                  callback='parse_product'),)

    def parse_product(self, response):
        hxs = HtmlXPathSelector(response)
        item = TestItem()
        item['product'] = hxs.select('//div[@class="col-right"][1]/table/tr[1]/td/text()').extract()
        item['description'] = hxs.select('//div[@class="columns"][1]/div[@class="col-right"]/p/text()').extract()
        item['image'] = hxs.select('//img/@src').extract()

        global values
        global newFile
        global row_number
        # This is where products are written to the xls
        for title in values:
            # Test to increase row_number, at the start of each new product
            if title == "product":
                row_number = row_number + 1
            try:
                newFile.write(row_number, values.index(title), item[title])
            except:
                newFile.write(row_number, values.index(title), '')


class TestItem(Item):
    product = Field()
    description = Field()
    image = Field()
```

I believe I'm correct in saying I just need to add

```python
global newDb
newDb.save('./products_out.xls')
```

in the correct place, but no matter where I add this, print statements indicate the order of operations is always: create xls -> initialise xls -> save xls -> scrape and write to xls -> close without saving.

I'm pretty new to development, and I'm at a loss on this; any advice would be gratefully received.
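
One idea I've been wondering about: the Scrapy docs mention a `closed(reason)` method on spiders, which is called once the spider finishes, so it should run after all the scraping and writing is done. Would something like this be the right place for the save? (A minimal, untested sketch reusing my save call from above; the rest of the spider is unchanged.)

```python
from scrapy.contrib.spiders import CrawlSpider
from xlwt import Workbook

newDb = Workbook(encoding='utf-8')


class TestSpider(CrawlSpider):
    name = "Test"
    # ... rules, parse_product and the spreadsheet writing as above ...

    def closed(self, reason):
        # Scrapy calls this after the crawl finishes ('reason' is a
        # string such as 'finished'), so by this point every scraped
        # item should already have been written to the sheet.
        newDb.save('./products_out.xls')
```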