Why won't my data save to xls?

I have written a pretty simple web scraper using Scrapy. I would like to save the scraped data to an .xls file, as I have an existing module to read an .xls and sort the scraped data. But I've hit what feels like a silly stumbling block: actually saving the .xls.

- The spider itself works (it crawls and scrapes the required data).
- The .xls is being created and initialised correctly.
- The scraped data is written to the .xls after scraping each item.

However, wherever I put the save statement, it seems to get saved **before** the actual web scraping begins, leaving me with an initialised (first row filled out with titles) but otherwise empty spreadsheet. Here is what I have (website removed to save an innocent server):

```python
# encoding=utf-8
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field
from xlwt import Workbook

# Working row on new spreadsheet
row_number = 0

# Create new spreadsheet
newDb = Workbook(encoding='utf-8')
newFile = newDb.add_sheet('Sheet1')
values = ['product', 'description', 'image']


class TestSpider(CrawlSpider):
    # Initiate new spreadsheet
    global newFile
    global values
    global row_number
    for cell in range(len(values)):
        newFile.write(row_number, cell, values[cell])
    row_number = row_number + 1

    # Initiate Spider
    name = "Test"
    allowed_domains = []
    start_urls = ["http://www.website.to/scrape",]
    rules = (Rule(SgmlLinkExtractor(restrict_xpaths="//div[@class='content']/h3"),
                  callback='parse_product'),)

    def parse_product(self, response):
        hxs = HtmlXPathSelector(response)
        item = TestItem()
        item['product'] = hxs.select('//div[@class="col-right"][1]/table/tr[1]/td/text()').extract()
        item['description'] = hxs.select('//div[@class="columns"][1]/div[@class="col-right"]/p/text()').extract()
        item['image'] = hxs.select('//img/@src').extract()

        global values
        global newFile
        global row_number
        # This is where products are written to the xls
        for title in values:
            # Test to increase row_number, at the start of each new product
            if title == "product":
                row_number = row_number + 1
            try:
                newFile.write(row_number, values.index(title), item[title])
            except:
                newFile.write(row_number, values.index(title), '')


class TestItem(Item):
    product = Field()
    description = Field()
    image = Field()
```

I believe I'm correct in saying I just need to add

```python
global newDb
newDb.save('./products_out.xls')
```

in the correct place, but no matter where I add this, print statements indicate the order of operations is always: create xls -> initialise xls -> save xls -> scrape and write to xls -> close without saving.

I'm pretty new to development, and I'm at a loss on this; any advice would be gratefully received.
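
One idea I've been wondering about: the Scrapy docs mention a `closed(reason)` method on spiders, which is called once the spider finishes, so it should run after all the scraping and writing is done. Would something like this be the right place for the save? (A minimal, untested sketch reusing my save call from above; the rest of the spider is unchanged.)

```python
from scrapy.contrib.spiders import CrawlSpider
from xlwt import Workbook

newDb = Workbook(encoding='utf-8')


class TestSpider(CrawlSpider):
    name = "Test"
    # ... rules, parse_product and the spreadsheet writing as above ...

    def closed(self, reason):
        # Scrapy calls this after the crawl finishes ('reason' is a
        # string such as 'finished'), so by this point every scraped
        # item should already have been written to the sheet.
        newDb.save('./products_out.xls')
```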