How to overwrite / use cookies in Scrapy
<p>I want to scrape <a href="http://www.3andena.com/" rel="noreferrer">http://www.3andena.com/</a>. This website starts in Arabic first and stores the language setting in cookies. If you try to access the language version directly through the URL (<a href="http://www.3andena.com/home.php?sl=en" rel="noreferrer">http://www.3andena.com/home.php?sl=en</a>), it causes a problem and returns a server error.</p>
<p>So I want to set the cookie value "store_language" to "en", then start scraping the website using this cookie value.</p>
<p>I'm using CrawlSpider with a couple of Rules.</p>
<p>Here's the code:</p>
<pre><code>from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy import log
from bkam.items import Product
from scrapy.http import Request
import re


class AndenaSpider(CrawlSpider):
    name = "andena"
    domain_name = "3andena.com"
    start_urls = ["http://www.3andena.com/Kettles/?objects_per_page=10"]

    product_urls = []

    rules = (
        # The following rule is for pagination
        Rule(SgmlLinkExtractor(allow=(r'\?page=\d+$'),), follow=True),
        # The following rule is for product details
        Rule(SgmlLinkExtractor(restrict_xpaths=('//div[contains(@class, "products-dialog")]//table//tr[contains(@class, "product-name-row")]/td'), unique=True), callback='parse_product', follow=True),
    )

    def start_requests(self):
        yield Request('http://3andena.com/home.php?sl=en', cookies={'store_language':'en'})

        for url in self.start_urls:
            yield Request(url, callback=self.parse_category)

    def parse_category(self, response):
        hxs = HtmlXPathSelector(response)

        self.product_urls.extend(hxs.select('//td[contains(@class, "product-cell")]/a/@href').extract())

        for product in self.product_urls:
            yield Request(product, callback=self.parse_product)

    def parse_product(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        item = Product()

        '''
        some parsing
        '''

        items.append(item)
        return items

SPIDER = AndenaSpider()
</code></pre>
<p>Here's the log:</p>
<pre><code>2012-05-30 19:27:13+0000 [andena] DEBUG: Redirecting (301) to &lt;GET http://www.3andena.com/home.php?sl=en&amp;xid_479d9=97656c0c5837f87b8c479be7c6621098&gt; from &lt;GET http://3andena.com/home.php?sl=en&gt;
2012-05-30 19:27:14+0000 [andena] DEBUG: Redirecting (302) to &lt;GET http://www.3andena.com/home.php?sl=en&amp;xid_479d9=97656c0c5837f87b8c479be7c6621098&gt; from &lt;GET http://www.3andena.com/home.php?sl=en&amp;xid_479d9=97656c0c5837f87b8c479be7c6621098&gt;
2012-05-30 19:27:14+0000 [andena] DEBUG: Crawled (200) &lt;GET http://www.3andena.com/Kettles/?objects_per_page=10&gt; (referer: None)
2012-05-30 19:27:15+0000 [andena] DEBUG: Crawled (200) &lt;GET http://www.3andena.com/B-and-D-Concealed-coil-pan-kettle-JC-62.html&gt; (referer: http://www.3andena.com/Kettles/?objects_per_page=10)
</code></pre>
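<p>One direction that might be worth trying, offered as a sketch rather than a confirmed fix: attach the <code>store_language</code> cookie to every start request instead of sending it once to <code>home.php?sl=en</code>. The helper name <code>start_request_kwargs</code> is hypothetical; the dictionaries it returns use only the <code>url</code> and <code>cookies</code> keyword arguments that <code>scrapy.http.Request</code> accepts. Whether the site honours a cookie supplied by the client is an assumption that would have to be checked against the live server.</p>

```python
from typing import Dict, Iterable, List

# Cookie name taken from the question; the value "en" is the desired language.
LANGUAGE_COOKIE = "store_language"


def start_request_kwargs(
    start_urls: Iterable[str], language: str = "en"
) -> List[Dict[str, object]]:
    """Build keyword arguments for one Request per start URL.

    Hypothetical helper: each dict carries the URL together with the
    language cookie, so the cookie is sent with the very first request
    to every start URL rather than with a separate warm-up request.
    """
    return [
        {"url": url, "cookies": {LANGUAGE_COOKIE: language}}
        for url in start_urls
    ]


if __name__ == "__main__":
    urls = ["http://www.3andena.com/Kettles/?objects_per_page=10"]
    for kwargs in start_request_kwargs(urls):
        print(kwargs)
```

<p>Inside the spider, <code>start_requests</code> could then yield <code>Request(callback=self.parse_category, **kwargs)</code> for each entry. With Scrapy's cookies middleware enabled, cookies passed this way should be stored in the session's cookie jar and sent again on follow-up requests to the same domain.</p>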