<p>As you have guessed, this website uses JavaScript to load more items when you scroll the page.</p>
<p>Using the developer tools included in my browser (Ctrl+Shift+I in Chromium), I saw in the Network tab that the script included in the page performs the following requests to load more items:</p>
<pre><code>GET http://www.website-your-are-crawling.com/men/shoes/?page=2 # 2,3,4,5,6 etc...
</code></pre>
<p>The web server responds with documents of the following type:</p>
<pre><code>&lt;li id="PH969SH70HPTINDFAS" class="itm hasOverlay unit size1of4 "&gt;
  &lt;div id="qa-quick-view-btn" class="quickviewZoom itm-quickview ui-buttonQuickview l-absolute pos-t" title="Quick View" data-url ="phosphorus-Black-Moccasins-233629.html" data-sku="PH969SH70HPTINDFAS" onClick="_gaq.push(['_trackEvent', 'BadgeQV','Shown','OFFER INSIDE']);"&gt;Quick view&lt;/div&gt;
  &lt;div class="itm-qlInsert tooltip-qlist highlightStar" onclick="javascript:Rocket.QuickList.insert('PH969SH70HPTINDFAS', 'catalog'); return false;" &gt;
    &lt;div class="starHrMsg"&gt;
      &lt;span class="starHrMsgArrow"&gt;&amp;nbsp;&lt;/span&gt; Save for later
    &lt;/div&gt;
  &lt;/div&gt;
  &lt;a id='cat_105_PH969SH70HPTINDFAS' class="itm-link sobrTxt" href="/phosphorus-Black-Moccasins-233629.html" onclick="fireGaq('_trackEvent', 'Catalog to PDP', 'men--Shoes--Moccasins', 'PH969SH70HPTINDFAS--1699.00--', this),fireGaq('_trackEvent', 'BadgePDP','Shown','OFFER INSIDE', this);"&gt;
    &lt;span class="lazyImage"&gt;
      &lt;span style="width:176px;height:255px;" class="itm-imageWrapper itm-imageWrapper-PH969SH70HPTINDFAS" id="http://static4.jassets.com/p/Phosphorus-Black-Moccasins-6668-926332-1-catalog.jpg" itm-img-width="176" itm-img-height="255" itm-img-sprites="4"&gt;
        &lt;noscript&gt;&lt;img src="http://static4.jassets.com/p/Phosphorus-Black-Moccasins-6668-926332-1-catalog.jpg" width="176" height="255" class="itm-img"&gt;&lt;/noscript&gt;
      &lt;/span&gt;
    &lt;/span&gt;
    &lt;span class="itm-budgeFlag offInside"&gt;&lt;span class="flagBrdLeft"&gt;&lt;/span&gt;OFFER INSIDE&lt;/span&gt;
    &lt;span class="itm-Catbrand strong"&gt;Phosphorus&lt;/span&gt;
    &lt;span class="itm-title"&gt; Black Moccasins &lt;/span&gt;
</code></pre>
<p>These documents contain more items.</p>
<p>So, to get the full list of items, you will have to return <code>Request</code> objects from the <code>parse</code> method of your Spider (see the <a href="http://doc.scrapy.org/en/0.16/topics/spiders.html#scrapy.spider.BaseSpider.parse" rel="nofollow">Spider class documentation</a>) to tell Scrapy that it should load more data:</p>
<pre><code>import re

from scrapy.http import Request


def parse(self, response):
    # ... Extract the items in the page using extractors

    # n is the number of the next "page" to parse. You can get n from
    # response.url by extracting the number at the end and adding 1.
    # (Assumption: a URL without a page number is the first page.)
    match = re.search(r"page=(\d+)$", response.url)
    n = int(match.group(1)) + 1 if match else 2

    # It is VERY IMPORTANT to set the Referer and X-Requested-With headers
    # here because that's how the website detects whether the request was
    # made by JavaScript or directly by following a link.
    req = Request(url="http://www.website-your-are-crawling.com/men/shoes/?page=" + str(n),
                  headers={"Referer": "http://www.website-your-are-crawling.com/men/shoes/",
                           "X-Requested-With": "XMLHttpRequest"})

    yield req  # ... and yield your items too
</code></pre>
<p>Oh, and by the way (in case you want to test), you can't just load <code>http://www.website-your-are-crawling.com/men/shoes/?page=2</code> in your browser to see what it returns, because the website will redirect you to the global page (i.e. <code>http://www.website-your-are-crawling.com/men/shoes/</code>) if the <code>X-Requested-With</code> header is different from <code>XMLHttpRequest</code>.</p>
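<p>The "extract the number at the end and add 1" step can also be done without string slicing. The following is only a sketch using the standard library: <code>next_page_url</code> is a hypothetical helper name, and treating a URL with no <code>page</code> parameter as page 1 is an assumption about this site, not something the server's behaviour confirms.</p>

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def next_page_url(current_url: str) -> str:
    """Return current_url with its ``page`` query parameter incremented by one.

    Assumption: a URL without a ``page`` parameter is the first page.
    """
    parts = urlsplit(current_url)
    query = parse_qs(parts.query)

    # parse_qs maps each parameter name to a list of values; take the first.
    current_page = int(query.get("page", ["1"])[0])
    query["page"] = [str(current_page + 1)]

    # Rebuild the URL, leaving scheme, host and path untouched.
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))
```

<p>Inside <code>parse</code>, the result of <code>next_page_url(response.url)</code> could be passed as the <code>url</code> of the next <code>Request</code>, together with the same <code>Referer</code> and <code>X-Requested-With</code> headers.</p>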