URL structure causing incomplete page to be returned by PHP's file_get_contents()
<p>I've been doing some scraping with PHP and getting some strange results on a particular domain. For example, when I download this page:</p>
<p><a href="http://pitchfork.com/reviews/tracks/" rel="nofollow noreferrer">http://pitchfork.com/reviews/tracks/</a></p>
<p>It works fine. However, if I try to download this page:</p>
<p><a href="http://pitchfork.com/reviews/tracks/1/" rel="nofollow noreferrer">http://pitchfork.com/reviews/tracks/1/</a></p>
<p>It returns an incomplete page, even though the content is exactly the same. All subsequent pages (tracks/2/, tracks/3/, etc.) also return incomplete data.</p>
<p>It seems to be a problem with the way the URLs are formed during pagination. Most other sections on the site exhibit the same behaviour (the landing page works, but subsequent pages do not). One exception is this section:</p>
<p><a href="http://pitchfork.com/forkcast/" rel="nofollow noreferrer">http://pitchfork.com/forkcast/</a></p>
<p>Here, forkcast/2/ etc. work fine. This may be because it is only one directory deep, whereas most other sections are multiple directories deep.</p>
<p>I seem to have a grasp on WHAT is causing the problem, but not WHY or HOW it can be fixed.</p>
<p>Any ideas?</p>
<p>I have tried using file_get_contents() and cURL, and both give the same result.</p>
<p>Interestingly, on all the pages that do not work, the incomplete page is roughly 16,000 characters long. Is this a clue?</p>
<p>I have created a test page where you can see the difference:</p>
<p><a href="http://fingerfy.com/test.php?url=http://pitchfork.com/reviews/tracks/" rel="nofollow noreferrer">http://fingerfy.com/test.php?url=http://pitchfork.com/reviews/tracks/</a></p>
<p><a href="http://fingerfy.com/test.php?url=http://pitchfork.com/reviews/tracks/1/" rel="nofollow noreferrer">http://fingerfy.com/test.php?url=http://pitchfork.com/reviews/tracks/1/</a></p>
<p>It prints the strlen() and content of the downloaded page (plus it makes relative URLs absolute, so that the CSS is correct).</p>
<p>Any hints would be great!</p>
<p>UPDATE: Mowser, which optimizes pages for mobile, has no trouble with these pages (<a href="http://mowser.com/web/pitchfork.com/reviews/tracks/2/" rel="nofollow noreferrer">http://mowser.com/web/pitchfork.com/reviews/tracks/2/</a>), so there must be a way to do this without it failing.</p>
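<p>For reference, a comparison like the one on the test page could be sketched roughly as follows. This is only a minimal diagnostic sketch, assuming the cURL extension is available; <code>fetch_report()</code> and <code>summarize()</code> are hypothetical helper names, and the User-Agent string and timeout are arbitrary assumed values. It reports the HTTP status and the number of bytes received for a URL, which may help show whether the short page comes with a different status or an error. It does not attempt to explain or fix the truncation.</p>

```php
<?php
// Diagnostic sketch only: fetch a URL with cURL and report what came back.
// fetch_report() and summarize() are hypothetical helpers, not part of the
// original test page.

function fetch_report(string $url): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects, if there are any
        CURLOPT_ENCODING       => '',   // accept every encoding cURL supports and decode it
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (diagnostic test)', // assumed value
        CURLOPT_TIMEOUT        => 30,   // assumed value, in seconds
    ]);

    $body = curl_exec($ch);

    return [
        'url'    => $url,
        'status' => curl_getinfo($ch, CURLINFO_HTTP_CODE),
        'length' => $body === false ? 0 : strlen($body),
        'error'  => curl_error($ch), // empty string when no error occurred
    ];
}

function summarize(array $report): string
{
    $line = sprintf(
        '%s -> HTTP %d, %d bytes',
        $report['url'],
        $report['status'],
        $report['length']
    );

    // Append the cURL error message only when there is one.
    if ($report['error'] !== '') {
        $line .= ', error: ' . $report['error'];
    }

    return $line;
}

// Example usage (performs real network requests, so it is left commented out):
// echo summarize(fetch_report('http://pitchfork.com/reviews/tracks/')), "\n";
// echo summarize(fetch_report('http://pitchfork.com/reviews/tracks/1/')), "\n";
```

<p>Running both URLs through the same function keeps the request options identical, so any difference in status or length would come from the server's response rather than from the client setup.</p>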
 
