Scraping countermeasures on nginx?
    <p>I'm writing some code that has to get some data from a website. Nothing controversial (I think) - it's for a kid's sports club, and it has to get their times from the national organisation's website. It's not proprietary or commercial data.</p> <p>The problem is that the returned data appears to be deliberately corrupted. I may just be being paranoid, but I've spent a few hours checking this. I'm using my own code, and I'm using the live headers Firefox extension to find out what to send to the site. I'm duplicating the <code>GET</code> headers exactly, except that I'm leaving out <code>Accept-Encoding</code>, since I don't want to handle gzip. I've tried <code>Connection</code> set to both <code>close</code> and <code>keep-alive</code>, but it makes no difference.</p> <p>The returned page has a few additional hex character sequences spread around it - nothing much, but it's enough to mess up my parsing. The characters and their location change every time I try this. My initial thought was that I was messing up the stitching together of the buffers I was getting back (I have to call <code>recv</code> maybe 20 times to get the entire page), but this doesn't seem to be the problem. The scraped version of the page always ends like this, for example:</p> <pre><code>&lt;/body&gt; 7 &lt;/html&gt; 0 </code></pre> <p>where the live page always ends <code>&lt;/body&gt;&lt;/html&gt;</code>.</p> <p>Any idea what's going on here? This site appears to be on Cloudflare/nginx. Is this something that nginx can do? Is it possible that they're messing up the text version of the page, and sending good data on the gzipped version? I'm not keen to start unzipping data.</p>
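The symptoms described above (short hex sequences scattered through the body, a `7` before the 7-byte `</html>`, and a final `0`) are consistent with HTTP/1.1 chunked transfer encoding rather than deliberate corruption. In that encoding, each chunk is preceded by its length in hexadecimal and the body ends with a zero-length chunk. The following is a minimal sketch of a decoder, assuming the response headers have already been stripped and the whole body has been received. The function name `decode_chunked` and the sample bytes are illustrative and do not come from the site in question.

```python
def decode_chunked(body: bytes) -> bytes:
    """Decode an HTTP/1.1 chunked message body into the raw payload.

    Assumes `body` holds the complete body (headers already removed).
    """
    out = bytearray()
    pos = 0
    while True:
        # Each chunk starts with a hex size line terminated by CRLF.
        line_end = body.index(b"\r\n", pos)
        # Ignore optional chunk extensions after ';'.
        size_field = body[pos:line_end].split(b";", 1)[0].strip()
        size = int(size_field.decode("ascii"), 16)
        pos = line_end + 2
        if size == 0:
            # A zero-length chunk marks the end of the body.
            break
        out += body[pos:pos + size]
        # Skip the chunk data plus the CRLF that follows it.
        pos += size + 2
    return bytes(out)


# Hypothetical tail of a chunked response, shaped like the output in the question.
sample = b"7\r\n</body>\r\n7\r\n</html>\r\n0\r\n\r\n"
page_tail = decode_chunked(sample)
```

If decoding chunks by hand is undesirable, sending the request as `HTTP/1.0` is another option to try, since chunked encoding is an HTTP/1.1 feature that servers should not send to HTTP/1.0 clients.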