## HTTP 403 error retrieving robots.txt with mechanize

This shell command succeeds:

```
$ curl -A "Mozilla/5.0 (X11; Linux x86_64; rv:18.0) Gecko/20100101 Firefox/18.0 (compatible;)" http://fifa-infinity.com/robots.txt
```

and prints robots.txt. Omitting the user-agent option results in a 403 error from the server. Inspecting the robots.txt file shows that content under http://www.fifa-infinity.com/board is allowed for crawling. However, the following Python code fails:

```python
import logging

import mechanize
from mechanize import Browser

ua = 'Mozilla/5.0 (X11; Linux x86_64; rv:18.0) Gecko/20100101 Firefox/18.0 (compatible;)'
br = Browser()
br.addheaders = [('User-Agent', ua)]
br.set_debug_http(True)
br.set_debug_responses(True)
logging.getLogger('mechanize').setLevel(logging.DEBUG)

br.open('http://www.fifa-infinity.com/robots.txt')
```

And the output on my console is:

```
No handlers could be found for logger "mechanize.cookies"
send: 'GET /robots.txt HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: www.fifa-infinity.com\r\nConnection: close\r\nUser-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:18.0) Gecko/20100101 Firefox/18.0 (compatible;)\r\n\r\n'
reply: 'HTTP/1.1 403 Bad Behavior\r\n'
header: Date: Wed, 13 Feb 2013 15:37:16 GMT
header: Server: Apache
header: X-Powered-By: PHP/5.2.17
header: Vary: User-Agent,Accept-Encoding
header: Connection: close
header: Transfer-Encoding: chunked
header: Content-Type: text/html
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/moshev/Projects/forumscrawler/lib/python2.7/site-packages/mechanize/_mechanize.py", line 203, in open
    return self._mech_open(url, data, timeout=timeout)
  File "/home/moshev/Projects/forumscrawler/lib/python2.7/site-packages/mechanize/_mechanize.py", line 255, in _mech_open
    raise response
mechanize._response.httperror_seek_wrapper: HTTP Error 403: Bad Behavior
```
Strangely, using curl *without* setting the user-agent results in "403: Forbidden" rather than "403: Bad Behavior".

Am I somehow doing something wrong, or is this a bug in mechanize/urllib2? I don't see how simply fetching robots.txt can be "bad behaviour".
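One detail worth noticing in the debug output above: the request mechanize sends carries only `Accept-Encoding`, `Host`, `Connection`, and `User-Agent`, with no `Accept` header at all, whereas real browsers (and curl, which sends `Accept: */*`) always include one. Some server-side anti-bot filters treat a missing `Accept` header as a sign of an automated client, which could explain the "Bad Behavior" response. A minimal sketch of a workaround, assuming this is the cause (`browser_headers` is a hypothetical helper, not part of mechanize):

```python
# Hypothetical helper: build a browser-like header list suitable for
# assigning to mechanize's br.addheaders. The key difference from the
# failing request in the debug output is the Accept header, which
# browsers always send but the script above never sets.
def browser_headers(user_agent):
    return [
        ('User-Agent', user_agent),
        ('Accept',
         'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'),
        ('Accept-Language', 'en-US,en;q=0.5'),
    ]

ua = ('Mozilla/5.0 (X11; Linux x86_64; rv:18.0) '
      'Gecko/20100101 Firefox/18.0 (compatible;)')
headers = browser_headers(ua)
# In the question's script this would replace the single-header line:
#   br.addheaders = browser_headers(ua)
print(dict(headers)['Accept'].split(',')[0])  # → text/html
```

This keeps the original User-Agent spoofing intact and only widens the header set; if the filter keys on something else entirely (cookies, TLS fingerprint), this sketch would not help.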