Extremely strange Web-Scraping issue: Post request not behaving as expected
<p>I'm attempting to programmatically submit some data to a form on our company's admin page rather than doing it by hand. </p>
<p>I've written numerous other tools which scrape this website and manipulate data. However, for some reason, this particular one is giving me a <strong>ton</strong> of trouble. </p>
<h2>Walking through with a browser:</h2>
<p>Below are the pages I'm attempting to scrape and post data to. Note that these pages usually show up in JS shadowboxes; however, they function fine with JavaScript disabled, so I'm assuming that JavaScript is not an issue with regard to the scraper trouble. </p>
<p><em>(Note, since this is a company page, I've replaced all the form fields with junk titles, so, for instance, the client numbers are completely made-up)</em> </p>
<p><em>Also, being that it is a company page behind a username/password wall, I can't give out the website for testing, so I've attempted to inject as much detail as possible into this post!</em></p>
<p>Main entry point is here: </p>
<p><img src="https://i.stack.imgur.com/xrcKp.png" alt="enter image description here"></p>
<p>From this page, I click <code>"Add New form"</code>, which opens this next page in a new tab (since JavaScript is disabled). </p>
<p><img src="https://i.stack.imgur.com/6ilkF.png" alt="enter image description here"></p>
<p>On this page, I fill out the small form and click submit, which then loads the next page displaying a success message.</p>
<p><img src="https://i.stack.imgur.com/HTb1b.png" alt="enter image description here"></p>
<p>Should be simple, right? 
</p>
<h2>Code attempt 1: Mechanize</h2>
<pre><code>import mechanize
import base64
import cookielib

br = mechanize.Browser()
username = 'USERNAME'
password = 'PASSWORD'
br.addheaders.append(('Authorization', 'Basic %s' % base64.encodestring('%s:%s' % (username, password))))
br.addheaders = [('User-agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.22 (KHTML,'
                  ' like Gecko) Chrome/25.0.1364.172 Safari/537.22')]
br.open('www.our_company_page.com/adm/add_forms.php')
links = [link for link in br.links()]

# Follow "Add a form" Link
response = br.follow_link(links[0])
br.select_form(nr=0)
br.form.set_all_readonly(False)
br.form['formNumber'] = "FROM_PYTHON"
br.form['RevisionNumber'] = ['20']
br.form['FormType'] = ['H(num)']
response = br.submit()
print response.read() #Shows the exact same page! &gt;:(
</code></pre>
<p>So, as you can see, I attempt to duplicate the steps that I would take in a browser. I load the initial <code>/adm/forms</code> page, follow the first link, which is <code>Add a Form</code>, fill out the form, and click the <code>submit</code> button. But here's where it gets screwy. The response that mechanize returns is the exact same page with the form. No error messages, no success messages, and when I manually check our admin page, no changes have been made. </p>
<h2>Inspecting Network Activity</h2>
<p>Frustrated, I opened Chrome and watched the network tab as I manually filled out and submitted the form in the browser. </p>
<p>Upon submitting the form, this is the network activity:</p>
<p><img src="https://i.stack.imgur.com/gFsbw.png" alt="enter image description here"></p>
<p>Seems pretty straightforward to me. There's the <code>post</code>, and then a <code>get</code> for the css files, and another <code>get</code> for the jquery library. There's another <code>get</code> for some kind of image, but I have no idea what that is for. 
</p>
<h3>Inspecting the details of the POST request:</h3>
<p><img src="https://i.stack.imgur.com/uDrW6.png" alt="enter image description here"></p>
<p>After some Googling about scraping problems, I saw a suggestion that the server may be expecting a certain header, and that I should simply copy everything that gets sent in the POST request and then slowly take away headers until I figure out which one was the important one. So I did just that: copied every bit of information in the Network tab and stuck it in my POST request. </p>
<h2>Code Attempt 2: Urllib</h2>
<p>I had some trouble figuring out all of the header stuff with <code>Mechanize</code>, so I switched over to urllib2. </p>
<pre><code>import urllib
import urllib2
import base64

url = 'www.our_company_page.com/adm/add_forms.php'
values = {
    'SID':'', #Hidden field
    'FormNumber':'FROM_PYTHON1030PM',
    'RevisionNumber':'5',
    'FormType':'H(num)',
    'fsubmit':'Save Page'
}
username = 'USERNAME'
password = 'PASSWORD'
headers = {
    'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Charset' : 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding' : 'gzip,deflate,sdch',
    'Accept-Language' : 'en-US,en;q=0.8',
    'User-Agent' : 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Authorization': 'Basic %s' % base64.encodestring('%s:%s' % (username, password)),
    'Cache-Control' : 'max-age=0',
    'Connection' : 'keep-alive',
    'Content-Type' : 'application/x-www-form-urlencoded',
    'Cookie' : 'ID=201399',
    'Host' : 'our_company_page.com',
    'Origin' : 'http://our_company_page.com',
    'Referer' : 'http://our_company_page.com/adm/add_form.php',
    'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.31 (KHTML, '
                   'like Gecko) Chrome/26.0.1410.43 Safari/537.31'
}
data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
print response.read()
</code></pre>
<p>As you can see, I added the headers present in Chrome's Network tab to the POST request in 
<code>urllib2</code>. </p>
<p>One additional change from the Mechanize version is that I now access the <code>add_form.php</code> page directly by adding the relevant cookie to my Request. </p>
<p>However, even after duplicating everything I can, I still have the exact same issue: The response is the exact same page I started on -- no errors, no success messages, no changes on the server, just returned to a blank form. </p>
<h2>Final Step: Desperation sets in, I install Wireshark</h2>
<p>Time to do some traffic sniffing. I'm determined to see WTF is going on in this magical post request! </p>
<p>I download, install, and fire up Wireshark. I filter for <code>http</code>, and then first submit the form manually in the browser, and then run my code, which attempts to submit the form programmatically. </p>
<p>This is the network traffic: </p>
<h2>Browser:</h2>
<p><img src="https://i.stack.imgur.com/MbRPB.png" alt="enter image description here"></p>
<h2>Python:</h2>
<p><img src="https://i.stack.imgur.com/HJFeL.png" alt="enter image description here"></p>
<p>Aside from the headers being in a slightly different order (does that matter?), they look exactly the same! </p>
<p>So that's where I am, completely confused as to why a <code>post</code> request, which is (as far as I can tell) nearly identical to the one made by the browser, isn't making any changes on the server. </p>
<p>Has anyone ever encountered anything like this? Am I missing something obvious? What's going on here? </p>
<hr>
<h2>Edit</h2>
<p>As per Ric's suggestion, I replicated the <code>POST</code> data exactly. I copied it directly from the Network Source tab in Chrome. 
</p>
<p>The modified code looks as follows:</p>
<pre><code>data = 'SegmentID=&amp;Segment=FROMPYTHON&amp;SegmentPosition=1&amp;SegmentContains=Sections&amp;fsubmit=Save+Page'
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
print response.read()
</code></pre>
<p>The only thing I changed was the <code>Segment</code> value from <code>FROMBROWSER</code> to <code>FROMPYTHON</code>. </p>
<p>Unfortunately, this still yields the same result. The response is the same page I started from. </p>
<h1>Update</h1>
<hr>
<h2>Working, but not solved</h2>
<p>I checked out the <code>requests</code> library, duplicated my efforts using their API, and lo, magically it worked! The POST actually went through. The question remains: <em>why</em>!? I again took another snapshot with Wireshark, and as near as I can tell it is exactly the same as the POST made from the browser. </p>
<h2>The Code</h2>
<pre><code>def post(eventID, name, pos, containsID):
    segmentContains = ["Sections", "Products"]
    url = 'http://my_site.com/adm/add_page.php'
    cookies = dict(EventID=str(eventID))
    payload = {
        "SegmentID" : "",
        "FormNumber" : name,
        "RevisionNumber" : str(pos),
        "FormType" : containsID,
        "fsubmit" : "Save Page"
    }
    r = requests.post(
        url,
        auth=(auth.username, auth.password),
        allow_redirects=True,
        cookies=cookies,
        data=payload)
</code></pre>
<h2>Wireshark output</h2>
<hr>
<h3>Requests</h3>
<p><img src="https://i.stack.imgur.com/BSKCH.png" alt="enter image description here"></p>
<h3>Browser</h3>
<p><img src="https://i.stack.imgur.com/w8LkT.png" alt="enter image description here"></p>
<p>So, to summarize the current state of the question: it works, but nothing has really changed. I have no idea why the attempts with both Mechanize and urllib2 failed. What is going on that allows the <code>requests</code> POST to actually go through? 
</p>
<h1>Edit -- Wing Tang Wong suggestion:</h1>
<p>At <code>Wing Tang Wong's</code> suggestion, I created a cookie handler and attached it to the <code>urllib.opener</code>. So no more cookies are being sent manually in the headers -- in fact, I don't assign any cookies at all now. </p>
<p>I first connect to the adm page which has the link to the form, rather than connecting to the form right away. </p>
<pre><code>'http://my_web_page.com/adm/segments.php?&amp;n=201399'
</code></pre>
<p>This gives the <code>ID</code> cookie to my <code>urllib</code> <code>cookieJar</code>. From this point I follow the link to the page that has the form, and then attempt to submit to it as usual. </p>
<h3>Full Code:</h3>
<pre><code>url = 'http://my_web_page.com/adm/segments.php?&amp;n=201399'
post_url = 'http://my_web_page.com/adm/add_page.php'
values = {
    'SegmentID':'',
    'Segment':'FROM_PYTHON1030PM',
    'SegmentPosition':'5',
    'SegmentContains':'Products',
    'fsubmit':'Save Page'
}
username = auth.username
password = auth.password
headers = {
    'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Charset' : 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding' : 'gzip,deflate,sdch',
    'Accept-Language' : 'en-US,en;q=0.8',
    'User-Agent' : 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Authorization': 'Basic %s' % base64.encodestring('%s:%s' % (username, password)),
    'Cache-Control' : 'max-age=0',
    'Connection' : 'keep-alive',
    'Content-Type' : 'application/x-www-form-urlencoded',
    'Host' : 'mt_site.com',
    'Origin' : 'http://my_site.com',
    'Referer' : 'http://my_site.com/adm/add_page.php',
    'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.43 Safari/537.31'
}

COOKIEFILE = 'cookies.lwp'
cj = cookielib.LWPCookieJar()
if os.path.isfile(COOKIEFILE):
    cj.load(COOKIEFILE)

opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)

data = urllib.urlencode(values)
req = urllib2.Request(url, headers=headers)
handle = urllib2.urlopen(req)
req = urllib2.Request(post_url, data, headers)
handle = urllib2.urlopen(req)
print handle.info()
print handle.read()
print
if cj:
    print 'These are the cookies we have received so far :'
    for index, cookie in enumerate(cj):
        print index, ' : ', cookie
    cj.save(COOKIEFILE)
</code></pre>
<p>Same thing as before. No changes get made on the server. To verify that the cookies are indeed there, I print them to the console after submitting the form, which gives the output: </p>
<pre><code>These are the cookies we have received so far :
&lt;Cookie EventID=201399 for my_site.com/adm&gt;
</code></pre>
<p>So, the cookie is there, and has been sent alongside the request, so I'm still not sure what's going on. </p>
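<p>One detail in the urllib2 attempts that may be worth ruling out: they build the <code>Authorization</code> header by hand with <code>base64.encodestring</code>, which appends a trailing newline to its output, whereas the <code>requests</code> version passes <code>auth=(...)</code> and lets the library build the header. Whether this explains the failure is not established by anything in the post. The following is a minimal sketch in Python 3 (where <code>encodestring</code> is named <code>encodebytes</code>) that prepares, but does not send, a form POST so the header and body can be inspected. The helper names <code>basic_auth_header</code> and <code>build_post</code> are hypothetical, and the URL, field values, and credentials are placeholders.</p>

```python
import base64
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar


def basic_auth_header(username: str, password: str) -> str:
    """Build a Basic auth header value on a single line (hypothetical helper)."""
    raw = f"{username}:{password}".encode("utf-8")
    # b64encode returns one line with no newline; encodebytes (the Python 3
    # name for Python 2's encodestring) appends "\n" to its output.
    return "Basic " + base64.b64encode(raw).decode("ascii")


def build_post(url: str, fields: dict, username: str, password: str) -> urllib.request.Request:
    """Prepare, but do not send, a form POST so it can be inspected (hypothetical helper)."""
    body = urllib.parse.urlencode(fields).encode("ascii")
    headers = {
        "Authorization": basic_auth_header(username, password),
        "Content-Type": "application/x-www-form-urlencoded",
    }
    return urllib.request.Request(url, data=body, headers=headers)


# Placeholder values, not the real site or credentials.
req = build_post(
    "http://my_site.com/adm/add_page.php",
    {"SegmentID": "", "Segment": "FROM_PYTHON", "fsubmit": "Save Page"},
    "USERNAME",
    "PASSWORD",
)

# Compare the single-line header with the legacy encoder's output.
legacy = base64.encodebytes(b"USERNAME:PASSWORD")
print(repr(req.get_header("Authorization")))
print(repr(legacy))
print(req.get_method(), req.data)

# An opener with a cookie jar would carry cookies between requests;
# nothing is sent over the network in this sketch.
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(CookieJar())
)
```

<p>Printing both values with <code>repr</code> makes a stray newline visible, which is easy to miss in a Wireshark screenshot. If the header values turn out to be identical, the difference between the working and failing requests lies elsewhere.</p>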