Fetching a cookie-enabled page in Python
<p>I want to download a webpage with Python for a web scraping task. The problem is that the website requires cookies to be enabled; otherwise it serves a different version of the page. <strong>I implemented a solution that works, but it seems inefficient to me, and I need help improving it.</strong></p>
<p>This is how I handle it now:</p>
<pre><code>import requests
import cookielib

cj = cookielib.CookieJar()
user_agent = {'User-agent': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'}

# first request to get the cookies
requests.get(
    'https://ccirecruit.cox.com/psc/RECRUIT/EMPLOYEE/HRMS/c/HRS_HRAM.HRS_CE.GBL?JobOpeningId=42845&amp;SiteId=1&amp;Page=HRS_CE_JOB_DTL&amp;PostingSeq=1&amp;',
    headers=user_agent, timeout=2, cookies=cj)

# second request reusing cookies served the first time
r = requests.get(
    'https://ccirecruit.cox.com/psc/RECRUIT/EMPLOYEE/HRMS/c/HRS_HRAM.HRS_CE.GBL?JobOpeningId=42845&amp;SiteId=1&amp;Page=HRS_CE_JOB_DTL&amp;PostingSeq=1&amp;',
    headers=user_agent, timeout=2, cookies=cj)
html_text = r.text
</code></pre>
<p>Basically, I create a <code>CookieJar</code> object and then send two consecutive requests for the same URL. <strong>The first response is the bad page, but it comes with cookies. The second request reuses those cookies and I get the right page.</strong></p>
<p>The question is: <strong>Is it possible to use just one request and still get the right, cookie-enabled version of the page?</strong></p>
<p>I tried sending a <code>HEAD</code> request first instead of <code>GET</code> to minimize traffic, but in that case no cookies are served. Googling didn't give me an answer either. So I'd like to understand how to do this efficiently. Any ideas?</p>
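<p>For illustration, the two-request pattern described above can be sketched with a cookie-aware opener, so that the cookie-collecting request happens only once and later pages cost a single request each. This is a minimal sketch, not a definitive answer: it swaps <code>requests</code> and <code>cookielib</code> for the Python 3 standard library (<code>urllib.request</code> and <code>http.cookiejar</code>), the helper names <code>build_cookie_opener</code> and <code>fetch_cookie_enabled_page</code> are hypothetical, and it assumes the server accepts cookies set on one URL for the other URLs being scraped.</p>

```python
import urllib.request
from http.cookiejar import CookieJar


def build_cookie_opener(user_agent):
    """Return (opener, jar); the opener stores and resends cookies via jar."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar


def fetch_cookie_enabled_page(url, opener, jar, timeout=2):
    """Fetch url, making one extra priming request only while jar is empty."""
    if len(jar) == 0:
        # Priming request: the body is discarded, only Set-Cookie matters.
        with opener.open(url, timeout=timeout) as first:
            first.read()
    # The opener attaches any stored cookies to this request automatically.
    with opener.open(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

<p>The first page still takes two requests, but every later call that reuses the same <code>opener</code> and <code>jar</code> sends only one. With <code>requests</code>, a <code>requests.Session</code> object persists cookies across calls in a comparable way.</p>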
 
