<p>First of all, if you are attempting any kind of scraping (yes, this counts as scraping even though you are not necessarily parsing HTML), you have a certain amount of preliminary investigation to perform.</p>
<p>If you don't already have <a href="http://www.mozilla.org/en-US/firefox/new/" rel="noreferrer">Firefox</a> and <a href="http://getfirebug.com/" rel="noreferrer">Firebug</a>, get them. Then if you don't already have <a href="https://www.google.com/intl/en/chrome/browser/" rel="noreferrer">Chrome</a>, get it.</p>
<p>Start up Firefox/Firebug and Chrome, and clear out all of your cookies, etc. Then open up Firebug, and in Chrome open up View->Developer->Developer Tools.</p>
<p>Then load up the main page of the video you are trying to grab. Take note of any cookies/headers/POST variables/query string variables that are being set when the page loads. You may want to save this info somewhere.</p>
<p>Then try to download the video and, once again, take note of any cookies/headers/POST variables/query string variables that are being set when the video is loaded. It is very likely that a cookie or POST variable set when you initially loaded the page is required to actually pull the video file.</p>
<p>When you write your Python, you are going to need to emulate this interaction as closely as possible. Use <a href="http://docs.python-requests.org/en/latest/" rel="noreferrer">python-requests</a>. This is probably the simplest URL library available, and unless you somehow run into a wall with it (something it can't do), I would never use anything else. The second I started using <a href="http://docs.python-requests.org/en/latest/" rel="noreferrer">python-requests</a>, all of my URL fetching code shrank by a factor of 5.</p>
<p>Now, things are probably not going to work the first time you try them. So you will need to load the main page using Python.
Print out all of your cookies/headers/POST variables/query string variables, and compare them to what Chrome/Firebug had. Then try loading your video and, once again, compare all of these values (that means what YOU sent the server, and what the SERVER sent back to you as well). You will need to figure out what is different between them (don't worry, we ALL learned this one in kindergarten... "one of these things is not like the other") and dissect how that difference is breaking stuff.</p>
<p>If at the end of all of this you still can't figure it out, then you probably need to look at the HTML for the page that contains the link to the movie. Look for any JavaScript in the page. Then use Firebug/Chrome Developer Tools to inspect the JavaScript and see if it is doing some kind of management of your user session. If it is somehow generating tokens (cookies or POST/GET variables) related to video access, you will need to emulate its tokenizing method in Python.</p>
<p>Hopefully all of this helps, and doesn't look too scary. The key is that you are going to need to be a scientist. Figure out what you know, what you don't, and what you want, then start experimenting and recording your results. Eventually a pattern will emerge.</p>
<p><strong>Edit:</strong> Clarify steps</p>
<ol>
<li>Investigate how state is being maintained</li>
<li>Pull the initial page with Python, and grab any state info you need from it</li>
<li>Perform any tokenizing that may be required with that state info</li>
<li>Pull the video using the tokens from steps 2 and 3</li>
<li>If stuff blows up, output your request/response headers, cookies, query vars, and POST vars, and compare them to Chrome/Firebug</li>
<li>Return to step 1 until you find a solution</li>
</ol>
<p><strong>Edit:</strong> You may also be getting redirected at either one of these requests (the HTML page or the file download). You will most likely miss the request/response in Firebug/Chrome if that is happening.
The solution would be to use a sniffer like <a href="http://livehttpheaders.mozdev.org/" rel="noreferrer">LiveHTTPHeaders</a> or, as other responders have suggested, <a href="http://www.wireshark.org/" rel="noreferrer">Wireshark</a> or <a href="http://www.fiddler2.com/fiddler2/" rel="noreferrer">Fiddler</a>. Note that Fiddler will do you no good if you are on a Linux or OS X box. It is Windows-only and is definitely focused on .NET development... (ugh). Wireshark is very useful but overkill for most problems, and depending on what machine you are running, you may have problems getting it working. So I would suggest LiveHTTPHeaders first.</p>
<p>I love this kind of problem.</p>
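<p>To make the steps above a bit more concrete, here is a rough sketch of what the Python side might look like. It is only an illustration under assumptions: <code>diff_state</code> and <code>fetch_video</code> are names I made up, the <code>session</code> argument is assumed to be a <code>requests.Session</code>, and the <code>Referer</code> header is just one guess at what a server might check. Your investigation decides what actually has to be sent.</p>

```python
def diff_state(browser, script, ignore_case=False):
    """Report how two dicts of headers, cookies, or query/POST variables differ.

    `browser` holds what Firebug/Chrome showed; `script` holds what your
    Python code sent or received. Use ignore_case=True for header names,
    which are case-insensitive (cookie names are not).
    """
    if ignore_case:
        browser = {k.lower(): v for k, v in browser.items()}
        script = {k.lower(): v for k, v in script.items()}
    return {
        # Present in the browser exchange but absent from yours.
        "missing": {k: v for k, v in browser.items() if k not in script},
        # Present in your exchange but absent from the browser's.
        "extra": {k: v for k, v in script.items() if k not in browser},
        # Present in both, with different values: (browser, script).
        "changed": {k: (v, script[k]) for k, v in browser.items()
                    if k in script and script[k] != v},
    }


def fetch_video(session, page_url, video_url, out_path):
    """Load the page first so the session keeps its cookies, then pull the video.

    `session` is assumed to be a requests.Session. The Referer header is a
    placeholder assumption, not something every site requires.
    """
    page = session.get(page_url)
    page.raise_for_status()
    # Redirects that the browser tools may have hidden show up in .history.
    for hop in page.history:
        print("redirect:", hop.status_code, hop.url)

    video = session.get(video_url, headers={"Referer": page_url}, stream=True)
    video.raise_for_status()
    # What we actually sent, and what the server sent back.
    print("sent headers:", dict(video.request.headers))
    print("received headers:", dict(video.headers))
    print("session cookies:", session.cookies.get_dict())

    # Stream to disk in chunks so a large file never sits fully in memory.
    with open(out_path, "wb") as fh:
        for chunk in video.iter_content(chunk_size=8192):
            fh.write(chunk)
    return video
```

<p>With python-requests installed, you would call something like <code>fetch_video(requests.Session(), page_url, video_url, "video.mp4")</code>, then feed the printed values and the ones you saved from Chrome/Firebug into <code>diff_state</code> to spot the thing that is not like the other. If the site generates tokens in JavaScript, that step would have to go between the two requests.</p>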
 
