What you (probably) want to use is called "recursion".

Web pages are graphs. There are several algorithms for traversal of graphs; the simplest to understand is depth-first.

Say your site is laid out like this (recursion terminated):

```
* http://example.com/
    * http://example.com/
        * ...
    * http://example.com/post/1/
        * http://example.com/
            * ...
        * http://example.com/about/
            * ...
        * http://example.com/archives/
            * ...
    * http://example.com/post/2/
        * http://example.com/
            * ...
        * http://example.com/about/
            * ...
        * http://example.com/archives/
            * ...
    * http://example.com/post/3/
        * http://example.com/
            * ...
        * http://example.com/about/
            * ...
        * http://example.com/archives/
            * ...
    * http://example.com/about/
        * http://example.com/
            * ...
        * http://example.com/archives/
    * http://example.com/archives/
        * http://example.com/
            * ...
        * http://example.com/about/
            * ...
        * http://example.com/post/1/
            * http://example.com/
                * ...
            * http://example.com/about/
                * ...
            * http://example.com/archives/
                * ...
        * http://example.com/post/2/
            * http://example.com/
                * ...
            * http://example.com/about/
                * ...
            * http://example.com/archives/
                * ...
        * http://example.com/post/3/
            * http://example.com/
                * ...
            * http://example.com/about/
                * ...
            * http://example.com/archives/
                * ...
        * http://example.com/post/4/
            * http://example.com/
                * ...
            * http://example.com/about/
                * ...
            * http://example.com/archives/
                * ...
        * http://example.com/post/5/
            * http://example.com/
                * ...
            * http://example.com/about/
                * ...
            * http://example.com/archives/
                * ...
```

When you first hit http://example.com/, you have the following links:

* http://example.com/
* http://example.com/post/1/
* http://example.com/post/2/
* http://example.com/post/3/
* http://example.com/about/
* http://example.com/archives/

You need to keep track of pages you have already visited, so you can ignore them. (Otherwise, it'd take forever to spider a page ... literally.) You add to the ignore list every time you visit a page. Right now, the only entry in the ignore list is http://example.com/.

Next, you filter out the ignored links, reducing the list to:

* http://example.com/post/1/
* http://example.com/post/2/
* http://example.com/post/3/
* http://example.com/about/
* http://example.com/archives/
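To make that filtering step concrete, here is a minimal sketch. The variable names (`$links`, `$ignoredUrls`, `$newLinks`) are only for illustration and are not taken from the implementation further down:

```php
<?php
// Links extracted from the page we just visited.
$links = array(
    'http://example.com/',
    'http://example.com/post/1/',
    'http://example.com/post/2/',
    'http://example.com/post/3/',
    'http://example.com/about/',
    'http://example.com/archives/',
);

// Pages we have already visited.
$ignoredUrls = array('http://example.com/');

// Keep only the links we have not seen yet.
$newLinks = array_values(array_diff($links, $ignoredUrls));

print_r($newLinks); // everything except http://example.com/
```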
You then run the fetcher again on each of these links. You do this by calling your function again, with the current URL and ignore list: `spider($url, &$ignoredUrls)`. (We pass `$ignoredUrls` by reference so newly ignored items are visible to the parent `spider` calls.)

Looking at http://example.com/post/1/, we see the following links:

* http://example.com/
* http://example.com/about/
* http://example.com/archives/

We have already looked at http://example.com/. The next link which isn't ignored is the about page. From the about page, we go to the archives page, where we look through each post. Each post has the same set of links:

* http://example.com/
* http://example.com/about/
* http://example.com/archives/

Because we have already visited all of those links, we return an empty array.

Back at `/archives/`, we append the `/post/2/` link (the first non-ignored link in `/archives/`) to a `$foundLinks` local variable, as well as the return value of the call to `spider` on `/post/2/` (which is an empty array). Then we move on to the second post.

When we have gone through all our posts, we return `$foundLinks`. The `/about/` page then adds those links to its own `$foundLinks`, in addition to the `/about/` link. Flow goes back to `/post/1/`, which looks at `/archives/` (which is now ignored). The `/post/1/` spider is now complete, and returns its own `$foundLinks`. Eventually, the original call gets all the found links.

---

This method works fine for a small site which is completely closed. If you link to Wikipedia, though, you'll be spidering all day long. You can combat this problem in at least two ways:

1. Terminating spidering after a certain depth (e.g. 10 links deep).
2. Restricting the URLs, e.g. to a certain domain or subdomain (like `example.com`).

---

Here's a quick implementation of `spider` (untested):

```php
function get_urls($url) {
    // curl/DOM code here
}

define('SPIDER_MAX_DEPTH', 10);

function spider_internal($url, &$ignoredUrls, $depth = 0) {
    $foundUrls = array($url);

    // Mark this page as visited so nobody spiders it again.
    $ignoredUrls[] = $url;

    if ($depth >= SPIDER_MAX_DEPTH) {
        return $foundUrls;
    }

    $links = get_urls($url);

    foreach ($links as $link) {
        if (array_search($link, $ignoredUrls) !== false) {
            continue;
        }

        $foundUrls = array_merge($foundUrls, spider_internal($link, $ignoredUrls, $depth + 1));
    }

    return $foundUrls;
}

function spider($url) {
    $ignoredUrls = array();
    return spider_internal($url, $ignoredUrls);
}
```
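The `get_urls` function is left as a stub above. One possible way to fill it in is with cURL and `DOMDocument`, both available in a standard PHP install. The sketch below is only illustrative: it skips error handling, ignores robots.txt, and keeps absolute http(s) links only rather than resolving relative ones.

```php
<?php
// A possible implementation of the get_urls() stub, using cURL and DOMDocument.
function get_urls($url) {
    // Fetch the page body.
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $html = curl_exec($ch);
    curl_close($ch);

    if ($html === false) {
        return array();
    }

    // Parse the HTML and collect the href of every <a> tag.
    $dom = new DOMDocument();
    @$dom->loadHTML($html); // suppress warnings about malformed markup

    $urls = array();
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');

        // For simplicity, keep absolute links only; resolving relative
        // links against the base URL is left out of this sketch.
        if (strpos($href, 'http://') !== 0 && strpos($href, 'https://') !== 0) {
            continue;
        }

        $urls[] = $href;
    }

    return array_unique($urls);
}
```

With something like that in place, calling `spider('http://example.com/')` should return the URLs reachable from the front page, up to `SPIDER_MAX_DEPTH` levels deep.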
 
