
<p>Building robots isn't that hard, and there are a number of books that describe the general algorithm for doing so (a simple Google search will turn up a number of algorithms).</p>
<p>The gist of it, from a .NET perspective, is to recursively:</p>
<ul>
<li><p>Download pages - This is done through the <a href="http://msdn.microsoft.com/en-us/library/system.net.httpwebrequest.aspx" rel="nofollow noreferrer"><code>HttpWebRequest</code></a>/<a href="http://msdn.microsoft.com/en-us/library/system.net.httpwebresponse.aspx" rel="nofollow noreferrer"><code>HttpWebResponse</code></a> or <a href="http://msdn.microsoft.com/en-us/library/system.net.webclient.aspx" rel="nofollow noreferrer"><code>WebClient</code></a> classes. Alternatively, you can use the new <a href="http://wcf.codeplex.com/wikipage?title=WCF%20HTTP" rel="nofollow noreferrer">WCF Web API from CodePlex</a>, which is a <em>vast</em> improvement over the above; it is designed specifically for producing/consuming REST content and works <em>wonderfully</em> for spidering purposes (mainly because of its extensibility).</p></li>
<li><p>Parse the downloaded content - I <em>highly</em> recommend the <a href="http://htmlagilitypack.codeplex.com/" rel="nofollow noreferrer">Html Agility Pack</a> as well as the <a href="http://code.google.com/p/fizzler/" rel="nofollow noreferrer">fizzler</a> extension for it. The Html Agility Pack will handle malformed HTML and let you query HTML elements using XPath (or a subset thereof). Additionally, fizzler lets you use <a href="http://www.w3.org/TR/CSS2/selector.html" rel="nofollow noreferrer">CSS selectors</a> if you are familiar with <a href="http://api.jquery.com/category/selectors/" rel="nofollow noreferrer">using them in jQuery</a>.</p></li>
<li><p>Once you have the HTML in a structured format, scan the structure for the content that is relevant to you and process it.</p>
<ul>
<li><p>Scan the structured format for external links and place them in the queue to be processed (subject to whatever constraints you want for your app - you aren't indexing the entire web, are you?).</p></li>
<li><p>Get the next item in the queue and repeat the process.</p></li>
</ul></li>
</ul>
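<p>The loop described above can be sketched roughly as follows. This is a minimal illustration, not production code: the seed URL and the 100-page cap are assumptions for the example, there is no robots.txt handling or politeness delay, and it assumes the HtmlAgilityPack package is referenced.</p>

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using HtmlAgilityPack;

class SimpleSpider
{
    static void Main()
    {
        var queue = new Queue<string>();
        var visited = new HashSet<string>();
        queue.Enqueue("http://example.com/"); // hypothetical seed URL

        using (var client = new WebClient())
        {
            while (queue.Count > 0 && visited.Count < 100) // arbitrary page limit
            {
                string url = queue.Dequeue();
                if (!visited.Add(url)) continue; // skip pages we've already seen

                // 1. Download the page
                string html;
                try { html = client.DownloadString(url); }
                catch (WebException) { continue; }

                // 2. Parse it - the Html Agility Pack tolerates malformed HTML
                var doc = new HtmlDocument();
                doc.LoadHtml(html);

                // 3. Process whatever content is relevant to you (here, the <title>)
                var title = doc.DocumentNode.SelectSingleNode("//title");
                Console.WriteLine("{0} -> {1}", url,
                    title == null ? "(no title)" : title.InnerText.Trim());

                // 4. Queue links for the next pass
                var links = doc.DocumentNode.SelectNodes("//a[@href]");
                if (links == null) continue; // SelectNodes returns null on no match
                foreach (var link in links)
                {
                    Uri absolute;
                    if (Uri.TryCreate(new Uri(url),
                        link.GetAttributeValue("href", ""), out absolute))
                        queue.Enqueue(absolute.AbsoluteUri);
                }
            }
        }
    }
}
```

<p>The <code>visited</code> set is what keeps the recursion from looping on pages that link to each other; in a real spider you would also normalize URLs (strip fragments, canonicalize case) before adding them to it.</p>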
 
