By using a [web crawler](http://en.wikipedia.org/wiki/Web_crawler), e.g. one of these:

- DataparkSearch is a crawler and search engine released under the GNU General Public License.
- GNU Wget is a command-line crawler written in C and released under the GPL. It is typically used to mirror web and FTP sites.
- HTTrack uses a web crawler to create a mirror of a web site for off-line viewing. It is written in C and released under the GPL.
- ICDL Crawler is a cross-platform web crawler written in C++, intended to crawl websites based on Website Parse Templates using only a computer's free CPU resources.
- JSpider is a highly configurable and customizable web spider engine released under the GPL.
- Larbin, by Sebastien Ailleret.
- Webtools4larbin, by Andreas Beder.
- Methabot is a speed-optimized web crawler and command-line utility written in C and released under a 2-clause BSD License. It features a flexible configuration system and a module system, and supports targeted crawling through the local filesystem, HTTP, or FTP.
- Jaeksoft WebSearch is a web crawler and indexer built on Apache Lucene. It is released under the GPL v3 license.
- Nutch is a crawler written in Java and released under an Apache License. It can be used in conjunction with the Lucene text-indexing package.
- Pavuk is a command-line web mirror tool with an optional X11 GUI crawler, released under the GPL. It has a number of advanced features compared to wget and httrack, e.g. regular-expression-based filtering and file creation rules.
- WebVac is a crawler used by the Stanford WebBase Project.
- WebSPHINX (Miller and Bharat, 1998) is composed of a Java class library that implements multi-threaded web page retrieval and HTML parsing, plus a graphical user interface for setting the starting URLs, extracting the downloaded data, and implementing a basic text-based search engine.
- WIRE (Web Information Retrieval Environment) [15] is a web crawler written in C++ and released under the GPL. It includes several policies for scheduling page downloads and a module for generating reports and statistics on the downloaded pages, so it has been used for web characterization.
- LWP::RobotUA (Langheinrich, 2004) is a Perl class for implementing well-behaved parallel web robots, distributed under Perl 5's license.
- Web Crawler is an open-source web crawler class for .NET (written in C#).
- Sherlock Holmes gathers and indexes textual data (text files, web pages, ...), both locally and over the network. Holmes is sponsored and commercially used by the Czech web portal Centrum, and is also used by Onet.pl.
- YaCy is a free distributed search engine built on peer-to-peer principles (licensed under the GPL).
- Ruya is an open-source, high-performance, breadth-first, level-based web crawler used to crawl English and Japanese websites in a well-behaved manner. It is released under the GPL and written entirely in Python. A SingleDomainDelayCrawler implementation obeys robots.txt with a crawl delay.
- Universal Information Crawler is a fast-developing web crawler that crawls, saves, and analyzes the data.
- Agent Kernel is a Java framework for schedule, thread, and storage management when crawling.
- Spider News: information about building a spider in Perl.
- Arachnode.NET is an open-source promiscuous web crawler for downloading, indexing, and storing Internet content, including e-mail addresses, files, hyperlinks, images, and web pages. It is written in C# using SQL Server 2005 and released under the GPL.
- dine is a multi-threaded Java HTTP client/crawler that can be programmed in JavaScript, released under the LGPL.
- Crawljax is an Ajax crawler based on a method that dynamically builds a "state-flow graph" modeling the various navigation paths and states within an Ajax application. Crawljax is written in Java and released under the BSD License.
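Most of the tools above share the same core loop: fetch a page, extract its links, and enqueue the unseen ones. Here is a minimal sketch of that loop in Python using only the standard library; the start URL is a placeholder, and a real crawler would layer robots.txt handling, politeness delays, and retry logic on top (those are exactly the features that distinguish the projects listed).

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collects href targets from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, max_pages=50):
    domain = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])   # FIFO frontier gives breadth-first order
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        try:
            with urlopen(url, timeout=10) as resp:
                if "text/html" not in resp.headers.get("Content-Type", ""):
                    continue     # skip images, PDFs, etc.
                body = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue             # unreachable page: skip it
        fetched += 1
        print(url)
        parser = LinkParser()
        parser.feed(body)
        for link in parser.links:
            absolute = urljoin(url, link).split("#")[0]  # drop fragments
            # stay on the starting domain and visit each URL at most once
            if urlparse(absolute).netloc == domain and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)

if __name__ == "__main__":
    crawl("http://example.com/")  # placeholder start URL
```

Using a deque here gives the same breadth-first, level-based traversal that Ruya's description mentions; swapping it for a stack would make the crawl depth-first instead.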
 
