Crawl all the links of a page that is password protected
I am crawling a page that requires a username and password for authentication. I successfully got a 200 OK response back from the server for that page when I passed my username and password in the code. But the crawl stops as soon as it gets that 200 OK response back. **It doesn't move forward into that page after authentication to crawl all the links that are on that page.** The crawler is taken from http://code.google.com/p/crawler4j/. This is the code where I am doing the authentication stuff...

```java
public class MyCrawler extends WebCrawler {

    Pattern filters = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g"
            + "|png|tiff?|mid|mp2|mp3|mp4"
            + "|wav|avi|mov|mpeg|ram|m4v|pdf"
            + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

    List<String> exclusions;

    public MyCrawler() {
        exclusions = new ArrayList<String>();
        // Add here all your exclusions
        exclusions.add("http://www.dot.ca.gov/dist11/d11tmc/sdmap/cameras/cameras.html");
    }

    public boolean shouldVisit(WebURL url) {
        String href = url.getURL().toLowerCase();
        DefaultHttpClient client = null;
        try {
            System.out.println("----------------------------------------");
            System.out.println("WEB URL:- " + url);

            client = new DefaultHttpClient();
            client.getCredentialsProvider().setCredentials(
                    new AuthScope(AuthScope.ANY_HOST, AuthScope.ANY_PORT, AuthScope.ANY_REALM),
                    new UsernamePasswordCredentials("test", "test"));
            client.getParams().setParameter(ClientPNames.ALLOW_CIRCULAR_REDIRECTS, true);

            for (String exclusion : exclusions) {
                if (href.startsWith(exclusion)) {
                    return false;
                }
            }

            if (href.startsWith("http://") || href.startsWith("https://")) {
                return true;
            }

            HttpGet request = new HttpGet(url.toString());
            System.out.println("----------------------------------------");
            System.out.println("executing request" + request.getRequestLine());
            HttpResponse response = client.execute(request);
            HttpEntity entity = response.getEntity();
            System.out.println(response.getStatusLine());
        } catch (Exception e) {
            e.printStackTrace();
        }
        return false;
    }

    public void visit(Page page) {
        System.out.println("hello");
        int docid = page.getWebURL().getDocid();
        String url = page.getWebURL().getURL();
        System.out.println("Page:- " + url);
        String text = page.getText();
        List<WebURL> links = page.getURLs();
        int parentDocid = page.getWebURL().getParentDocid();

        System.out.println("Docid: " + docid);
        System.out.println("URL: " + url);
        System.out.println("Text length: " + text.length());
        System.out.println("Number of links: " + links.size());
        System.out.println("Docid of parent page: " + parentDocid);
    }
}
```

And this is my Controller class:

```java
public class Controller {
    public static void main(String[] args) throws Exception {
        CrawlController controller = new CrawlController("/data/crawl/root");

        // And I want to crawl all those links that are there in this password protected page
        controller.addSeed("http://search.somehost.com/");

        controller.start(MyCrawler.class, 20);
        controller.setPolitenessDelay(200);
        controller.setMaximumCrawlDepth(2);
    }
}
```

Am I doing anything wrong here?
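For comparison, below is a minimal sketch of a plain crawler4j setup built only from the API calls already shown above. The `PlainCrawler` class name and the import package paths are assumptions of mine and may differ between crawler4j versions, and this sketch does not address HTTP authentication at all. It only illustrates the usual division of work: `shouldVisit` filters URLs without fetching anything itself, all per-page handling happens in `visit`, and the controller is fully configured before `start` is called.

```java
import java.util.regex.Pattern;

// crawler4j imports — package paths may differ slightly between versions
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class PlainCrawler extends WebCrawler {

    // Skip typical binary/static resources; same pattern as in the question.
    private static final Pattern FILTERS = Pattern.compile(
            ".*(\\.(css|js|bmp|gif|jpe?g|png|tiff?|mid|mp2|mp3|mp4"
            + "|wav|avi|mov|mpeg|ram|m4v|pdf|rm|smil|wmv|swf|wma|zip|rar|gz))$");

    @Override
    public boolean shouldVisit(WebURL url) {
        // Only decide whether the URL is interesting; do not fetch it here.
        // crawler4j downloads the page itself and then calls visit().
        String href = url.getURL().toLowerCase();
        return !FILTERS.matcher(href).matches()
                && href.startsWith("http://search.somehost.com/");
    }

    @Override
    public void visit(Page page) {
        // All per-page processing happens here, after the page has been fetched.
        System.out.println("Visited: " + page.getWebURL().getURL());
        System.out.println("Links found: " + page.getURLs().size());
    }

    public static void main(String[] args) throws Exception {
        CrawlController controller = new CrawlController("/data/crawl/root");

        // Apply all configuration before starting the crawl.
        controller.setPolitenessDelay(200);
        controller.setMaximumCrawlDepth(2);
        controller.addSeed("http://search.somehost.com/");

        // start() runs the crawl with 20 crawler threads.
        controller.start(PlainCrawler.class, 20);
    }
}
```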