Crawl all the links of a page that is password protected
I am crawling a page that requires a username and password for authentication. I successfully got a 200 OK response back from the server for that page when I passed my username and password in the code. But the crawl stops as soon as it gets that 200 OK response back. **It doesn't move forward into that page after authentication to crawl all the links that are on that page.** The crawler is taken from http://code.google.com/p/crawler4j/. This is the code where I am doing the authentication stuff...

```java
public class MyCrawler extends WebCrawler {

    Pattern filters = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g"
            + "|png|tiff?|mid|mp2|mp3|mp4"
            + "|wav|avi|mov|mpeg|ram|m4v|pdf"
            + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

    List<String> exclusions;

    public MyCrawler() {
        exclusions = new ArrayList<String>();
        // Add here all your exclusions
        exclusions.add("http://www.dot.ca.gov/dist11/d11tmc/sdmap/cameras/cameras.html");
    }

    public boolean shouldVisit(WebURL url) {
        String href = url.getURL().toLowerCase();
        DefaultHttpClient client = null;
        try {
            System.out.println("----------------------------------------");
            System.out.println("WEB URL:- " + url);

            client = new DefaultHttpClient();
            client.getCredentialsProvider().setCredentials(
                    new AuthScope(AuthScope.ANY_HOST, AuthScope.ANY_PORT, AuthScope.ANY_REALM),
                    new UsernamePasswordCredentials("test", "test"));
            client.getParams().setParameter(ClientPNames.ALLOW_CIRCULAR_REDIRECTS, true);

            for (String exclusion : exclusions) {
                if (href.startsWith(exclusion)) {
                    return false;
                }
            }

            if (href.startsWith("http://") || href.startsWith("https://")) {
                return true;
            }

            HttpGet request = new HttpGet(url.toString());
            System.out.println("----------------------------------------");
            System.out.println("executing request" + request.getRequestLine());
            HttpResponse response = client.execute(request);
            HttpEntity entity = response.getEntity();
            System.out.println(response.getStatusLine());
        } catch (Exception e) {
            e.printStackTrace();
        }
        return false;
    }

    public void visit(Page page) {
        System.out.println("hello");
        int docid = page.getWebURL().getDocid();
        String url = page.getWebURL().getURL();
        System.out.println("Page:- " + url);
        String text = page.getText();
        List<WebURL> links = page.getURLs();
        int parentDocid = page.getWebURL().getParentDocid();

        System.out.println("Docid: " + docid);
        System.out.println("URL: " + url);
        System.out.println("Text length: " + text.length());
        System.out.println("Number of links: " + links.size());
        System.out.println("Docid of parent page: " + parentDocid);
    }
}
```

And this is my Controller class:

```java
public class Controller {
    public static void main(String[] args) throws Exception {
        CrawlController controller = new CrawlController("/data/crawl/root");

        // And I want to crawl all those links that are there in this password protected page
        controller.addSeed("http://search.somehost.com/");

        controller.start(MyCrawler.class, 20);
        controller.setPolitenessDelay(200);
        controller.setMaximumCrawlDepth(2);
    }
}
```

Am I doing anything wrong here?
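For comparison, below is a minimal sketch of a plain crawler4j setup built only from the API calls already shown above. The `PlainCrawler` class name and the import package paths are assumptions of mine and may differ between crawler4j versions, and this sketch does not address HTTP authentication at all. It only illustrates the usual division of work: `shouldVisit` filters URLs without fetching anything itself, all per-page handling happens in `visit`, and the controller is fully configured before `start` is called.

```java
import java.util.regex.Pattern;

// crawler4j imports — package paths may differ slightly between versions
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class PlainCrawler extends WebCrawler {

    // Skip typical binary/static resources; same pattern as in the question.
    private static final Pattern FILTERS = Pattern.compile(
            ".*(\\.(css|js|bmp|gif|jpe?g|png|tiff?|mid|mp2|mp3|mp4"
            + "|wav|avi|mov|mpeg|ram|m4v|pdf|rm|smil|wmv|swf|wma|zip|rar|gz))$");

    @Override
    public boolean shouldVisit(WebURL url) {
        // Only decide whether the URL is interesting; do not fetch it here.
        // crawler4j downloads the page itself and then calls visit().
        String href = url.getURL().toLowerCase();
        return !FILTERS.matcher(href).matches()
                && href.startsWith("http://search.somehost.com/");
    }

    @Override
    public void visit(Page page) {
        // All per-page processing happens here, after the page has been fetched.
        System.out.println("Visited: " + page.getWebURL().getURL());
        System.out.println("Links found: " + page.getURLs().size());
    }

    public static void main(String[] args) throws Exception {
        CrawlController controller = new CrawlController("/data/crawl/root");

        // Apply all configuration before starting the crawl.
        controller.setPolitenessDelay(200);
        controller.setMaximumCrawlDepth(2);
        controller.addSeed("http://search.somehost.com/");

        // start() runs the crawl with 20 crawler threads.
        controller.start(PlainCrawler.class, 20);
    }
}
```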