StackOverflow2013

<p>I'd recommend that you check out my answers here: <a href="https://stackoverflow.com/questions/9029822/how-can-i-bring-google-like-recrawling-in-my-applicationweb-or-console/9099798#9099798">How can I bring google-like recrawling in my application(web or console)</a> and <a href="https://stackoverflow.com/questions/5834808/designing-a-web-crawler/5834890#5834890">Designing a web crawler</a>.</p>
<p>The first answer was written for a C# question, but it is language-agnostic, so it applies to Java too. Both answers link to some good reading material. I'd also suggest that you try one of the <a href="https://stackoverflow.com/questions/2495289/what-is-a-good-java-web-crawler-library" rel="nofollow noreferrer">existing Java crawlers</a> rather than writing one yourself (it's not a small project).</p>
<blockquote> <p>...a web crawler in java which can take a user query about a particular news subject and then visits different news websites and then extracts news content from those pages and store it in some files/databases.</p> </blockquote>
<p>That requirement seems to go beyond the scope of "just a crawler" and into the area of machine learning and natural language processing. If you have a list of websites that you're sure serve news, then you might be able to extract the news content. However, even then you have to determine which parts of each page are news and which are not (e.g. there might also be links, ads, comments, etc.). So exactly what kind of requirements are you facing here? Do you have a list of news websites? Do you have a reliable way to extract news?</p>
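<p>To make the "which part of the page is news" problem concrete, here is a minimal sketch of one common heuristic: keep text blocks that are reasonably long and contain few links, and drop link-heavy blocks such as navigation bars. This is only an illustration, not a reliable extractor. The class name <code>NewsBlockFilter</code>, both thresholds, and the regex-based HTML handling are assumptions I made up for brevity; a real crawler would use a proper HTML parser and would likely need per-site tuning.</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Naive link-density filter: a sketch under stated assumptions, not a production extractor.
class NewsBlockFilter {
    // Regex-based HTML handling is a simplification; real pages need a real parser.
    private static final Pattern PARAGRAPH =
        Pattern.compile("<p(?:\\s[^>]*)?>(.*?)</p>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern ANCHOR =
        Pattern.compile("<a(?:\\s[^>]*)?>(.*?)</a>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    // Hypothetical thresholds, chosen only for illustration.
    static final int MIN_TEXT_LENGTH = 40;
    static final double MAX_LINK_DENSITY = 0.3;

    // Removes all tags and returns the trimmed visible text.
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("").trim();
    }

    // Fraction of a block's visible text that sits inside <a> elements.
    static double linkDensity(String blockHtml) {
        String text = stripTags(blockHtml);
        if (text.isEmpty()) {
            return 1.0; // treat empty blocks as pure boilerplate
        }
        int linkChars = 0;
        Matcher m = ANCHOR.matcher(blockHtml);
        while (m.find()) {
            linkChars += stripTags(m.group(1)).length();
        }
        return (double) linkChars / text.length();
    }

    // Returns the text of <p> blocks that are long enough and not dominated by links.
    static List<String> extractLikelyContent(String html) {
        List<String> kept = new ArrayList<>();
        Matcher m = PARAGRAPH.matcher(html);
        while (m.find()) {
            String block = m.group(1);
            String text = stripTags(block);
            if (text.length() >= MIN_TEXT_LENGTH && linkDensity(block) <= MAX_LINK_DENSITY) {
                kept.add(text);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String html =
            "<p>The city council approved the new transit budget on Tuesday after a long debate.</p>"
            + "<p><a href=\"/a\">Home</a> <a href=\"/b\">Sports</a> <a href=\"/c\">Weather</a></p>";
        for (String paragraph : extractLikelyContent(html)) {
            System.out.println(paragraph);
        }
    }
}
```

<p>Even on a known news site, a filter like this will misclassify some blocks (long comments look like articles, short news briefs look like boilerplate), which is why the question of having a reliable extraction method matters.</p>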
