<p>This is not so much an answer as a set of suggestions you should consider. Your problem is far too specific to be taken care of with a code snippet or a hint.</p>
<p><strong>First</strong>: try to reduce the fetched data as much as possible. That said, why store the complete HTML code? I guess you are only interested in the text published on the webpage (aka the content). So strip the HTML from the data during the crawl and store only the pure information. If I am wrong and you are interested in something else, feel free to correct me.</p>
<p><strong>Second</strong>: try to produce queryable data. Your crawler should write the data to the database in a form that makes it easier to process. A map-reduce approach could be the way to go. This will make crawling take longer but enables fast data retrieval afterwards. It pretty much means storing only a map of all the pages you crawled rather than the complete content, or at least your queries won't touch the full data tables and will rely on the map-reduced content first.</p>
<p><strong>Third</strong>: upgrade your hardware - you want to process a lot of data? Be prepared (or bring time with you). Put as much RAM as you can into your MacBook (you can put RAM inside, right?! Please say you can upgrade RAM in Apple stuff...) since it is really cheap.</p>
<p><strong>Fourth</strong>: SQLite is HDD-heavy since it relies on the OS I/O cache and so on, and sometimes needs ages to refetch data. If you can, try to put it on an SSD (which will be unhealthy for the SSD in the long run ;-) ) or use a remote database with a fast connection to your PC, so the HDD->RAM->CPU cycle is no longer your limitation, only RAM and maybe CPU (I guess your program is not multi-core, right?).</p>
<p><strong>Fifth and final:</strong> even though I hate throwing in fancy words that are all over the media right now, have a look at IBM's article about <a href="http://www.ibm.com/developerworks/java/library/j-javadev2-15/index.html" rel="nofollow">hadoop</a>.</p>
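<p>To make the first two suggestions a bit more concrete, here is a minimal sketch in Python using only the standard library. It strips the HTML down to visible text at crawl time and also writes a small word-to-page map, so later lookups hit the map instead of scanning the full page bodies. This is a plain inverted index standing in for the "map" idea, not a real map-reduce job, and the table names (<code>pages</code>, <code>word_index</code>) and function names are made up for illustration, not taken from your code.</p>

```python
import re
import sqlite3
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the text pieces and collapse all runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()


def open_store(path=":memory:"):
    # Hypothetical schema: one table for page text, one for the word map.
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages "
        "(id INTEGER PRIMARY KEY, url TEXT UNIQUE, body TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS word_index "
        "(word TEXT, page_id INTEGER, PRIMARY KEY (word, page_id))"
    )
    return conn


def store_page(conn, url, html):
    """Store the stripped text and index its distinct words in one transaction."""
    text = html_to_text(html)
    with conn:
        cur = conn.execute(
            "INSERT INTO pages (url, body) VALUES (?, ?)", (url, text)
        )
        page_id = cur.lastrowid
        words = {w.lower() for w in re.findall(r"\w+", text)}
        conn.executemany(
            "INSERT OR IGNORE INTO word_index (word, page_id) VALUES (?, ?)",
            [(w, page_id) for w in words],
        )
    return page_id


def pages_containing(conn, word):
    """Look up URLs through the word map; the page bodies are never scanned."""
    rows = conn.execute(
        "SELECT p.url FROM word_index AS w "
        "JOIN pages AS p ON p.id = w.page_id "
        "WHERE w.word = ? ORDER BY p.url",
        (word.lower(),),
    )
    return [url for (url,) in rows]
```

<p>The crawler would call <code>store_page(conn, url, html)</code> once per fetched page, and queries would go through <code>pages_containing(conn, "someword")</code>. The trade-off is the one described above: each insert does more work, but lookups only touch the small index table.</p>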
 
