Is there a more efficient way of querying HTML data than in an SQLite database?
    <p>Maybe my question will be closed because it is not "constructive" enough, but anyway... I've already searched for answers, but most of them are too general. For my master's thesis project I have to crawl lots of (i.e. several thousand) webpages and store them in their entirety in a database. This is necessary because I have to analyze them in different ways, try out several machine learning algorithms and parse them in different ways. At the moment, I'm using an SQLite database for this purpose, in combination with Django as the preferred web framework.</p>
    <p>I've put the entire HTML data of a single webpage into a Django <code>TextField</code>, i.e. for each webpage there is a separate row in the database table and all of the webpage's content is stored in a single column of the table. The problem now is that querying and sorting the data, and especially iterating over it, is very slow. I've crawled around 1000 webpages so far and the database file is already over 2 GB in size. Furthermore, my 4 GB of RAM fill up entirely and my computer (MacBook Pro mid-2009, Core2Duo 2.26 GHz, 500 GB HDD, OS X 10.8) becomes unresponsive. If I restrict the number of items to be pre-cached, then processing the data becomes even slower because the number of queries increases. Since I have to crawl even more data, my current setup doesn't scale well enough.</p>
    <p><strong>My question now is the following:</strong> How can I store my HTML data more efficiently so that querying the data can be done faster? Does it make sense to switch to another RDBMS such as MySQL or PostgreSQL? Or would you recommend a document-oriented database system such as MongoDB? I only have experience with SQLite so far, so I really have no idea what to use. As the deadline for my master's thesis is coming nearer, I don't have the time to try out lots of different setups.</p>
    <p>To help you help me, here are some further requirements:</p>
    <ul> <li>better performance than SQLite when querying large HTML data, without eating up all the memory of my computer (the workload cannot be distributed to other computers)</li> <li>reasonably good integration with Django</li> <li>this is research work only, so it will never run in a production environment but only on my computer (maybe also on my professor's)</li> </ul>
    <p>It would be great if you could help me decide which direction I should take, because I feel somewhat lost with this huge number of possibilities. Thank you very much in advance! :)</p>
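    <p>For reference, the layout described above (one row per webpage, the whole HTML in a single text column) and the pre-caching trade-off can be sketched as follows. This is only a minimal illustration that uses Python's standard <code>sqlite3</code> module instead of the Django ORM so that it is self-contained; the table name <code>webpage</code>, the column names, and the helpers <code>create_schema</code>, <code>store_page</code>, <code>iter_pages</code> and <code>list_urls</code> are hypothetical and not taken from my actual project.</p>

```python
import sqlite3


def create_schema(conn):
    """One row per crawled page; the full HTML sits in a single TEXT column.

    Table and column names are assumptions made for this sketch.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS webpage ("
        " id INTEGER PRIMARY KEY,"
        " url TEXT NOT NULL,"
        " html TEXT NOT NULL)"
    )


def store_page(conn, url, html):
    """Insert one page and return its primary key."""
    cur = conn.execute(
        "INSERT INTO webpage (url, html) VALUES (?, ?)", (url, html)
    )
    return cur.lastrowid


def iter_pages(conn, batch_size=100):
    """Yield (id, url, html) rows while holding at most batch_size rows in memory.

    A smaller batch_size lowers memory use but issues more queries,
    which is the trade-off described in the question.
    """
    last_id = 0
    while True:
        rows = conn.execute(
            "SELECT id, url, html FROM webpage"
            " WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return
        for row in rows:
            yield row
        last_id = rows[-1][0]


def list_urls(conn):
    """Sort on a small column only, so the large html column is not selected."""
    return [row[0] for row in conn.execute("SELECT url FROM webpage ORDER BY url")]
```

    <p>With something like this, <code>for page_id, url, html in iter_pages(conn, batch_size=50)</code> would walk over all stored pages in primary-key order, and <code>list_urls(conn)</code> would sort without selecting the HTML column at all. Whether this is fast enough for several thousand pages is exactly what I am unsure about.</p>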