Loading a large dictionary using python pickle
<p>I have a full inverted index in the form of a nested Python dictionary. Its structure is:</p>
<pre><code>{word : { doc_name : [location_list] } }
</code></pre>
<p>For example, if the dictionary is called index, then the entry for the word "spam" would look like:</p>
<pre><code>{ 'spam' : { 'doc1.txt' : [102,300,399], 'doc5.txt' : [200,587] } }
</code></pre>
<p>I used this structure because Python dicts are well optimised and it makes programming easier.</p>
<p>For any word 'spam', the documents containing it are given by:</p>
<pre><code>index['spam'].keys()
</code></pre>
<p>and the posting list for a document doc1.txt by:</p>
<pre><code>index['spam']['doc1.txt']
</code></pre>
<p>At present I am using cPickle to store and load this dictionary. But the pickled file is around 380 MB and takes a long time to load: about 112 seconds (timed using <em>time.time()</em>), and memory usage goes up to 1.2 GB (according to Gnome System Monitor). Once it loads, it's fine. I have 4 GB of RAM.</p>
<p><code>len(index.keys())</code> gives 229758</p>
<h2>Code</h2>
<pre><code>import cPickle as pickle

f = open('full_index','rb')
print 'Loading index... please wait...'
index = pickle.load(f)  # This takes ages
print 'Index loaded. You may now proceed to search'
</code></pre>
<p><strong>How can I make it load faster?</strong> I only need to load it once, when the application starts. After that, access time is what matters for responding to queries.</p>
<p>Should I switch to a database like SQLite and create an index on its keys? If yes, how do I store the values to get an equivalent schema that makes retrieval easy? Is there anything else that I should look into?</p>
<h2>Addendum</h2>
<p>Using Tim's answer, <code>pickle.dump(index, file, -1)</code>, the pickled file is considerably smaller, around 237 MB (it took 300 seconds to dump), and it now takes about half the time to load (61 seconds, as opposed to 112 s earlier, timed with <em>time.time()</em>).</p>
<p>But should I migrate to a database for scalability?</p>
<p>For now I am marking Tim's answer as accepted.</p>
<p>PS: I don't want to use Lucene or Xapian. This question refers to <a href="https://stackoverflow.com/questions/3687715/storing-an-inverted-index">Storing an inverted index</a>. I had to ask a new question because I wasn't able to delete the previous one.</p>
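<p>For reference, here is a minimal sketch of the binary-protocol dump and load described in the Addendum. It assumes Python 3, where the plain <code>pickle</code> module already uses the C accelerator that <code>cPickle</code> provided in Python 2. The toy index, the temporary file path and the helper names <code>dump_index</code> and <code>load_index</code> are illustrative assumptions, not part of the original code.</p>

```python
import os
import pickle
import tempfile

# Toy inverted index in the question's shape: {word: {doc_name: [locations]}}
index = {
    "spam": {"doc1.txt": [102, 300, 399], "doc5.txt": [200, 587]},
    "eggs": {"doc1.txt": [7]},
}


def dump_index(index, path):
    """Write the index with the highest (binary) pickle protocol."""
    # Passing -1 as the protocol selects pickle.HIGHEST_PROTOCOL as well.
    with open(path, "wb") as f:
        pickle.dump(index, f, pickle.HIGHEST_PROTOCOL)


def load_index(path):
    """Read the index back; the file must be opened in binary mode."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Hypothetical location for the pickled file (a fresh temp directory).
path = os.path.join(tempfile.mkdtemp(), "full_index")
dump_index(index, path)
loaded = load_index(path)

# Same lookups as in the question.
docs = sorted(loaded["spam"].keys())   # documents containing 'spam'
postings = loaded["spam"]["doc1.txt"]  # posting list for one document
```

<p>The sketch only shows the call pattern. The actual size and load-time gains depend on the data, so they should be measured on the real index rather than inferred from this toy example.</p>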