
Python memory usage? loading large dictionaries in memory

hey all, I have a file on disk that's only 168MB. It's just a comma-separated list of word,id pairs; the word can be 1-5 words long, and there are 6.5 million lines. I created a dictionary in Python to load this into memory so I can search incoming text against that list of words. When Python loads it into memory it shows 1.3GB of RAM used. Any idea why that is?

so let's say my word file looks like this...

<pre><code>1,word1
2,word2
3,word3
</code></pre>

then add 6.5 million lines to that. I then loop through that file and create a dictionary (Python 2.6.1):

<pre><code>import csv
import os

def load_term_cache():
    """Load the term cache from our cached file instead of hitting MySQL.
    If it didn't preload into memory it would be 20+ million queries
    per process."""
    global cached_terms
    dumpfile = os.path.join(os.getenv("MY_PATH"), 'datafiles', 'baseterms.txt')
    f = open(dumpfile)
    cache = csv.reader(f)
    for term_id, term in cache:
        cached_terms[term] = term_id
    f.close()
</code></pre>

Just doing that blows up the memory. I watch Activity Monitor and it pegs all available memory, up to around 1.5GB of RAM; on my laptop it just starts to swap. Any ideas how to most efficiently store key/value pairs in memory with Python?

Thanks

UPDATE: I tried to use the anydbm module and after 4.4 million records it just dies. The floating point number is the elapsed seconds since I started the load:

<pre><code>56.95
3400018
60.12
3600019
63.27
3800020
66.43
4000021
69.59
4200022
72.75
4400023
83.42
4600024
168.61
4800025
338.57
</code></pre>

You can see it was running great: 200,000 rows inserted every few seconds, until I hit a wall and the time doubled.

<pre><code>import anydbm
import os
import time

i = 0
mark = 0
starttime = time.time()
dbfile = os.path.join(os.getenv("MY_PATH"), 'datafiles', 'baseterms')
db = anydbm.open(dbfile, 'c')
# load from the existing baseterms file
termfile = os.path.join(os.getenv("MY_PATH"), 'datafiles', 'baseterms.txt.LARGE')
for line in open(termfile):
    i += 1
    # strip the trailing newline so it doesn't end up in the key
    pieces = line.strip().split(',')
    db[str(pieces[1])] = str(pieces[0])
    if i > mark:
        print i
        print round(time.time() - starttime, 2)
        mark = i + 200000
db.close()
</code></pre>
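
On the original question of where the 1.3GB goes: every key string, every value string, and every slot in the hash table is a full Python object with its own header, so 168MB of raw text easily grows past 1GB once loaded. A minimal sketch of how to measure this with sys.getsizeof (available from Python 2.6; the sample strings are made up and the byte counts in the comments are rough 64-bit CPython figures):

<pre><code>import sys

# hypothetical sample entry, shaped like a row in baseterms.txt
term = 'some multi word term'   # 20 characters of actual data
term_id = '1234567'             # the id, stored as a string

print sys.getsizeof(term)     # ~57 bytes, not 20: the object header is included
print sys.getsizeof(term_id)  # ~44 bytes for 7 characters of data
print sys.getsizeof({})       # even an empty dict costs a few hundred bytes

# Roughly 100 bytes of string objects per entry times 6.5 million entries
# is ~650MB, and a dict that size resizes its table to on the order of
# 2**24 slots at ~24 bytes each (hash plus key and value pointers),
# which is another ~400MB: close to the 1.3GB reported above.
</code></pre>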
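
If the table has to stay in memory, one leaner layout (a sketch, not something from the post) is to drop the dict entirely: keep the terms in one sorted list and the ids in a parallel array of machine ints, then binary-search with the stdlib bisect module. The term strings still cost what they cost, but the hash-table slots and the 6.5 million per-id string objects go away:

<pre><code>import bisect
import csv
import os
from array import array

dumpfile = os.path.join(os.getenv("MY_PATH"), 'datafiles', 'baseterms.txt')

f = open(dumpfile, 'rb')
# read (term, id) pairs and sort them once by term; this temporarily
# needs room for the pair list, so do it before anything else is resident
pairs = sorted((term, int(term_id)) for term_id, term in csv.reader(f))
f.close()

terms = [term for term, term_id in pairs]
ids = array('l', (term_id for term, term_id in pairs))  # 4-8 bytes per id
del pairs

def lookup(term):
    """Return the id for term, or None if the term isn't in the list."""
    i = bisect.bisect_left(terms, term)
    if i < len(terms) and terms[i] == term:
        return ids[i]
    return None

print lookup('word1')
</code></pre>

Lookups become O(log n) instead of O(1), but for 6.5 million entries that is only about 23 string comparisons per query.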
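
As for the anydbm wall in the update: another on-disk option (again a sketch, with the paths copied from the post) is the stdlib sqlite3 module, available since Python 2.5. Loading inside a single transaction avoids per-record sync cost, and the primary-key index serves lookups without holding anything in RAM:

<pre><code>import csv
import os
import sqlite3

dbfile = os.path.join(os.getenv("MY_PATH"), 'datafiles', 'baseterms.db')
termfile = os.path.join(os.getenv("MY_PATH"), 'datafiles', 'baseterms.txt')

db = sqlite3.connect(dbfile)
db.execute('CREATE TABLE IF NOT EXISTS terms (id INTEGER, term TEXT PRIMARY KEY)')

f = open(termfile, 'rb')
# one big transaction: executemany streams the rows without building a list
db.executemany('INSERT OR REPLACE INTO terms (id, term) VALUES (?, ?)',
               ((int(term_id), term) for term_id, term in csv.reader(f)))
db.commit()
f.close()

# a lookup goes through the primary-key index instead of an in-memory dict
row = db.execute('SELECT id FROM terms WHERE term = ?', ('word1',)).fetchone()
print row[0] if row else None
db.close()
</code></pre>

Whether the insert pace holds flat all the way to 6.5 million rows would need testing, but unlike the dbm hash file, SQLite's B-tree is built to handle tables much larger than RAM.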
 
