When does Python allocate new memory for identical strings?
<p>Two Python strings with the same characters, a == b, may share memory, id(a) == id(b), or may be in memory twice, id(a) != id(b). Try</p>

<pre><code>ab = "ab"
print id( ab ), id( "a"+"b" )
</code></pre>

<p>Here Python recognizes that the newly created "a"+"b" is the same as the "ab" already in memory -- not bad.</p>

<p>Now consider an N-long list of state names [ "Arizona", "Alaska", "Alaska", "California" ... ] (N ~ 500000 in my case).<br>
I see 50 different id()s &rArr; each string "Arizona" ... is stored only once, fine.<br>
BUT write the list to disk and read it back in again: the "same" list now has N different id()s, way more memory, see below.</p>

<p>How come -- can anyone explain Python string memory allocation?</p>

<pre><code>""" when does Python allocate new memory for identical strings ?
    ab = "ab"
    print id( ab ), id( "a"+"b" )  # same !
    list of N names from 50 states: 50 ids, mem ~ 4N + 50S, each string once
    but list &gt; file &gt; mem again: N ids, mem ~ N * (4 + S)
"""

from __future__ import division
from collections import defaultdict
from copy import copy
import cPickle
import random
import sys

states = dict(
    AL = "Alabama",
    AK = "Alaska",
    AZ = "Arizona",
    AR = "Arkansas",
    CA = "California",
    CO = "Colorado",
    CT = "Connecticut",
    DE = "Delaware",
    FL = "Florida",
    GA = "Georgia",
)

def nid(alist):
    """ nr distinct ids """
    return "%d ids  %d pickle len" % (
        len( set( map( id, alist ))),
        len( cPickle.dumps( alist, 0 )))  # rough est ?
# cf http://stackoverflow.com/questions/2117255/python-deep-getsizeof-list-with-contents

N = 10000
exec( "\n".join( sys.argv[1:] ))  # var=val ...
random.seed(1)

# big list of random names of states --
names = []
for j in xrange(N):
    name = copy( random.choice( states.values() ))
    names.append(name)
print "%d strings in mem:  %s" % (N, nid(names) )  # 10 ids, even with copy()

# list to a file, back again -- each string is allocated anew
joinsplit = "\n".join(names).split()  # same as &gt; file &gt; mem again
assert joinsplit == names
print "%d strings from a file:  %s" % (N, nid(joinsplit) )

# 10000 strings in mem:  10 ids  42149 pickle len
# 10000 strings from a file:  10000 ids  188080 pickle len
# Python 2.6.4 mac ppc
</code></pre>

<p>Added 25jan:<br>
There are two kinds of strings in Python memory (or any program's):</p>

<ul>
<li>Ustrings, in a Ucache of unique strings: these save memory, and make a == b fast if both are in Ucache</li>
<li>Ostrings, the others, which may be stored any number of times.</li>
</ul>

<p><code>intern(astring)</code> puts astring in the Ucache (Alex +1); other than that we know nothing at all about how Python moves Ostrings to the Ucache -- how did "a"+"b" get in, after "ab"? ("Strings from files" is meaningless -- there's no way of knowing.)<br>
In short, Ucaches (there may be several) remain murky.</p>

<p>A historical footnote: <a href="http://en.wikipedia.org/wiki/SPITBOL_compiler" rel="noreferrer">SPITBOL</a> uniquified all strings ca. 1970.</p>
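<p>The effect of <code>intern</code> on a re-read list can be sketched as follows. This is only an illustration, not a statement of how Python manages strings internally. Assumptions: it uses Python 3, where the built-in is spelled <code>sys.intern</code>; the names <code>STATE_NAMES</code> and <code>distinct_ids</code> are made up for this sketch; and the id counts are what I would expect on CPython, since other implementations may treat <code>id()</code> differently.</p>

```python
import sys

# Illustrative data (assumption: same ten names as in the script above).
STATE_NAMES = ["Alabama", "Alaska", "Arizona", "Arkansas", "California",
               "Colorado", "Connecticut", "Delaware", "Florida", "Georgia"]


def distinct_ids(strings):
    """Count the distinct string objects referenced by the list."""
    return len(set(map(id, strings)))


N = 10000
# Every entry is a reference to one of the ten objects in STATE_NAMES.
names = [STATE_NAMES[j % len(STATE_NAMES)] for j in range(N)]

# Stand-in for "list > file > mem again": split() builds a new
# string object for each token, even when the characters repeat.
reread = "\n".join(names).split()
assert reread == names

# sys.intern returns one shared object per distinct value, so the
# duplicates created by split() can be garbage collected afterwards.
interned = [sys.intern(s) for s in reread]

print("in memory: %d ids" % distinct_ids(names))
print("re-read:   %d ids" % distinct_ids(reread))
print("interned:  %d ids" % distinct_ids(interned))
```

<p>Interning while reading, for example <code>sys.intern(line.strip())</code> per line, would avoid ever holding N separate copies. That is a design choice for this particular data shape (few distinct values, many repeats), not a general recommendation.</p>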