Python very large set. How to avoid out of memory exception?
I use a Python set collection to store unique objects. Every object has __hash__ and __eq__ overridden.

The set contains nearly 200,000 objects and takes nearly 4 GB of memory. It works fine on a machine with more than 5 GB, but now I need to run the script on a machine that has only 3 GB of RAM available.

I rewrote the script in C#: it reads the same data from the same source and puts it into the CLR analogue of a set (HashSet), and instead of 4 GB it took about 350 MB, while execution speed stayed roughly the same (about 40 seconds). But I have to use Python.

Q1: Does Python have any "disk-persistent" set or any other workaround? I guess it could keep in memory only the "key" data used in the hash/eq methods, with everything else persisted to disk (a sketch of this idea follows the post). Or maybe there are other workarounds in Python for keeping a unique collection of objects that takes more memory than is available on the system.

Q2: A less practical question: why does a Python set take so much more memory than the C# HashSet?

I use standard Python 2.7.3 on 64-bit Ubuntu 12.10.

Thank you.

Update 1: What the script does:

1. Reads a lot of semi-structured JSON documents (each JSON document consists of a serialized object together with a collection of aggregated objects related to it).

2. Parses each JSON document to retrieve the main object and the objects from the aggregated collections. Every parsed object is stored in a set; the sets are used to keep unique objects only. At first I used a database, but a unique constraint in the database works 100-1000 times slower. Every JSON document is parsed into 1-8 different object types, and each object type is stored in its own set so that only unique objects are kept in memory.

3. All data stored in the sets is saved to a relational database with unique constraints; each set goes to a separate database table.

The whole idea of the script is to take unstructured data, remove duplicates from the aggregated object collections in the JSON documents, and store the structured data in a relational database.

Update 2:

@delnan: I commented out all the lines of code that add objects to the different sets, keeping everything else (getting data, parsing, iterating) the same, and the script took 4 GB less memory.

This means that it is the 200K objects added to the sets that take so much memory. Each object is simple movie data from TMDB: an ID, a list of genres, a list of actors, directors, many other movie details, and possibly a large movie description from Wikipedia.
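Below is a minimal, self-contained sketch of the workaround described in Q1: keep only the small identity keys in an in-memory set for duplicate detection and persist the full objects to disk. The Movie class, its fields, and the movies.shelf filename are simplified assumptions for illustration, not the actual objects from the script; shelve is used here as one possible disk store, and __slots__ is shown because the per-instance __dict__ it removes is one reason a Python object can be much larger than its C# counterpart (Q2).

```python
import shelve

class Movie(object):
    # __slots__ removes the per-instance __dict__; that dict is a large part of
    # the per-object overhead when 200K rich objects sit in a set.
    __slots__ = ('tmdb_id', 'title', 'genres', 'description')

    def __init__(self, tmdb_id, title, genres, description):
        self.tmdb_id = tmdb_id
        self.title = title
        self.genres = genres
        self.description = description

    # Identity is the TMDB id only, so duplicate detection needs just this key.
    def __hash__(self):
        return hash(self.tmdb_id)

    def __eq__(self, other):
        return isinstance(other, Movie) and self.tmdb_id == other.tmdb_id

def dedupe_to_disk(movies, path='movies.shelf'):
    """Keep only small keys in RAM; write each unique full object to disk."""
    seen_keys = set()                          # small integers instead of whole objects
    store = shelve.open(path, protocol=2)      # protocol 2 is needed to pickle __slots__ classes
    try:
        for movie in movies:
            if movie.tmdb_id not in seen_keys:
                seen_keys.add(movie.tmdb_id)
                store[str(movie.tmdb_id)] = movie   # shelve keys must be strings
    finally:
        store.close()
    return seen_keys

if __name__ == '__main__':
    sample = [
        Movie(1, 'Alien', ['horror', 'sci-fi'], 'Long Wikipedia description...'),
        Movie(1, 'Alien', ['horror', 'sci-fi'], 'Long Wikipedia description...'),  # duplicate
        Movie(2, 'Amelie', ['comedy', 'romance'], 'Another long description...'),
    ]
    print(sorted(dedupe_to_disk(sample)))      # -> [1, 2]
```

If the unique constraint really depends on only a handful of fields, the same idea works with a plain set of tuples of those fields, and shelve could be swapped for an sqlite3 table or another key-value store if pickling overhead becomes a problem.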