Python very large set. How to avoid out of memory exception?
I use a Python set collection to store unique objects. Every object has __hash__ and __eq__ overridden.

The set contains nearly 200,000 objects and takes nearly 4 GB of memory. It works fine on a machine with more than 5 GB, but now I need to run the script on a machine that has only 3 GB of RAM available.

I rewrote the script in C#: it reads the same data from the same source and puts it into the CLR analogue of a set (HashSet), and instead of 4 GB it took about 350 MB, while execution speed stayed roughly the same (about 40 seconds). But I have to use Python.

Q1: Does Python have any "disk-persistent" set or any other workaround? I guess it could keep in memory only the "key" data used in the hash/eq methods, with everything else persisted to disk (a sketch of this idea follows the post). Or maybe there are other workarounds in Python for keeping a unique collection of objects that takes more memory than is available on the system.

Q2: A less practical question: why does a Python set take so much more memory than the C# HashSet?

I use standard Python 2.7.3 on 64-bit Ubuntu 12.10.

Thank you.

Update 1: What the script does:

1. Reads a lot of semi-structured JSON documents (each JSON document consists of a serialized object together with a collection of aggregated objects related to it).

2. Parses each JSON document to retrieve the main object and the objects from the aggregated collections. Every parsed object is stored in a set; the sets are used to keep unique objects only. At first I used a database, but a unique constraint in the database works 100-1000 times slower. Every JSON document is parsed into 1-8 different object types, and each object type is stored in its own set so that only unique objects are kept in memory.

3. All data stored in the sets is saved to a relational database with unique constraints; each set goes to a separate database table.

The whole idea of the script is to take unstructured data, remove duplicates from the aggregated object collections in the JSON documents, and store the structured data in a relational database.

Update 2:

@delnan: I commented out all the lines of code that add objects to the different sets, keeping everything else (getting data, parsing, iterating) the same, and the script took 4 GB less memory.

This means that it is the 200K objects added to the sets that take so much memory. Each object is simple movie data from TMDB: an ID, a list of genres, a list of actors, directors, many other movie details, and possibly a large movie description from Wikipedia.
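Below is a minimal, self-contained sketch of the workaround described in Q1: keep only the small identity keys in an in-memory set for duplicate detection and persist the full objects to disk. The Movie class, its fields, and the movies.shelf filename are simplified assumptions for illustration, not the actual objects from the script; shelve is used here as one possible disk store, and __slots__ is shown because the per-instance __dict__ it removes is one reason a Python object can be much larger than its C# counterpart (Q2).

```python
import shelve

class Movie(object):
    # __slots__ removes the per-instance __dict__; that dict is a large part of
    # the per-object overhead when 200K rich objects sit in a set.
    __slots__ = ('tmdb_id', 'title', 'genres', 'description')

    def __init__(self, tmdb_id, title, genres, description):
        self.tmdb_id = tmdb_id
        self.title = title
        self.genres = genres
        self.description = description

    # Identity is the TMDB id only, so duplicate detection needs just this key.
    def __hash__(self):
        return hash(self.tmdb_id)

    def __eq__(self, other):
        return isinstance(other, Movie) and self.tmdb_id == other.tmdb_id

def dedupe_to_disk(movies, path='movies.shelf'):
    """Keep only small keys in RAM; write each unique full object to disk."""
    seen_keys = set()                          # small integers instead of whole objects
    store = shelve.open(path, protocol=2)      # protocol 2 is needed to pickle __slots__ classes
    try:
        for movie in movies:
            if movie.tmdb_id not in seen_keys:
                seen_keys.add(movie.tmdb_id)
                store[str(movie.tmdb_id)] = movie   # shelve keys must be strings
    finally:
        store.close()
    return seen_keys

if __name__ == '__main__':
    sample = [
        Movie(1, 'Alien', ['horror', 'sci-fi'], 'Long Wikipedia description...'),
        Movie(1, 'Alien', ['horror', 'sci-fi'], 'Long Wikipedia description...'),  # duplicate
        Movie(2, 'Amelie', ['comedy', 'romance'], 'Another long description...'),
    ]
    print(sorted(dedupe_to_disk(sample)))      # -> [1, 2]
```

If the unique constraint really depends on only a handful of fields, the same idea works with a plain set of tuples of those fields, and shelve could be swapped for an sqlite3 table or another key-value store if pickling overhead becomes a problem.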