Python MemoryError - Is there a more efficient way of working with my huge CSV file?

[Using Python 3.3] I have one huge CSV file that contains XX million rows and includes a couple of columns. I want to read that file, add a couple of calculated columns, and spit out a couple of 'segmented' CSV files. I've tried the code below on a smaller test file, and it does exactly what I want it to do. But now that I'm loading the original CSV file (which is about 3.2 GB) I get a memory error. Is there a more memory-efficient way of writing the code below?

Please note that I'm very new to Python, so there is probably a lot of stuff I'm not aware of.

Example input data:

```
email             cc  nr_of_transactions  last_transaction_date  timebucket  total_basket
email1@email.com  us  2                   datetime value         1           20.29
email2@email.com  gb  3                   datetime value         2           50.84
email3@email.com  ca  5                   datetime value         3           119.12
...               ...  ...                ...                    ...         ...
```

This is my code:

```python
import csv
import scipy.stats as stats
import itertools
from operator import itemgetter


def add_rankperc(filename):
    '''
    Calculates the percentile rank of the total basket value of a user
    (i.e. email) within a country, then assigns the user to a rankbucket
    based on that percentile rank, using the following rules:
        percentile rank between 75 and 100 -> top25
        percentile rank between 25 and 74  -> mid50
        percentile rank between 0 and 24   -> bottom25
    '''
    # Defining headers for ease of use/DictReader
    headers = ['email', 'cc', 'nr_transactions', 'last_transaction_date',
               'timebucket', 'total_basket']
    groups = []
    with open(filename, encoding='utf-8', mode='r') as f_in:
        # Input file is tab-separated, hence dialect='excel-tab'
        r = csv.DictReader(f_in, dialect='excel-tab', fieldnames=headers)
        # DictReader reads all values as strings, so convert total_basket to float
        dict_list = []
        for row in r:
            row['total_basket'] = float(row['total_basket'])
            # Append row to a list (of dictionaries) for further processing
            dict_list.append(row)
        # Group by cc after sorting on cc and total_basket
        for key, group in itertools.groupby(
                sorted(dict_list, key=itemgetter('cc', 'total_basket')),
                key=itemgetter('cc')):
            rows = list(group)
            for row in rows:
                # Calculate the percentile rank of each value within its country
                row['rankperc'] = stats.percentileofscore(
                    [row['total_basket'] for row in rows], row['total_basket'])
                # Percentile rank between 75 and 100 -> top25
                if 75 <= row['rankperc'] <= 100:
                    row['rankbucket'] = 'top25'
                # Percentile rank between 25 and 74 -> mid50
                elif 25 <= row['rankperc'] < 75:
                    row['rankbucket'] = 'mid50'
                # Percentile rank between 0 and 24 -> bottom25
                else:
                    row['rankbucket'] = 'bottom25'
                # Collect all rows so they can be returned and used in another function
                groups.append(row)
    return groups


def filter_n_write(data):
    '''
    Groups the input data by the specified keys and writes only the e-mail
    addresses to CSV files, one file per grouping.
    '''
    # Creating a group iterator based on the grouping keys
    for key, group in itertools.groupby(
            sorted(data, key=itemgetter('timebucket', 'rankbucket')),
            key=itemgetter('timebucket', 'rankbucket')):
        # One output file corresponds to each combination of grouping keys
        emails = [row['email'] for row in group]
        # Dynamically name the output file based on the grouping keys
        f_out = 'output-{}-{}.csv'.format(key[0], key[1])
        with open(f_out, encoding='utf-8', mode='w') as fout:
            w = csv.writer(fout, dialect='excel', lineterminator='\n')
            # Write one address per row; wrapping each email in a list puts
            # the full address in a single cell
            w.writerows([email] for email in emails)


filter_n_write(add_rankperc('infile.tsv'))
```

Thanks in advance!
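
One memory-saving direction, sketched below under a few assumptions rather than as a definitive rewrite: the row dictionaries are only kept alive so that percentile ranks can be computed per country, so a two-pass approach can get away with storing a single float per row. The first pass collects and sorts the `total_basket` values per country; the second pass streams the file again, computes each row's rank with `bisect` (this arithmetic matches `scipy.stats.percentileofscore`'s default `kind='rank'` whenever the score is itself one of the list values, which is always the case here), and writes the email straight to the matching output file. The names `stream_rank_and_write`, `percentile_rank`, and `bucket` are illustrative, not part of the original code.

```python
import csv
from array import array
from bisect import bisect_left, bisect_right
from collections import defaultdict

HEADERS = ['email', 'cc', 'nr_transactions', 'last_transaction_date',
           'timebucket', 'total_basket']


def percentile_rank(sorted_vals, x):
    # Matches scipy.stats.percentileofscore(kind='rank') whenever x is
    # itself an element of sorted_vals -- always the case here -- so
    # scipy is no longer needed at all.
    lo = bisect_left(sorted_vals, x)
    hi = bisect_right(sorted_vals, x)
    return 50.0 * (lo + hi + 1) / len(sorted_vals)


def bucket(rank):
    # Same bucketing rules as add_rankperc() in the question
    if rank >= 75:
        return 'top25'
    elif rank >= 25:
        return 'mid50'
    return 'bottom25'


def stream_rank_and_write(filename):
    # Pass 1: keep only one float per row, grouped by country.
    baskets = defaultdict(lambda: array('d'))
    with open(filename, encoding='utf-8') as f_in:
        for row in csv.DictReader(f_in, dialect='excel-tab', fieldnames=HEADERS):
            baskets[row['cc']].append(float(row['total_basket']))
    for cc in baskets:
        # Sort once per country so bisect can be used in pass 2
        baskets[cc] = array('d', sorted(baskets[cc]))

    # Pass 2: stream the file again and route each email straight to its
    # output file; nothing accumulates in memory.
    files, writers = {}, {}
    try:
        with open(filename, encoding='utf-8') as f_in:
            for row in csv.DictReader(f_in, dialect='excel-tab', fieldnames=HEADERS):
                rank = percentile_rank(baskets[row['cc']],
                                       float(row['total_basket']))
                key = (row['timebucket'], bucket(rank))
                if key not in writers:
                    # Open each output file lazily, the first time its
                    # (timebucket, rankbucket) combination appears
                    files[key] = open('output-{}-{}.csv'.format(*key),
                                      mode='w', encoding='utf-8')
                    writers[key] = csv.writer(files[key], dialect='excel',
                                              lineterminator='\n')
                writers[key].writerow([row['email']])
    finally:
        for f in files.values():
            f.close()


stream_rank_and_write('infile.tsv')
```

`array('d')` stores eight bytes per value instead of a full Python float object, so even tens of millions of values stay manageable. If reading the 3.2 GB file twice is too slow, `pandas.read_csv` with `chunksize` is another common route, but the two-pass version above keeps the logic closest to the original.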