Python MemoryError - Is there a more efficient way of working with my huge CSV file?

[Using Python 3.3] I have one huge CSV file that contains XX million rows and includes a couple of columns. I want to read that file, add a couple of calculated columns, and spit out a couple of 'segmented' CSV files. I've tried the code below on a smaller test file, and it does exactly what I want it to do. But now that I'm loading the original CSV file (which is about 3.2 GB) I get a memory error. Is there a more memory-efficient way of writing the code below?

Please note that I'm very new to Python, so there is probably a lot of stuff I'm not aware of.

Example input data:

```
email             cc  nr_of_transactions  last_transaction_date  timebucket  total_basket
email1@email.com  us  2                   datetime value         1           20.29
email2@email.com  gb  3                   datetime value         2           50.84
email3@email.com  ca  5                   datetime value         3           119.12
...               ...  ...                ...                    ...         ...
```

This is my code:

```python
import csv
import scipy.stats as stats
import itertools
from operator import itemgetter


def add_rankperc(filename):
    '''
    Calculates the percentile rank of the total basket value of a user
    (i.e. email) within a country, then assigns the user to a rankbucket
    based on that percentile rank, using the following rules:
        percentile rank between 75 and 100 -> top25
        percentile rank between 25 and 74  -> mid50
        percentile rank between 0 and 24   -> bottom25
    '''
    # Defining headers for ease of use/DictReader
    headers = ['email', 'cc', 'nr_transactions', 'last_transaction_date',
               'timebucket', 'total_basket']
    groups = []
    with open(filename, encoding='utf-8', mode='r') as f_in:
        # Input file is tab-separated, hence dialect='excel-tab'
        r = csv.DictReader(f_in, dialect='excel-tab', fieldnames=headers)
        # DictReader reads all values as strings, so convert total_basket to float
        dict_list = []
        for row in r:
            row['total_basket'] = float(row['total_basket'])
            # Append row to a list (of dictionaries) for further processing
            dict_list.append(row)
        # Group by cc after sorting on cc and total_basket
        for key, group in itertools.groupby(
                sorted(dict_list, key=itemgetter('cc', 'total_basket')),
                key=itemgetter('cc')):
            rows = list(group)
            for row in rows:
                # Calculate the percentile rank of each value within its country
                row['rankperc'] = stats.percentileofscore(
                    [row['total_basket'] for row in rows], row['total_basket'])
                # Percentile rank between 75 and 100 -> top25
                if 75 <= row['rankperc'] <= 100:
                    row['rankbucket'] = 'top25'
                # Percentile rank between 25 and 74 -> mid50
                elif 25 <= row['rankperc'] < 75:
                    row['rankbucket'] = 'mid50'
                # Percentile rank between 0 and 24 -> bottom25
                else:
                    row['rankbucket'] = 'bottom25'
                # Collect all rows so they can be returned and used in another function
                groups.append(row)
    return groups


def filter_n_write(data):
    '''
    Groups the input data by the specified keys and writes only the e-mail
    addresses to CSV files, one file per grouping.
    '''
    # Creating a group iterator based on the grouping keys
    for key, group in itertools.groupby(
            sorted(data, key=itemgetter('timebucket', 'rankbucket')),
            key=itemgetter('timebucket', 'rankbucket')):
        # One output file corresponds to each combination of grouping keys
        emails = [row['email'] for row in group]
        # Dynamically name the output file based on the grouping keys
        f_out = 'output-{}-{}.csv'.format(key[0], key[1])
        with open(f_out, encoding='utf-8', mode='w') as fout:
            w = csv.writer(fout, dialect='excel', lineterminator='\n')
            # Write one address per row; wrapping each email in a list puts
            # the full address in a single cell
            w.writerows([email] for email in emails)


filter_n_write(add_rankperc('infile.tsv'))
```

Thanks in advance!
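
One memory-saving direction, sketched below under a few assumptions rather than as a definitive rewrite: the row dictionaries are only kept alive so that percentile ranks can be computed per country, so a two-pass approach can get away with storing a single float per row. The first pass collects and sorts the `total_basket` values per country; the second pass streams the file again, computes each row's rank with `bisect` (this arithmetic matches `scipy.stats.percentileofscore`'s default `kind='rank'` whenever the score is itself one of the list values, which is always the case here), and writes the email straight to the matching output file. The names `stream_rank_and_write`, `percentile_rank`, and `bucket` are illustrative, not part of the original code.

```python
import csv
from array import array
from bisect import bisect_left, bisect_right
from collections import defaultdict

HEADERS = ['email', 'cc', 'nr_transactions', 'last_transaction_date',
           'timebucket', 'total_basket']


def percentile_rank(sorted_vals, x):
    # Matches scipy.stats.percentileofscore(kind='rank') whenever x is
    # itself an element of sorted_vals -- always the case here -- so
    # scipy is no longer needed at all.
    lo = bisect_left(sorted_vals, x)
    hi = bisect_right(sorted_vals, x)
    return 50.0 * (lo + hi + 1) / len(sorted_vals)


def bucket(rank):
    # Same bucketing rules as add_rankperc() in the question
    if rank >= 75:
        return 'top25'
    elif rank >= 25:
        return 'mid50'
    return 'bottom25'


def stream_rank_and_write(filename):
    # Pass 1: keep only one float per row, grouped by country.
    baskets = defaultdict(lambda: array('d'))
    with open(filename, encoding='utf-8') as f_in:
        for row in csv.DictReader(f_in, dialect='excel-tab', fieldnames=HEADERS):
            baskets[row['cc']].append(float(row['total_basket']))
    for cc in baskets:
        # Sort once per country so bisect can be used in pass 2
        baskets[cc] = array('d', sorted(baskets[cc]))

    # Pass 2: stream the file again and route each email straight to its
    # output file; nothing accumulates in memory.
    files, writers = {}, {}
    try:
        with open(filename, encoding='utf-8') as f_in:
            for row in csv.DictReader(f_in, dialect='excel-tab', fieldnames=HEADERS):
                rank = percentile_rank(baskets[row['cc']],
                                       float(row['total_basket']))
                key = (row['timebucket'], bucket(rank))
                if key not in writers:
                    # Open each output file lazily, the first time its
                    # (timebucket, rankbucket) combination appears
                    files[key] = open('output-{}-{}.csv'.format(*key),
                                      mode='w', encoding='utf-8')
                    writers[key] = csv.writer(files[key], dialect='excel',
                                              lineterminator='\n')
                writers[key].writerow([row['email']])
    finally:
        for f in files.values():
            f.close()


stream_rank_and_write('infile.tsv')
```

`array('d')` stores eight bytes per value instead of a full Python float object, so even tens of millions of values stay manageable. If reading the 3.2 GB file twice is too slow, `pandas.read_csv` with `chunksize` is another common route, but the two-pass version above keeps the logic closest to the original.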