Make sure you have pandas 0.11, read these docs: http://pandas.pydata.org/pandas-docs/dev/io.html#hdf5-pytables, and these recipes: http://pandas.pydata.org/pandas-docs/dev/cookbook.html#hdfstore (especially "merging on millions of rows").

Here is a solution that seems to work. The workflow is:

1. Read data from your CSV in chunks and append it to an HDFStore.
2. Iterate over that store, creating another store that applies the combiner.

Essentially we take a chunk from the table and combine it with a chunk from every other part of the file. The combiner function does not reduce; instead it calculates your function (the diff in days) between all elements in that chunk, eliminating duplicates as it goes and taking the latest data after each loop. It is almost like a recursive reduce.

This should be O(num_of_chunks**2) in memory and calculation time; chunksize could be, say, 1m rows (or more) in your case.

Output:

```
processing [0] [datastore.h5]
processing [1] [datastore_0.h5]
    count                date  diff                        email
4       1 2011-06-24 00:00:00     0           0000.ANU@GMAIL.COM
1       1 2011-06-24 00:00:00     0          00000.POO@GMAIL.COM
0       1 2010-07-26 00:00:00     0           00000000@11111.COM
2       1 2013-01-01 00:00:00     0         0000650000@YAHOO.COM
3       1 2013-01-26 00:00:00     0       00009.GAURAV@GMAIL.COM
5       1 2011-10-29 00:00:00     0          0000MANNU@GMAIL.COM
6       1 2011-11-21 00:00:00     0    0000PRANNOY0000@GMAIL.COM
7       1 2011-06-26 00:00:00     0  0000PRANNOY0000@YAHOO.CO.IN
8       1 2012-10-25 00:00:00     0          0000RAHUL@GMAIL.COM
9       1 2011-05-10 00:00:00     0            0000SS0@GMAIL.COM
12      1 2010-12-09 00:00:00     0         0001HARISH@GMAIL.COM
11      2 2010-12-12 00:00:00     3         0001HARISH@GMAIL.COM
10      3 2010-12-22 00:00:00    13         0001HARISH@GMAIL.COM
14      1 2012-11-28 00:00:00     0           000AYUSH@GMAIL.COM
15      2 2012-11-29 00:00:00     1           000AYUSH@GMAIL.COM
17      3 2012-12-08 00:00:00    10           000AYUSH@GMAIL.COM
18      4 2012-12-12 00:00:00    14           000AYUSH@GMAIL.COM
13      5 2013-01-25 00:00:00    58           000AYUSH@GMAIL.COM
```

Code:

```python
import pandas as pd
import StringIO
import numpy as np
from time import strptime
from datetime import datetime

# your data
data = """
"DF","00000000@11111.COM","FLTINT1000130394756","26JUL2010","B2C","6799.2"
"Rail","00000.POO@GMAIL.COM","NR251764697478","24JUN2011","B2C","2025"
"DF","0000650000@YAHOO.COM","NF2513521438550","01JAN2013","B2C","6792"
"Bus","00009.GAURAV@GMAIL.COM","NU27012932319739","26JAN2013","B2C","800"
"Rail","0000.ANU@GMAIL.COM","NR251764697526","24JUN2011","B2C","595"
"Rail","0000MANNU@GMAIL.COM","NR251277005737","29OCT2011","B2C","957"
"Rail","0000PRANNOY0000@GMAIL.COM","NR251297862893","21NOV2011","B2C","212"
"DF","0000PRANNOY0000@YAHOO.CO.IN","NF251327485543","26JUN2011","B2C","17080"
"Rail","0000RAHUL@GMAIL.COM","NR2512012069809","25OCT2012","B2C","5731"
"DF","0000SS0@GMAIL.COM","NF251355775967","10MAY2011","B2C","2000"
"DF","0001HARISH@GMAIL.COM","NF251352240086","22DEC2010","B2C","4006"
"DF","0001HARISH@GMAIL.COM","NF251742087846","12DEC2010","B2C","1000"
"DF","0001HARISH@GMAIL.COM","NF252022031180","09DEC2010","B2C","3439"
"Rail","000AYUSH@GMAIL.COM","NR2151120122283","25JAN2013","B2C","136"
"Rail","000AYUSH@GMAIL.COM","NR2151213260036","28NOV2012","B2C","41"
"Rail","000AYUSH@GMAIL.COM","NR2151313264432","29NOV2012","B2C","96"
"Rail","000AYUSH@GMAIL.COM","NR2151413266728","29NOV2012","B2C","96"
"Rail","000AYUSH@GMAIL.COM","NR2512912359037","08DEC2012","B2C","96"
"Rail","000AYUSH@GMAIL.COM","NR2517612385569","12DEC2012","B2C","96"
"""

# read in and create the store
data_store_file = 'datastore.h5'
store = pd.HDFStore(data_store_file,'w')

def dp(x, **kwargs):
    return [ datetime(*strptime(v,'%d%b%Y')[0:3]) for v in x ]

chunksize = 5
reader = pd.read_csv(StringIO.StringIO(data),names=['x1','email','x2','date','x3','x4'],
                     header=0,usecols=['email','date'],parse_dates=['date'],
                     date_parser=dp,chunksize=chunksize)

for i, chunk in enumerate(reader):
    # create the global index, and keep it in the frame too
    chunk['indexer'] = chunk.index + i*chunksize
    df = chunk.set_index('indexer')

    # need to set a minimum size for the email column
    store.append('data',df,min_itemsize={'email' : 100})

store.close()

# define the combiner function
def combiner(x):
    # given a group of rows with the same email, return a combination
    # with the new data

    # sort by the date
    y = x.sort('date')

    # calc the diff in days (an integer)
    y['diff'] = (y['date']-y['date'].iloc[0]).apply(lambda d: float(d.item().days))

    y['count'] = pd.Series(range(1,len(y)+1),index=y.index,dtype='float64')

    return y

# reduce the store (and create a new one by chunks)
in_store_file = data_store_file
in_store1 = pd.HDFStore(in_store_file)

# iter on store 1
for chunki, df1 in enumerate(in_store1.select('data',chunksize=2*chunksize)):
    print "processing [%s] [%s]" % (chunki,in_store_file)

    out_store_file = 'datastore_%s.h5' % chunki
    out_store = pd.HDFStore(out_store_file,'w')

    # iter on store 2
    in_store2 = pd.HDFStore(in_store_file)
    for df2 in in_store2.select('data',chunksize=chunksize):

        # concat & drop dups
        df = pd.concat([df1,df2]).drop_duplicates(['email','date'])

        # group and combine
        result = df.groupby('email').apply(combiner)

        # remove the multi-index (that we created in the groupby)
        result = result.reset_index('email',drop=True)

        # only store those rows which are in df2!
        result = result.reindex(index=df2.index).dropna()

        # store to the out_store
        out_store.append('data',result,min_itemsize={'email' : 100})

    in_store2.close()
    out_store.close()
    in_store_file = out_store_file

in_store1.close()

# show the reduced store
print pd.read_hdf(out_store_file,'data').sort(['email','diff'])
```