Make sure you have pandas 0.11, read these docs: http://pandas.pydata.org/pandas-docs/dev/io.html#hdf5-pytables, and these recipes: http://pandas.pydata.org/pandas-docs/dev/cookbook.html#hdfstore (especially "merging on millions of rows").

Here is a solution that seems to work. The workflow is:

1. Read data from your CSV in chunks and append it to an HDFStore.
2. Iterate over that store, creating another store that applies the combiner.

Essentially we take a chunk from the table and combine it with a chunk from every other part of the file. The combiner function does not reduce; instead it calculates your function (the diff in days) between all elements in that chunk, eliminating duplicates as it goes and taking the latest data after each loop. It is almost like a recursive reduce.

This should be O(num_of_chunks**2) in memory and calculation time; chunksize could be, say, 1m rows (or more) in your case.

Output:

```
processing [0] [datastore.h5]
processing [1] [datastore_0.h5]
    count                date  diff                        email
4       1 2011-06-24 00:00:00     0           0000.ANU@GMAIL.COM
1       1 2011-06-24 00:00:00     0          00000.POO@GMAIL.COM
0       1 2010-07-26 00:00:00     0           00000000@11111.COM
2       1 2013-01-01 00:00:00     0         0000650000@YAHOO.COM
3       1 2013-01-26 00:00:00     0       00009.GAURAV@GMAIL.COM
5       1 2011-10-29 00:00:00     0          0000MANNU@GMAIL.COM
6       1 2011-11-21 00:00:00     0    0000PRANNOY0000@GMAIL.COM
7       1 2011-06-26 00:00:00     0  0000PRANNOY0000@YAHOO.CO.IN
8       1 2012-10-25 00:00:00     0          0000RAHUL@GMAIL.COM
9       1 2011-05-10 00:00:00     0            0000SS0@GMAIL.COM
12      1 2010-12-09 00:00:00     0         0001HARISH@GMAIL.COM
11      2 2010-12-12 00:00:00     3         0001HARISH@GMAIL.COM
10      3 2010-12-22 00:00:00    13         0001HARISH@GMAIL.COM
14      1 2012-11-28 00:00:00     0           000AYUSH@GMAIL.COM
15      2 2012-11-29 00:00:00     1           000AYUSH@GMAIL.COM
17      3 2012-12-08 00:00:00    10           000AYUSH@GMAIL.COM
18      4 2012-12-12 00:00:00    14           000AYUSH@GMAIL.COM
13      5 2013-01-25 00:00:00    58           000AYUSH@GMAIL.COM
```

Code:

```python
import pandas as pd
import StringIO
import numpy as np
from time import strptime
from datetime import datetime

# your data
data = """
"DF","00000000@11111.COM","FLTINT1000130394756","26JUL2010","B2C","6799.2"
"Rail","00000.POO@GMAIL.COM","NR251764697478","24JUN2011","B2C","2025"
"DF","0000650000@YAHOO.COM","NF2513521438550","01JAN2013","B2C","6792"
"Bus","00009.GAURAV@GMAIL.COM","NU27012932319739","26JAN2013","B2C","800"
"Rail","0000.ANU@GMAIL.COM","NR251764697526","24JUN2011","B2C","595"
"Rail","0000MANNU@GMAIL.COM","NR251277005737","29OCT2011","B2C","957"
"Rail","0000PRANNOY0000@GMAIL.COM","NR251297862893","21NOV2011","B2C","212"
"DF","0000PRANNOY0000@YAHOO.CO.IN","NF251327485543","26JUN2011","B2C","17080"
"Rail","0000RAHUL@GMAIL.COM","NR2512012069809","25OCT2012","B2C","5731"
"DF","0000SS0@GMAIL.COM","NF251355775967","10MAY2011","B2C","2000"
"DF","0001HARISH@GMAIL.COM","NF251352240086","22DEC2010","B2C","4006"
"DF","0001HARISH@GMAIL.COM","NF251742087846","12DEC2010","B2C","1000"
"DF","0001HARISH@GMAIL.COM","NF252022031180","09DEC2010","B2C","3439"
"Rail","000AYUSH@GMAIL.COM","NR2151120122283","25JAN2013","B2C","136"
"Rail","000AYUSH@GMAIL.COM","NR2151213260036","28NOV2012","B2C","41"
"Rail","000AYUSH@GMAIL.COM","NR2151313264432","29NOV2012","B2C","96"
"Rail","000AYUSH@GMAIL.COM","NR2151413266728","29NOV2012","B2C","96"
"Rail","000AYUSH@GMAIL.COM","NR2512912359037","08DEC2012","B2C","96"
"Rail","000AYUSH@GMAIL.COM","NR2517612385569","12DEC2012","B2C","96"
"""

# read in and create the store
data_store_file = 'datastore.h5'
store = pd.HDFStore(data_store_file,'w')

def dp(x, **kwargs):
    return [ datetime(*strptime(v,'%d%b%Y')[0:3]) for v in x ]

chunksize = 5
reader = pd.read_csv(StringIO.StringIO(data),names=['x1','email','x2','date','x3','x4'],
                     header=0,usecols=['email','date'],parse_dates=['date'],
                     date_parser=dp,chunksize=chunksize)

for i, chunk in enumerate(reader):
    # create the global index, and keep it in the frame too
    chunk['indexer'] = chunk.index + i*chunksize
    df = chunk.set_index('indexer')

    # need to set a minimum size for the email column
    store.append('data',df,min_itemsize={'email' : 100})

store.close()

# define the combiner function
def combiner(x):
    # given a group of rows with the same email, return a combination
    # with the new data

    # sort by the date
    y = x.sort('date')

    # calc the diff in days (an integer)
    y['diff'] = (y['date']-y['date'].iloc[0]).apply(lambda d: float(d.item().days))

    y['count'] = pd.Series(range(1,len(y)+1),index=y.index,dtype='float64')

    return y

# reduce the store (and create a new one by chunks)
in_store_file = data_store_file
in_store1 = pd.HDFStore(in_store_file)

# iter on store 1
for chunki, df1 in enumerate(in_store1.select('data',chunksize=2*chunksize)):
    print "processing [%s] [%s]" % (chunki,in_store_file)

    out_store_file = 'datastore_%s.h5' % chunki
    out_store = pd.HDFStore(out_store_file,'w')

    # iter on store 2
    in_store2 = pd.HDFStore(in_store_file)
    for df2 in in_store2.select('data',chunksize=chunksize):

        # concat & drop dups
        df = pd.concat([df1,df2]).drop_duplicates(['email','date'])

        # group and combine
        result = df.groupby('email').apply(combiner)

        # remove the multi-index (that we created in the groupby)
        result = result.reset_index('email',drop=True)

        # only store those rows which are in df2!
        result = result.reindex(index=df2.index).dropna()

        # store to the out_store
        out_store.append('data',result,min_itemsize={'email' : 100})

    in_store2.close()
    out_store.close()
    in_store_file = out_store_file

in_store1.close()

# show the reduced store
print pd.read_hdf(out_store_file,'data').sort(['email','diff'])
```