
Processing a huge file (9.1GB) and processing it faster -- Python
I have a 9GB text file of tweets in the following format:

    T      'time and date'
    U      'name of user in the form of a URL'
    W      Actual tweet

There are in total 6,000,000 users and more than 60,000,000 tweets. I read three lines at a time using itertools.izip_longest() and then, according to the user name, write the tweet into that user's file. But it's taking far too long (26 hours and counting). How can this be made faster?

Posting the code for completeness:

    import re
    import itertools
    from urlparse import urlparse

    s = 'the existing folder which will have all the files'
    with open('path to file') as f:
        for line1, line2, line3 in itertools.izip_longest(*[f] * 3):
            if line1 != '\n' and line2 != '\n' and line3 != '\n':
                line1 = line1.split('\t')
                line2 = line2.split('\t')
                line3 = line3.split('\t')
                if not re.search(r'No Post Title', line1[1]):
                    url = urlparse(line3[1].strip('\n')).path.strip('/')
                    if url == '':
                        file = open(s + 'junk', 'a')
                        file.write(line1[1])
                        file.close()
                    else:
                        file = open(s + url, 'a')
                        file.write(line1[1])
                        file.close()

My aim is to use topic modeling on the small texts (that is, running LDA on all the tweets of one user, which requires a separate file per user), but it is taking far too much time.

UPDATE: I used the suggestions by user S.Lott and now use the following code:

    import re
    import os
    import sys
    from urlparse import urlparse

    def getUser(result):
        result = result.split('\n')
        u, w = result[0], result[1]
        path = urlparse(u).path.strip('/')
        if path == '':
            f = open('path to junk', 'a')
            f.write('its Junk !!')
            f.close()
        else:
            result = "{0}\n{1}\n{2}\n".format(u, w, path)
            writeIntoFile(result)

    def writeIntoFile(result):
        tweet = result.split('\n')
        users = {}
        directory = 'path to directory'
        u, w, user = tweet[0], tweet[1], tweet[2]
        if user not in users:
            if os.path.isfile(directory + user):
                if len(users) > 64:
                    lru, aFile, u = min(users.values())
                    aFile.close()
                    users.pop(u)
                users[user] = open(directory + user, 'a')
                users[user].write(w + '\n')
                # users[user].flush
            elif not os.path.isfile(directory + user):
                if len(users) > 64:
                    lru, aFile, u = min(users.values())
                    aFile.close()
                    users.pop(u)
                users[user] = open(directory + user, 'w')
                users[user].write(w + '\n')
        for u in users:
            users[u].close()

    s = open(sys.argv[1], 'r')
    tweet = {}
    for l in s:
        r_type, content = l.split('\t')
        if r_type in tweet:
            u, w = tweet.get('U', ''), tweet.get('W', '')
            if not re.search(r'No Post Title', u):
                result = "{0}{1}".format(u, w)
                getUser(result)
            tweet = {}
        tweet[r_type] = content

Obviously, it is pretty much a mirror of what he suggested and kindly shared. Initially it was very fast, but then it slowed down again. I have posted the updated code so that I can get some more suggestions on how it could be made faster. When I read from sys.stdin I got an import error that I could not resolve, so to save time and get on with it I simply read the file given on the command line, hoping that it works and does so correctly. Thanks.
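For reference, here is a minimal sketch of the direction S.Lott's suggestion points in: stream the file once, group the T/U/W lines into one record per tweet, and keep a small bounded pool of open file handles with least-recently-used eviction, so that a file is not opened and closed for every single tweet. This is not the poster's code: OUT_DIR, MAX_OPEN, handle_for and process are illustrative names, the cap of 64 simply mirrors the figure used in the code above, and it targets the same Python 2 APIs (itertools, urlparse) the question uses.

    import os
    import itertools
    from urlparse import urlparse

    OUT_DIR = '/path/to/output/'   # assumed output directory (one file per user)
    MAX_OPEN = 64                  # assumed cap on simultaneously open files

    handles = {}                   # user -> (last_used_tick, open file object)
    tick = itertools.count()

    def handle_for(user):
        # Return an append handle for this user, evicting the least-recently-used
        # handle when the pool is full, instead of opening/closing per tweet.
        if user in handles:
            f = handles[user][1]
        else:
            if len(handles) >= MAX_OPEN:
                oldest = min(handles, key=lambda u: handles[u][0])
                handles.pop(oldest)[1].close()
            f = open(os.path.join(OUT_DIR, user), 'a')
        handles[user] = (next(tick), f)
        return f

    def process(path):
        # Stream the file once, grouping T/U/W lines into one record per tweet.
        record = {}
        with open(path) as src:
            for line in src:
                if line == '\n':
                    continue
                r_type, _, content = line.partition('\t')
                record[r_type] = content.rstrip('\n')
                if r_type == 'W':            # the 'W' line closes a record
                    user = urlparse(record.get('U', '')).path.strip('/') or 'junk'
                    handle_for(user).write(record.get('W', '') + '\n')
                    record = {}
        # Close whatever is still open once the whole file has been processed.
        for _, f in handles.values():
            f.close()

The point of the design is that the handles dictionary is created once and survives across all tweets; if it is reinitialised for every record, or every handle is closed after each record, the caching gives no benefit and the run degenerates back to one open/close pair per tweet.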