
Processing a huge file (9.1GB) and processing it faster -- Python
I have a 9GB text file of tweets in the following format:

    T      'time and date'
    U      'name of user in the form of a URL'
    W      Actual tweet

There are in total 6,000,000 users and more than 60,000,000 tweets. I read three lines at a time using itertools.izip_longest() and then, according to the user name, write the tweet into that user's file. But it's taking far too long (26 hours and counting). How can this be made faster?

Posting the code for completeness:

    import re
    import itertools
    from urlparse import urlparse

    s = 'the existing folder which will have all the files'
    with open('path to file') as f:
        for line1, line2, line3 in itertools.izip_longest(*[f] * 3):
            if line1 != '\n' and line2 != '\n' and line3 != '\n':
                line1 = line1.split('\t')
                line2 = line2.split('\t')
                line3 = line3.split('\t')
                if not re.search(r'No Post Title', line1[1]):
                    url = urlparse(line3[1].strip('\n')).path.strip('/')
                    if url == '':
                        file = open(s + 'junk', 'a')
                        file.write(line1[1])
                        file.close()
                    else:
                        file = open(s + url, 'a')
                        file.write(line1[1])
                        file.close()

My aim is to use topic modeling on the small texts (that is, running LDA on all the tweets of one user, which requires a separate file per user), but it is taking far too much time.

UPDATE: I used the suggestions by user S.Lott and now use the following code:

    import re
    import os
    import sys
    from urlparse import urlparse

    def getUser(result):
        result = result.split('\n')
        u, w = result[0], result[1]
        path = urlparse(u).path.strip('/')
        if path == '':
            f = open('path to junk', 'a')
            f.write('its Junk !!')
            f.close()
        else:
            result = "{0}\n{1}\n{2}\n".format(u, w, path)
            writeIntoFile(result)

    def writeIntoFile(result):
        tweet = result.split('\n')
        users = {}
        directory = 'path to directory'
        u, w, user = tweet[0], tweet[1], tweet[2]
        if user not in users:
            if os.path.isfile(directory + user):
                if len(users) > 64:
                    lru, aFile, u = min(users.values())
                    aFile.close()
                    users.pop(u)
                users[user] = open(directory + user, 'a')
                users[user].write(w + '\n')
                # users[user].flush
            elif not os.path.isfile(directory + user):
                if len(users) > 64:
                    lru, aFile, u = min(users.values())
                    aFile.close()
                    users.pop(u)
                users[user] = open(directory + user, 'w')
                users[user].write(w + '\n')
        for u in users:
            users[u].close()

    s = open(sys.argv[1], 'r')
    tweet = {}
    for l in s:
        r_type, content = l.split('\t')
        if r_type in tweet:
            u, w = tweet.get('U', ''), tweet.get('W', '')
            if not re.search(r'No Post Title', u):
                result = "{0}{1}".format(u, w)
                getUser(result)
            tweet = {}
        tweet[r_type] = content

Obviously, it is pretty much a mirror of what he suggested and kindly shared. Initially it was very fast, but then it slowed down again. I have posted the updated code so that I can get some more suggestions on how it could be made faster. When I read from sys.stdin I got an import error that I could not resolve, so to save time and get on with it I simply read the file given on the command line, hoping that it works and does so correctly. Thanks.
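For reference, here is a minimal sketch of the direction S.Lott's suggestion points in: stream the file once, group the T/U/W lines into one record per tweet, and keep a small bounded pool of open file handles with least-recently-used eviction, so that a file is not opened and closed for every single tweet. This is not the poster's code: OUT_DIR, MAX_OPEN, handle_for and process are illustrative names, the cap of 64 simply mirrors the figure used in the code above, and it targets the same Python 2 APIs (itertools, urlparse) the question uses.

    import os
    import itertools
    from urlparse import urlparse

    OUT_DIR = '/path/to/output/'   # assumed output directory (one file per user)
    MAX_OPEN = 64                  # assumed cap on simultaneously open files

    handles = {}                   # user -> (last_used_tick, open file object)
    tick = itertools.count()

    def handle_for(user):
        # Return an append handle for this user, evicting the least-recently-used
        # handle when the pool is full, instead of opening/closing per tweet.
        if user in handles:
            f = handles[user][1]
        else:
            if len(handles) >= MAX_OPEN:
                oldest = min(handles, key=lambda u: handles[u][0])
                handles.pop(oldest)[1].close()
            f = open(os.path.join(OUT_DIR, user), 'a')
        handles[user] = (next(tick), f)
        return f

    def process(path):
        # Stream the file once, grouping T/U/W lines into one record per tweet.
        record = {}
        with open(path) as src:
            for line in src:
                if line == '\n':
                    continue
                r_type, _, content = line.partition('\t')
                record[r_type] = content.rstrip('\n')
                if r_type == 'W':            # the 'W' line closes a record
                    user = urlparse(record.get('U', '')).path.strip('/') or 'junk'
                    handle_for(user).write(record.get('W', '') + '\n')
                    record = {}
        # Close whatever is still open once the whole file has been processed.
        for _, f in handles.values():
            f.close()

The point of the design is that the handles dictionary is created once and survives across all tweets; if it is reinitialised for every record, or every handle is closed after each record, the caching gives no benefit and the run degenerates back to one open/close pair per tweet.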