<p>This is why your OS has multiprocessing pipelines.</p>
<pre><code>collapse.py sometweetfile | filter.py | user_id.py | user_split.py -d some_directory
</code></pre>
<p>collapse.py</p>
<pre><code>import sys

def emit(tweet):
    # Write one collapsed record as a single tab-separated line.
    t, u, w = tweet.get('T', ''), tweet.get('U', ''), tweet.get('W', '')
    sys.stdout.write("{0}\t{1}\t{2}\n".format(t, u, w))

# The input file is the first command-line argument.
with open(sys.argv[1], "r") as theFile:
    tweet = {}
    for line in theFile:
        if '\t' not in line:
            continue  # skip blank or malformed lines
        rec_type, content = line.rstrip('\n').split('\t', 1)
        if rec_type in tweet:
            # A repeated record type starts the next tweet.
            emit(tweet)
            tweet = {}
        tweet[rec_type] = content
    if tweet:
        emit(tweet)
</code></pre>
<p>filter.py</p>
<pre><code>import sys

for tweet in sys.stdin:
    t, u, w = tweet.split('\t')
    if 'No Post Title' in t:
        continue
    sys.stdout.write(tweet)
</code></pre>
<p>user_id.py</p>
<pre><code>import sys
from urllib.parse import urlparse

for tweet in sys.stdin:
    t, u, w = tweet.rstrip('\n').split('\t')
    # Assumes this field holds the user's URL; its path is the user id.
    path = urlparse(w).path.strip('/')
    result = "{0}\t{1}\t{2}\t{3}\n".format(t, u, w, path)
    sys.stdout.write(result)
</code></pre>
<p>user_split.py</p>
<pre><code>import argparse
import os
import sys

# The output directory comes from the -d option.
parser = argparse.ArgumentParser()
parser.add_argument('-d', dest='directory', required=True)
some_directory = parser.parse_args().directory

users = {}
for tweet in sys.stdin:
    t, u, w, user = tweet.rstrip('\n').split('\t')
    if user not in users:
        # May run afoul of open file limits...
        users[user] = open(os.path.join(some_directory, user), "w")
    users[user].write(tweet)
    users[user].flush()
for u in users:
    users[u].close()
</code></pre>
<p>Wow, you say. What a lot of code.</p>
<p>Yes. But. It spreads out among ALL the processing cores you own and it all runs concurrently. Also, when you connect stdout to stdin through a pipe, it's really only a shared in-memory buffer: there's no disk I/O between the stages.</p>
<p>It's amazingly fast to do things this way. That's why the <strong>*Nix</strong> operating systems work that way. This is what you need to do for real speed.</p>
<hr>
<p>The LRU algorithm, FWIW.</p>
<pre><code>    # tolu ("time of last use") must be set to 0 before the loop starts.
    if user not in users:
        # Only keep a limited number of files open
        if len(users) &gt; 64: # or whatever your OS limit is.
            # Compare on the counter only; file objects can't be ordered.
            lru, aFile, u = min(users.values(), key=lambda entry: entry[0])
            aFile.close()
            users.pop(u)
        # Append mode, so reopening an evicted file keeps its earlier lines.
        users[user] = [tolu, open(os.path.join(some_directory, user), "a"), user]
    users[user][1].write(tweet)
    users[user][1].flush() # may not be necessary
    tolu += 1
    users[user][0] = tolu
</code></pre>
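<p>The same LRU idea can be sketched with <code>collections.OrderedDict</code> instead of a hand-kept counter. This is only an illustration, not part of the pipeline above: <code>LRUFileCache</code>, <code>max_open</code> and the method names are hypothetical, and files are opened in append mode on the assumption that an evicted user's file may need to be reopened later.</p>

```python
import os
from collections import OrderedDict


class LRUFileCache:
    """Hypothetical helper: keep at most max_open output files open."""

    def __init__(self, directory, max_open=64):
        self.directory = directory
        self.max_open = max_open
        # Maps user -> open file, ordered from least to most recently used.
        self.handles = OrderedDict()

    def write(self, user, text):
        # Remove the handle if present so it can be re-inserted at the end.
        handle = self.handles.pop(user, None)
        if handle is None:
            if len(self.handles) >= self.max_open:
                # Evict and close the least recently used file.
                _, oldest = self.handles.popitem(last=False)
                oldest.close()
            # Append mode: a reopened file keeps the lines written earlier.
            handle = open(os.path.join(self.directory, user), "a")
        self.handles[user] = handle  # now the most recently used entry
        handle.write(text)

    def close(self):
        # Close every file that is still open.
        for handle in self.handles.values():
            handle.close()
        self.handles.clear()
```

<p>In user_split.py, the loop body would then reduce to something like <code>cache.write(user, tweet)</code>, followed by one <code>cache.close()</code> after the loop. Because of append mode, the output directory should start out empty on each run.</p>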