# Optimizing a Python script extracting and processing large data files
I am new to Python and naively wrote a Python script for the following task:

I want to create a bag-of-words representation of multiple objects. Each object is basically a `<movie name, movie synopsis>` pair, and a bag-of-words representation of the synopsis is to be made, so each object is converted to `<movie name, bag-of-words vector>` in the final documents.

Here is the script:

```python
import re
import math
import itertools
from nltk.corpus import stopwords
from nltk import PorterStemmer
from collections import defaultdict
from collections import Counter
from itertools import dropwhile
import sys, getopt

inp = "inp_6000.txt"   # input file name
out = "bowfilter10"    # output file name

with open(inp, 'r') as plot_data:
    main_dict = Counter()
    file1, file2 = itertools.tee(plot_data, 2)
    line_one = itertools.islice(file1, 0, None, 4)
    line_two = itertools.islice(file2, 2, None, 4)
    dictionary = defaultdict(Counter)
    doc_count = defaultdict(Counter)
    for movie_name, movie_plot in itertools.izip(line_one, line_two):
        movie_plot = movie_plot.lower()
        # split into words
        words = re.findall(r'\w+', movie_plot, flags=re.UNICODE | re.LOCALE)
        # remove stop words (NLTK)
        elemStopW = filter(lambda x: x not in stopwords.words('english'), words)
        for word in elemStopW:
            # use the NLTK stemmer class to do the stemming
            word = PorterStemmer().stem_word(word)
            # increment the count of this word in this movie's synopsis
            dictionary[movie_name][word] += 1
            # increment the count of the word in the main dictionary,
            # which stores frequencies over all documents
            main_dict[word] += 1
            # This is done to calculate term frequency-inverse document
            # frequency: take note of the first occurrence of the word in
            # the synopsis and neglect all others
            if doc_count[word]['this_mov'] == 0:
                doc_count[word].update(count=1, this_mov=1)
        for word in doc_count:
            doc_count[word].update(this_mov=-1)

# remove all words with frequency less than 5 in the whole set of movies
for key, count in dropwhile(lambda key_count: key_count[1] >= 5, main_dict.most_common()):
    del main_dict[key]

# calculate the bag-of-words vectors and write them to file
bow_vec = open(out, 'w')
m = len(dictionary)
for movie_name in dictionary.keys():
    vector = []
    for word in list(main_dict):
        x = dictionary[movie_name][word] * math.log(m / doc_count[word]['count'], 2)
        vector.append(x)
    # write to file
    bow_vec.write("%s" % movie_name)
    for item in vector:
        bow_vec.write("%s," % item)
    bow_vec.write("\n")
```

Format of the data file and additional information about the data. The data file has the following format:

```
<Movie Name>
<Empty Line>
<Movie Synopsis>    (one can assume its size to be around 150 words)
<Empty Line>
```

Note: `<*>` is only meant for representation.

Size of the input file:
The file is around 200 MB.

As of now this script takes around **10-12 hours** on a 3 GHz Intel processor.

> Note: I am looking for improvements to the serial code. I know parallelization would improve it, but I want to look into that later. I want to take this opportunity to make the serial code more efficient.

Any help is appreciated.
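One thing I already suspect (though I have not profiled it carefully): the inner loop rebuilds the stop word list for every synopsis and constructs a fresh `PorterStemmer` for every word. Below is a minimal sketch of the fix I have in mind; `cached_stem` and `_stem_cache` are just illustrative names of mine, and I am assuming it is safe to memoize stem results.

```python
from nltk.corpus import stopwords
from nltk import PorterStemmer

# Build the stop word set once. stopwords.words() returns a list, so
# `x not in stopwords.words('english')` in the script above does a linear
# scan of a freshly built list for every single word in every synopsis.
stop_set = set(stopwords.words('english'))

# One stemmer instance for the whole run, plus a memo table: synopsis
# vocabulary repeats heavily, so most words need to be stemmed only once.
stemmer = PorterStemmer()
_stem_cache = {}

def cached_stem(word):
    # Return the cached stem if this word has been seen before;
    # otherwise stem it once and remember the result.
    if word not in _stem_cache:
        _stem_cache[word] = stemmer.stem_word(word)  # same call as in my script
    return _stem_cache[word]
```

With these, the filtering line would become `elemStopW = [x for x in words if x not in stop_set]` and the stemming line `word = cached_stem(word)`: set membership is a constant-time hash lookup instead of a scan over a list rebuilt per line, and one stemmer object replaces one per word. Is this the right direction, and what else should I change?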