# Optimizing a Python script extracting and processing large data files
I am new to Python and naively wrote a Python script for the following task:

I want to create a bag-of-words representation of multiple objects. Each object is basically a `<movie name, movie synopsis>` pair, and a bag-of-words representation of the synopsis is to be made, so each object is converted to `<movie name, bag-of-words vector>` in the final documents.

Here is the script:

```python
import re
import math
import itertools
from nltk.corpus import stopwords
from nltk import PorterStemmer
from collections import defaultdict
from collections import Counter
from itertools import dropwhile
import sys, getopt

inp = "inp_6000.txt"   # input file name
out = "bowfilter10"    # output file name

with open(inp, 'r') as plot_data:
    main_dict = Counter()
    file1, file2 = itertools.tee(plot_data, 2)
    line_one = itertools.islice(file1, 0, None, 4)
    line_two = itertools.islice(file2, 2, None, 4)
    dictionary = defaultdict(Counter)
    doc_count = defaultdict(Counter)
    for movie_name, movie_plot in itertools.izip(line_one, line_two):
        movie_plot = movie_plot.lower()
        # split into words
        words = re.findall(r'\w+', movie_plot, flags=re.UNICODE | re.LOCALE)
        # remove stop words (NLTK)
        elemStopW = filter(lambda x: x not in stopwords.words('english'), words)
        for word in elemStopW:
            # use the NLTK stemmer class to do the stemming
            word = PorterStemmer().stem_word(word)
            # increment the count of this word in this movie's synopsis
            dictionary[movie_name][word] += 1
            # increment the count of the word in the main dictionary,
            # which stores frequencies over all documents
            main_dict[word] += 1
            # This is done to calculate term frequency-inverse document
            # frequency: take note of the first occurrence of the word in
            # the synopsis and neglect all others
            if doc_count[word]['this_mov'] == 0:
                doc_count[word].update(count=1, this_mov=1)
        for word in doc_count:
            doc_count[word].update(this_mov=-1)

# remove all words with frequency less than 5 in the whole set of movies
for key, count in dropwhile(lambda key_count: key_count[1] >= 5, main_dict.most_common()):
    del main_dict[key]

# calculate the bag-of-words vectors and write them to file
bow_vec = open(out, 'w')
m = len(dictionary)
for movie_name in dictionary.keys():
    vector = []
    for word in list(main_dict):
        x = dictionary[movie_name][word] * math.log(m / doc_count[word]['count'], 2)
        vector.append(x)
    # write to file
    bow_vec.write("%s" % movie_name)
    for item in vector:
        bow_vec.write("%s," % item)
    bow_vec.write("\n")
```

Format of the data file and additional information about the data. The data file has the following format:

```
<Movie Name>
<Empty Line>
<Movie Synopsis>    (one can assume its size to be around 150 words)
<Empty Line>
```

Note: `<*>` is only meant for representation.

Size of the input file:
The file is around 200 MB.

As of now this script takes around **10-12 hours** on a 3 GHz Intel processor.

> Note: I am looking for improvements to the serial code. I know parallelization would improve it, but I want to look into that later. I want to take this opportunity to make the serial code more efficient.

Any help is appreciated.
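One thing I already suspect (though I have not profiled it carefully): the inner loop rebuilds the stop word list for every synopsis and constructs a fresh `PorterStemmer` for every word. Below is a minimal sketch of the fix I have in mind; `cached_stem` and `_stem_cache` are just illustrative names of mine, and I am assuming it is safe to memoize stem results.

```python
from nltk.corpus import stopwords
from nltk import PorterStemmer

# Build the stop word set once. stopwords.words() returns a list, so
# `x not in stopwords.words('english')` in the script above does a linear
# scan of a freshly built list for every single word in every synopsis.
stop_set = set(stopwords.words('english'))

# One stemmer instance for the whole run, plus a memo table: synopsis
# vocabulary repeats heavily, so most words need to be stemmed only once.
stemmer = PorterStemmer()
_stem_cache = {}

def cached_stem(word):
    # Return the cached stem if this word has been seen before;
    # otherwise stem it once and remember the result.
    if word not in _stem_cache:
        _stem_cache[word] = stemmer.stem_word(word)  # same call as in my script
    return _stem_cache[word]
```

With these, the filtering line would become `elemStopW = [x for x in words if x not in stop_set]` and the stemming line `word = cached_stem(word)`: set membership is a constant-time hash lookup instead of a scan over a list rebuilt per line, and one stemmer object replaces one per word. Is this the right direction, and what else should I change?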