Counting bi-gram frequencies
<p>I've written a piece of code that counts word frequencies and writes them to an ARFF file for use with Weka. I'd like to alter it so that it counts bi-gram frequencies, i.e. pairs of adjacent words instead of single words, but my attempts so far have been unsuccessful.</p>
<p>I realise there's a lot to look at, but any help on this is greatly appreciated. Here's my code:</p>
<pre><code>import re
import nltk

# Quran subset
filename = raw_input('Enter name of file to convert to ARFF with extension, eg. name.txt: ')

# create list of lower case words
word_list = re.split(r'\s+', file(filename).read().lower())
print 'Words in text:', len(word_list)
# punctuation and numbers to be removed
punctuation = re.compile(r'[-.?!,":;()|0-9]')
word_list = [punctuation.sub("", word) for word in word_list]

word_list2 = [w.strip() for w in word_list if w.strip() not in nltk.corpus.stopwords.words('english')]

# create dictionary of word:frequency pairs
freq_dic = {}

for word in word_list2:
    # form dictionary
    try:
        freq_dic[word] += 1
    except KeyError:
        freq_dic[word] = 1

print '-'*30

print "sorted by highest frequency first:"
# create list of (val, key) tuple pairs
freq_list2 = [(val, key) for key, val in freq_dic.items()]
# sort by val or frequency
freq_list2.sort(reverse=True)
freq_list3 = list(freq_list2)
# display result as top 10 most frequent words
freq_list4 = []
freq_list4 = freq_list3[:10]

words = []

for item in freq_list4:
    a = str(item[1])
    a = a.lower()
    words.append(a)

f = open(filename)

newlist = []

for line in f:
    line = punctuation.sub("", line)
    line = line.lower()
    newlist.append(line)

f2 = open('Lines.txt', 'w')

newlist2 = []
for line in newlist:
    line = line.split()
    newlist2.append(line)
    f2.write(str(line))
    f2.write("\n")

print newlist2

# ARFF Creation

arff = open('output.arff', 'w')
arff.write('@RELATION wordfrequency\n\n')
for word in words:
    arff.write('@ATTRIBUTE ')
    arff.write(str(word))
    arff.write(' numeric\n')

arff.write('@ATTRIBUTE class {endofworld, notendofworld}\n\n')
arff.write('@DATA\n')
# Counting word frequencies for each verse
for line in newlist2:
    word_occurrences = str("")
    for word in words:
        matches = int(0)
        for item in line:
            if str(item) == str(word):
                matches = matches + int(1)
            else:
                continue
        word_occurrences = word_occurrences + str(matches) + ","
    word_occurrences = word_occurrences + "endofworld"
    arff.write(word_occurrences)
    arff.write("\n")

print words
</code></pre>
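<p>One possible way to move from single words to bi-grams is to pair each token with its successor and count the pairs instead of the words. The following is only a sketch, not a drop-in replacement. It assumes Python 3 (the code above is Python 2), it forms pairs within each line (verse) so that no pair spans a line boundary, and it joins the two words with an underscore to get an ARFF attribute name without spaces. The helper names <code>tokenize</code>, <code>bigrams</code>, <code>top_bigrams</code>, <code>attribute_name</code> and <code>arff_rows</code> are made up for illustration, and the stopword set is passed in by the caller rather than loaded from NLTK.</p>

```python
import re
from collections import Counter

# Same character class as in the question: punctuation and digits to strip.
PUNCTUATION = re.compile(r'[-.?!,":;()|0-9]')


def tokenize(line, stopwords=frozenset()):
    """Lower-case a line, strip punctuation/digits, and drop stopwords."""
    cleaned = PUNCTUATION.sub("", line.lower())
    return [word for word in cleaned.split() if word not in stopwords]


def bigrams(tokens):
    """Return adjacent pairs: ['a', 'b', 'c'] gives [('a', 'b'), ('b', 'c')]."""
    return list(zip(tokens, tokens[1:]))


def top_bigrams(lines, n=10, stopwords=frozenset()):
    """Return the n most frequent bi-grams over all lines (pairs stay within a line)."""
    counts = Counter()
    for line in lines:
        counts.update(bigrams(tokenize(line, stopwords)))
    return [pair for pair, _ in counts.most_common(n)]


def attribute_name(pair):
    """Assumed naming scheme: join the two words with an underscore."""
    return "_".join(pair)


def arff_rows(lines, pairs, label="endofworld", stopwords=frozenset()):
    """Build one comma-separated @DATA row per line, counting each chosen bi-gram."""
    rows = []
    for line in lines:
        per_line = Counter(bigrams(tokenize(line, stopwords)))
        values = [str(per_line[pair]) for pair in pairs]
        rows.append(",".join(values + [label]))
    return rows
```

<p>With these helpers, the ARFF header loop would write <code>attribute_name(pair)</code> for each selected pair instead of a single word, and the <code>@DATA</code> section would write the strings returned by <code>arff_rows</code>. If NLTK is already a dependency, <code>nltk.bigrams</code> could stand in for the hand-written <code>bigrams</code> helper.</p>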