PONER naive algorithm
    <p>I have never really dealt with NLP, but I had an idea for NER that should NOT have worked and somehow DOES work exceptionally well in one case. I do not understand why it works, why it fails elsewhere, or whether it can be extended.</p>
    <p>The idea was to extract the names of the main characters in a story by:</p>
    <ol>
    <li>Building a dictionary keyed by each word in the text</li>
    <li>For each word, filling a list with the words that appear right next to it in the text (in the code below, the words that directly follow it)</li>
    <li>For each word, finding the word whose list correlates most with its own (meaning that the two words are used similarly in the text)</li>
    <li>Given one name of a character in the story, assuming that the words used like it are character names as well (bogus; this is what should not work, but since I had never dealt with NLP until this morning, I started the day naive)</li>
    </ol>
    <p>I ran the overly simple code (attached below) on <a href="http://www.umich.edu/~umfandsf/other/ebooks/alice30.txt" rel="nofollow">Alice in Wonderland</a>, which for "Alice" returns:</p>
    <blockquote> <p>21 ['Mouse', 'Latitude', 'William', 'Rabbit', 'Dodo', 'Gryphon', 'Crab', 'Queen', 'Duchess', 'Footman', 'Panther', 'Caterpillar', 'Hearts', 'King', 'Bill', 'Pigeon', 'Cat', 'Hatter', 'Hare', 'Turtle', 'Dormouse']</p> </blockquote>
    <p>Although it filters for words that start with an upper-case letter (and receives "Alice" as the word to cluster around), there are originally ~500 such words, and the result is still pretty spot on as far as the <a href="http://en.wikipedia.org/wiki/Alice%27s_Adventures_in_Wonderland#Characters" rel="nofollow">main characters</a> go.</p>
    <p>It does not work as well with other characters or in other stories, though it still gives interesting results.
</p>
<p>Any idea whether this approach is usable or extendable, or why it works at all in this story for "Alice"?</p>
<p>Thanks!</p>
<pre><code>#English Name recognition
import re
import sys
import random
from string import upper

def mimic_dict(filename):
    # Map each word to the list of words that directly follow it in the text.
    dict = {}
    f = open(filename)
    text = f.read()
    f.close()
    prev = ""
    words = text.split()
    for word in words:
        # Keep only the first run of word characters in each token.
        m = re.search("\w+",word)
        if m == None:
            continue
        word = m.group()
        if not prev in dict:
            dict[prev] = [word]
        else :
            dict[prev] = dict[prev] + [word]
        prev = word
    return dict

def main():
    if len(sys.argv) != 2:
        print 'usage: ./main.py file-to-read'
        sys.exit(1)

    dict = mimic_dict(sys.argv[1])

    # Count the words that start with an upper-case letter.
    upper = []
    for e in dict.keys():
        if len(e) &gt; 1 and e[0].isupper():
            upper.append(e)
    print len(upper),upper

    # Drop a hand-picked list of words that are not names.
    exclude = ["ME","Yes","English","Which","When","WOULD","ONE","THAT","That","Here","and","And","it","It","me"]
    exclude = [ x for x in exclude if dict.has_key(x)]
    for s in exclude :
        del dict[s]

    # For each word, find the other word that shares the most distinct followers.
    scores = {}
    for key1 in dict.keys():
        max = 0
        for key2 in dict.keys():
            if key1 == key2 : continue
            a = dict[key1]
            k = dict[key2]
            diff = []
            for ia in a:
                if ia in k and ia not in diff:
                    diff.append( ia)
            if len(diff) &gt; max:
                max = len(diff)
                scores[key1]=(key2,max)

    # Keep the capitalized words whose best match is "Alice".
    dictscores = {}
    names = []
    for e in scores.keys():
        if scores[e][0]=="Alice" and e[0].isupper():
            names.append(e)
    print len(names), names

if __name__ == '__main__':
    main()
</code></pre>
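<p>For readers on Python 3, the four steps above can be sketched more compactly. This is only a rough re-expression of the same idea, not the code that produced the result quoted above: the function names (<code>follower_sets</code>, <code>best_match</code>, <code>words_used_like</code>) are made up for illustration, sets replace the lists, the hand-picked exclusion list is left out, and the initial empty-string key is not created, so results on a real text may differ slightly.</p>

```python
import re
from collections import defaultdict


def follower_sets(text):
    """Map each word to the set of distinct words that directly follow it."""
    # As in the original code: split on whitespace and keep the first run
    # of word characters found in each token.
    matches = (re.search(r"\w+", token) for token in text.split())
    words = [m.group() for m in matches if m]
    followers = defaultdict(set)
    for prev, word in zip(words, words[1:]):
        followers[prev].add(word)
    return followers


def best_match(followers):
    """For each word, find the other word that shares the most followers."""
    best = {}
    for w1, f1 in followers.items():
        top, top_score = None, 0
        for w2, f2 in followers.items():
            if w1 == w2:
                continue
            score = len(f1 & f2)  # number of distinct shared followers
            if score > top_score:  # ties keep the first word encountered
                top, top_score = w2, score
        if top is not None:
            best[w1] = (top, top_score)
    return best


def words_used_like(text, target):
    """Capitalized words whose best-matching word is `target`."""
    best = best_match(follower_sets(text))
    return sorted(
        w for w, (match, _) in best.items()
        if match == target and w[0].isupper()
    )
```

<p>On a real text this would be called as <code>words_used_like(open("alice30.txt").read(), "Alice")</code>, where the file name is an assumption about where the linked text was saved. Note that the pairwise comparison is quadratic in the number of distinct words, just like the original loop.</p>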