Building a lemmatizer: speed optimization
    <p>I am building a lemmatizer in Python. Because it has to run in real time and process fairly large amounts of data, processing speed is of the essence. Data: I have all possible suffixes, each linked to all the word types it can be combined with. Additionally, I have lemma forms that are linked to both their word type(s) and lemma(s). The program takes a word as input and outputs its lemma, where word = lemmaform + suffix.</p> <p>For example (note: although the example is given in English, I am not building a lemmatizer for English):</p> <p>word: forbidding</p> <p>lemmaform: forbidd</p> <p>suffix: ing</p> <p>lemma: forbid</p> <p>My solution:</p> <p>I have converted the data to (nested) dicts:</p> <pre><code>suffixdict : {suffix1:[type1,type2, ... , type(n)], suffix2:[type1,type2, ... , type(n)]} lemmaformdict : {lemmaform:{type1:lemma}} </code></pre> <p>1) Find all possible suffixes and the word types they are linked to. If the longest possible suffix is 3 characters long, the program tries to match 'ing', 'ng' and 'g' against the keys in suffixdict. If the key exists, the lookup returns a value (a set of word types).</p> <p>2) For each matching suffix, look up the remaining lemma form in lemmaformdict. If the lemma form exists, the lookup returns its word types.</p> <p>3) Finally, the program intersects the word types produced in steps 1) and 2), and if the intersection is non-empty it returns the lemma of the word.</p> <p>My question: could there be a better solution to my problem from the perspective of speed? (Disregard the option of keeping frequent words and their lemmas in a dictionary.) Help much appreciated.</p>
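Steps 1) to 3) could be sketched roughly as follows. This is only a minimal illustration of the lookup as described, not the actual program: the function name `lemmatize`, the parameter `max_suffix_len`, and the sample data are hypothetical, the word types in `suffixdict` are assumed to be stored as sets rather than lists so that they can be intersected directly, and only non-empty suffixes are tried.

```python
def lemmatize(word, suffixdict, lemmaformdict, max_suffix_len=3):
    """Return the set of candidate lemmas for `word` (possibly empty)."""
    lemmas = set()
    # Try trailing substrings from longest to shortest, keeping at least
    # one character for the lemma form.
    for n in range(min(max_suffix_len, len(word) - 1), 0, -1):
        lemmaform, suffix = word[:-n], word[-n:]

        # Step 1: word types this suffix can attach to.
        suffix_types = suffixdict.get(suffix)
        if suffix_types is None:
            continue

        # Step 2: word types (and lemmas) known for the remaining lemma form.
        type_to_lemma = lemmaformdict.get(lemmaform)
        if type_to_lemma is None:
            continue

        # Step 3: keep the lemmas whose word type is shared by both lookups.
        for wordtype in type_to_lemma.keys() & suffix_types:
            lemmas.add(type_to_lemma[wordtype])
    return lemmas


# Hypothetical sample data mirroring the "forbidding" example.
SUFFIXDICT = {"ing": {"verb"}, "s": {"verb", "noun"}}
LEMMAFORMDICT = {"forbidd": {"verb": "forbid"}}

result = lemmatize("forbidding", SUFFIXDICT, LEMMAFORMDICT)
```

Each word costs at most `max_suffix_len` pairs of dict lookups, so the work per word does not grow with the size of either dictionary.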