# memory usage of a probabilistic parser
I am writing a CKY parser for a Range Concatenation Grammar. I want to use a treebank as the grammar, so the grammar will be large. I've written a prototype [1] in Python, and it seems to work well when I simulate a treebank of a few dozen sentences, but its memory usage is unacceptable. I tried writing it in C++, but so far that has been very frustrating, as I have never used C++ before. Here is some data (n is the number of sentences the grammar is based on):

```
n    mem
 9   173M
18   486M
36   836M
```

This growth pattern is to be expected given the best-first algorithm, but the amount of overhead concerns me. The memory usage according to heapy is a factor of ten smaller than these numbers; valgrind reported something similar. What causes this discrepancy, and is there anything I can do about it in Python (or Cython)? Perhaps it's due to fragmentation? Or maybe it is the overhead of Python dictionaries?

Some background: the two important data structures are the agenda, which maps edges to probabilities, and the chart, which is a dictionary mapping nonterminals and positions to edges. The agenda is implemented with a heapdict (which internally uses a dict and a heapq list). The agenda is frequently inserted into and removed from; the chart only gets insertions and lookups. (A sketch of this setup appears at the end of the question.) I represent edges with tuples like this:

```
(("S", 111), ("NP", 010), ("VP", 100, 001))
```

The strings are the nonterminal labels from the grammar; the positions are encoded as bitmasks. There can be multiple positions when a constituent is discontinuous, so this edge could represent an analysis of "is Mary happy", where "is" and "happy" both belong to the VP. The chart dictionary is indexed by the first element of this edge, ("S", 111) in this case. In a new version I tried transposing this representation, in the hope that it would save memory through reuse:

```
(("S", "NP", "VP"), (111, 100, 011))
```

I figured that Python would store the first part only once if it occurred in combination with different positions, although I'm not actually sure this is true (see the check at the end of the question). In either case, it didn't seem to make any difference.

So basically, what I am wondering is whether it is worth pursuing my Python implementation any further, including doing things with Cython and different data structures, or whether writing it from the ground up in C++ is the only viable option.

**UPDATE**: After some improvements I no longer have issues with memory usage. I'm working on an optimized Cython version, and I'll award the bounty to the most useful suggestion for increasing the efficiency of the code. There is an annotated version at http://student.science.uva.nl/~acranenb/plcfrs_cython.html

[1] https://github.com/andreasvc/disco-dop/ -- run test.py to parse some sentences. Requires Python 2.6, [nltk](http://www.nltk.org/download) and [heapdict](http://pypi.python.org/pypi/HeapDict).
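---

Below is a minimal sketch of the agenda/chart setup described above, using the real heapdict package (`pip install HeapDict`). The edge, its priority, and the loop body are simplified placeholders rather than the actual parser logic:

```python
import heapdict

# Agenda: maps an edge to its priority. popitem() always yields the entry
# with the *lowest* value, so something like negative log probability works.
agenda = heapdict.heapdict()
# Chart: maps a (label, position) head to the edges derived for it.
chart = {}

edge = (("S", 0b111), ("NP", 0b010), ("VP", 0b100, 0b001))
agenda[edge] = 0.5                    # placeholder priority

while agenda:
    edge, prob = agenda.popitem()     # best-first: most promising edge next
    head = edge[0]                    # e.g. ("S", 0b111), the chart key
    chart.setdefault(head, []).append(edge)
    # ... combine edge with compatible chart entries here and push any
    # newly derived edges back onto the agenda ...
```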
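The bitmask encoding can be exercised with a few bit tricks. This is a hedged sketch, assuming bit i of a mask stands for word i of the sentence (the 111/010 notation above is schematic binary, written as 0b literals here):

```python
def span_mask(positions):
    """Encode a set of word indices as an integer bitmask."""
    mask = 0
    for i in positions:
        mask |= 1 << i
    return mask

def is_contiguous(mask):
    """True if the set bits form one unbroken run (no discontinuity)."""
    lowest = mask & -mask             # isolate the lowest set bit
    return (mask + lowest) & mask == 0

def overlaps(a, b):
    """Two constituents clash if they cover a common word."""
    return a & b != 0

# "is Mary happy" -> word indices 0, 1, 2
vp = span_mask([0, 2])                  # discontinuous VP: "is ... happy"
np = span_mask([1])                     # NP: "Mary"
assert not is_contiguous(vp)            # the VP really is discontinuous
assert not overlaps(vp, np)             # VP and NP cover disjoint words
assert vp | np == span_mask([0, 1, 2])  # together they span the sentence
```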
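As for whether Python stores the transposed label tuple only once: equal tuples are not shared automatically, but an explicit memoization table forces sharing. `intern_tuple` below is a hypothetical helper to illustrate the idea, not part of the parser:

```python
import sys

labels = ["S", "NP", "VP"]
a = tuple(labels)
b = tuple(labels)
print(a == b)              # True: equal values
print(a is b)              # False: two separate objects in memory

_canonical = {}
def intern_tuple(t):
    """Return one canonical instance per distinct tuple value."""
    return _canonical.setdefault(t, t)

a = intern_tuple(tuple(labels))
b = intern_tuple(tuple(labels))
print(a is b)              # True: duplicates now share a single object
print(sys.getsizeof(a))    # the per-tuple overhead that sharing avoids
```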
 
