Writing a tokenizer in Python
<p>I want to design a custom tokenizer module in Python that lets users specify which tokenizer(s) to apply to the input. For instance, consider the following input:</p>
<blockquote> <p>Q: What is a good way to achieve this? A: I am not so sure. I think I will use Python.</p> </blockquote>
<p>I want to offer <a href="http://nltk.org/api/nltk.tokenize.html">NLTK's sentence tokenization</a>, <code>sent_tokenize()</code>, as an option because it works well in many situations and I don't want to reinvent the wheel. I also want to provide a finer-grained tokenization builder (something along the lines of a rule engine). Let me explain:</p>
<p>Assume that I provide a few tokenizers:</p>
<pre><code>SENTENCE # Tokenizes the given input by using sent_tokenize()
WORD     # Tokenizes the given input by using word_tokenize()
QA       # Tokenizes using a custom regular expression. E.g., Q: (.*?) A: (.*?)
</code></pre>
<p>I want to support rules as follows:</p>
<ol>
<li>QA -> SENTENCE: Apply the QA tokenizer first, followed by the sentence tokenizer</li>
<li>QA: Apply just the QA tokenizer</li>
</ol>
<p>Therefore, the expected output is as follows:</p>
<p><strong>1. QA -> SENTENCE</strong></p>
<pre><code>[
    ('QUESTION',
        ('SENTENCE', 'What is a good way to achieve this?'),
    ),
    ('ANSWER',
        ('SENTENCE', 'I am not so sure', 'I think I will use Python')
    )
]
</code></pre>
<p><strong>2. QA</strong></p>
<pre><code>[
    ('QUESTION', 'What is a good way to achieve this?'),
    ('ANSWER', 'I am not so sure. I think I will use Python')
]
</code></pre>
<p>What is a good design to achieve this efficiently?</p>
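<p>To make the rule chaining concrete, here is a minimal sketch of one possible design, not a definitive implementation. It assumes each tokenizer is a plain function from a string to a list of (label, text) pairs, and that a rule such as "QA -> SENTENCE" applies each stage to the chunks produced by the previous one. The names <code>TOKENIZERS</code>, <code>qa_tokenizer</code>, <code>sentence_tokenizer</code> and <code>apply_rule</code> are hypothetical, and the regex-based sentence splitter is only a stand-in so the example runs without NLTK data; <code>nltk.tokenize.sent_tokenize</code> could be substituted for it.</p>

```python
import re


def qa_tokenizer(text):
    """Split 'Q: ... A: ...' input into labeled question and answer chunks."""
    match = re.search(r"Q:\s*(.*?)\s*A:\s*(.*)", text, re.DOTALL)
    if not match:
        # Assumption: input that is not in Q/A form is passed through unchanged.
        return [("TEXT", text)]
    return [("QUESTION", match.group(1)), ("ANSWER", match.group(2).strip())]


def sentence_tokenizer(text):
    """Naive stand-in for nltk.tokenize.sent_tokenize (an assumption, not NLTK)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [("SENTENCE", part) for part in parts if part]


# Hypothetical registry mapping rule names to tokenizer functions.
TOKENIZERS = {
    "QA": qa_tokenizer,
    "SENTENCE": sentence_tokenizer,
}


def _apply(names, text):
    """Apply the first tokenizer, then recurse into each chunk with the rest."""
    if not names:
        return text
    head, rest = names[0], names[1:]
    return [(label, _apply(rest, chunk)) for label, chunk in TOKENIZERS[head](text)]


def apply_rule(rule, text):
    """Parse a rule such as 'QA -> SENTENCE' and run the stages in order."""
    names = [name.strip() for name in rule.split("->")]
    return _apply(names, text)


SAMPLE = ("Q: What is a good way to achieve this? "
          "A: I am not so sure. I think I will use Python.")

if __name__ == "__main__":
    print(apply_rule("QA", SAMPLE))
    print(apply_rule("QA -> SENTENCE", SAMPLE))
```

<p>Note that this sketch nests a list of (label, text) pairs at every level, which differs slightly from the tuple layout in the expected output above, and that its stand-in splitter keeps the trailing punctuation on each sentence. Adding a new tokenizer is then a matter of registering one more function in the dictionary.</p>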