Note that there are some explanatory texts on larger screens.

plurals
  1. POPython: question about parsing human-readable text
    text
    copied!<p>I'm parsing human-readable scientific text that is mostly in the field of chemistry. What I'm interested in is breaking the text into a list of words, scientific terms (more on that below), and punctuation marks.</p> <p>So for example, I expect the text "hello, world." to break into 4 tokens: 1) "hello"; 2) comma; 3) "world" and 4) period. Note that spaces don't require specialized tokens. </p> <p>The problem is related to the "scientific terms": these are names of chemical formulas such as "1-methyl-4-phenylpyridinium". Anyone who has ever learned chemistry knows these formulas can get quite long and may contain numbers, dashes and commas, and sometimes even parentheses, but I think it's safe to assume these lovely expressions can't contain spaces. Also, I believe these expressions must start with a number. I would like each such expression to come out as a single token.</p> <p>Today I use manual parsing to find "chunks" of text that begin with a number and end with either a space, a line break, or a punctuation mark followed by either a space or line break. </p> <p>I wondered if there's a smart solution (regex or other) I can use to tokenize the text according to the above specifications. I'm working in Python but this may be language agnostic.</p> <p>An example input (obviously disregard the content...):</p> <p>"Hello. 1-methyl-4-phenylpyridinium is ultra-bad. However, 1-methyl-4-phenyl-1,2,3,6-tetrahydropyridine is worse."</p> <p>Example output (each token in its own line):</p> <pre><code>Hello . 1-methyl-4-phenylpyridinium is ultra - bad . However , 1-methyl-4-phenyl-1,2,3,6-tetrahydropyridine is worse . </code></pre>
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload