<p><a href="http://en.wikipedia.org/wiki/Sentence_boundary_disambiguation" rel="nofollow noreferrer">Sentence boundary disambiguation</a> (SBD) is a central problem in the field of NLP. Unfortunately, the SBD tools I've found and used in the past aren't written in C (it's not the favourite language for string-based tasks, unless speed is a major issue).</p> <p><strong>Pipeline</strong></p> <p>If at all possible, I'd create a simple pipeline. On a Unix system this shouldn't be a problem, and even on Windows you should be able to fill in the gaps with a scripting language. This means you can pick the best SBD tool for the job, not merely the only one you could find for language Z. For example,</p> <pre><code>./pdfconvert | SBD | my_C_tool &gt; ... </code></pre> <p>This is the standard way we do things at my work, and unless you have stricter requirements than you've stated, it should be fine.</p> <p><strong>Tools</strong></p> <p>As for the tools you can use:</p> <ul> <li>I'd suggest MXTERMINATOR, an SBD tool using Maximum Entropy modelling, as my supervisors used it in their own work recently. According to them it did miss a few sentence splits, but that was easily fixed with a <a href="http://en.wikipedia.org/wiki/Sed" rel="nofollow noreferrer">sed script</a>. They were doing SBD on astronomical papers. 
The <a href="http://www.cis.upenn.edu/~adwait/" rel="nofollow noreferrer">main site</a> appears to be down at the moment, but there is an FTP mirror available <a href="ftp://ftp.cis.upenn.edu/pub/adwait/jmx/" rel="nofollow noreferrer">here</a>.</li> <li><a href="http://opennlp.sourceforge.net/" rel="nofollow noreferrer">OpenNLP</a> has a Java reimplementation of the above algorithm, also using Maximum Entropy modelling (<a href="http://opennlp.sourceforge.net/api/opennlp/tools/sentdetect/SentenceDetectorME.html" rel="nofollow noreferrer">JavaDoc</a>). It is more up to date and seems to have a stronger community behind it.</li> <li><a href="http://www.denkselbst.de/sentrick/index.html" rel="nofollow noreferrer">Sentrick</a> and many others also exist. For more, there is an older list <a href="http://www.uib.no/mailman/public/corpora/2007-October/005429.html" rel="nofollow noreferrer">here</a> that may be of use.</li> </ul> <p><strong>Models and Training</strong></p> <p>Some of these tools may give you good results out of the box, but some may not. OpenNLP includes a model for <a href="http://opennlp.sourceforge.net/README.html" rel="nofollow noreferrer">English sentence detection</a>, which may work for you. However, if your domain differs significantly from the one the tools were trained on, they may not perform well. For example, a tool trained on newspaper text may be very good at that task but perform poorly on letters.</p> <p>As such, you may want to train the SBD tool by giving it examples. Each of the tools should document this process, but I will warn you, it may be a bit of work. It requires running the tool on document X, going through and manually fixing any incorrect splits, and giving the correctly split document X back to the tool to train on. 
Depending on the size of the documents and the tool involved, you may need to do this for anywhere from one to a hundred documents before you get a reasonable result.</p> <p>Good luck, and if you have any questions, feel free to ask.</p>
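<p>As a postscript to the training loop described above: one rough way to judge whether a result is "reasonable" is to compare the tool's output against your manually corrected version of the same document. The sketch below is only an illustration, not part of any of the tools mentioned. It assumes both versions are available as lists of sentences (for example, one sentence per line of output), and the function names <code>boundary_offsets</code> and <code>split_scores</code> are hypothetical helpers made up for this example.</p>

```python
def boundary_offsets(sentences):
    """Return the set of offsets at which sentences end.

    Offsets are counted over the text with all whitespace removed, so
    differences in spacing or line wrapping do not affect the comparison.
    """
    offsets = set()
    position = 0
    for sentence in sentences:
        stripped = "".join(sentence.split())
        if not stripped:
            continue  # ignore blank lines
        position += len(stripped)
        offsets.add(position)
    return offsets


def split_scores(predicted, corrected):
    """Compare the tool's splits with the manually corrected splits.

    Returns (precision, recall) over sentence boundaries:
    precision = fraction of the tool's boundaries that are correct,
    recall    = fraction of the true boundaries that the tool found.
    Note that the end of the text counts as a boundary in both versions.
    """
    pred = boundary_offsets(predicted)
    gold = boundary_offsets(corrected)
    hits = len(pred & gold)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    # Made-up example: the tool missed the split after "bright."
    tool_output = [
        "Dr. Smith observed M31.",
        "It was bright. The data agree.",
    ]
    hand_fixed = [
        "Dr. Smith observed M31.",
        "It was bright.",
        "The data agree.",
    ]
    precision, recall = split_scores(tool_output, hand_fixed)
    print(f"precision={precision:.2f} recall={recall:.2f}")
```

<p>Low recall means the tool is missing splits (as my supervisors saw with MXTERMINATOR), while low precision means it is splitting where it shouldn't, for instance after abbreviations. Tracking both numbers across retraining rounds gives you a simple stopping criterion.</p>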