StackOverflow2013

<p><strong>Automatic Text Summarization</strong></p>
<p>It sounds like you're interested in <a href="http://en.wikipedia.org/wiki/Automatic_summarization" rel="noreferrer"><strong>automatic text summarization</strong></a>. For a nice overview of the problem, the issues involved, and the available algorithms, take a look at Das and Martins's paper <a href="http://www.cs.cmu.edu/~nasmith/LS2/das-martins.07.pdf" rel="noreferrer"><strong>A Survey on Automatic Text Summarization</strong></a> (2007).</p>
<p><strong>Simple Algorithm</strong></p>
<p>A simple but reasonably effective summarization algorithm is to select a limited number of sentences from the original text that contain the most frequent content words (i.e., the most frequent words not on a <a href="http://en.wikipedia.org/wiki/Stop_words" rel="noreferrer"><strong>stop list</strong></a>).</p>
<pre><code>Summarizer(originalText, maxSummarySize):
    // start with the raw freqs, e.g. [(10,'the'), (3,'language'), (8,'code')...]
    wordFrequencies = getWordCounts(originalText)
    // filter, e.g. [(3, 'language'), (8, 'code')...]
    contentWordFrequencies = filterStopWords(wordFrequencies)
    // sort by freq & drop counts, e.g. ['code', 'language'...]
    contentWordsSortedByFreq = sortByFreqThenDropFreq(contentWordFrequencies)

    // Split Sentences
    sentences = getSentences(originalText)

    // Select up to maxSummarySize sentences
    setSummarySentences = {}
    foreach word in contentWordsSortedByFreq:
        firstMatchingSentence = search(sentences, word)
        setSummarySentences.add(firstMatchingSentence)
        if setSummarySentences.size() == maxSummarySize:
            break

    // construct summary out of selected sentences, preserving original ordering
    summary = ""
    foreach sentence in sentences:
        if sentence in setSummarySentences:
            summary = summary + " " + sentence

    return summary
</code></pre>
<p>Some open-source packages that do summarization using this algorithm are:</p>
<p><strong>Classifier4J (Java)</strong></p>
<p>If you're using Java, you can use <a href="http://classifier4j.sourceforge.net/" rel="noreferrer"><strong>Classifier4J</strong></a>'s module <a href="http://classifier4j.sourceforge.net/subprojects/core/apidocs/net/sf/classifier4J/summariser/SimpleSummariser.html" rel="noreferrer">SimpleSummariser</a>.</p>
<p>Using the example found <a href="http://classifier4j.sourceforge.net/usage.html#Using_ISummariser" rel="noreferrer">here</a>, let's assume the original text is:</p>
<blockquote> Classifier4J is a java package for working with text. Classifier4J includes a summariser. A Summariser allows the summary of text. A Summariser is really cool. I don't think there are any other java summarisers.
</blockquote>
<p>As seen in the following snippet, you can easily create a simple one-sentence summary:</p>
<pre><code>// SimpleSummariser is the ISummariser implementation linked above
ISummariser summariser = new SimpleSummariser();
// Request a 1 sentence summary
String summary = summariser.summarise(longOriginalText, 1);
</code></pre>
<p>Using the algorithm above, this will produce: <code>Classifier4J includes a summariser.</code></p>
<p><strong>NClassifier (C#)</strong></p>
<p>If you're using C#, there's a port of Classifier4J to C# called <a href="http://nclassifier.sourceforge.net/" rel="noreferrer"><strong>NClassifier</strong></a>.</p>
<p><strong>Tristan Havelick's Summarizer for NLTK (Python)</strong></p>
<p>There's a work-in-progress Python port of Classifier4J's summarizer built with Python's <a href="http://www.nltk.org/" rel="noreferrer">Natural Language Toolkit (NLTK)</a> available <a href="http://groups.google.com/group/nltk-dev/browse_thread/thread/a95f5ee53b020478?pli=1" rel="noreferrer"><strong>here</strong></a>.</p>
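<p><strong>A Minimal Python Sketch</strong></p>
<p>If you'd like to try the simple algorithm without any of the packages above, something along these lines should work. This is only a sketch of the pseudocode, not the code Classifier4J or its ports actually use: the function name <code>summarize</code>, the tiny <code>STOP_WORDS</code> list, and the regex-based sentence and word splitting are all assumptions made for illustration, and a real stop list would be much longer.</p>

```python
import re
from collections import Counter

# Tiny illustrative stop list (an assumption; real stop lists are far longer).
STOP_WORDS = frozenset(
    "a an and any are for i in is of other the there think to with "
    "don't really".split()
)


def summarize(original_text, max_summary_size, stop_words=STOP_WORDS):
    """Pick up to max_summary_size sentences containing the most frequent content words."""
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [
        s.strip()
        for s in re.split(r"(?<=[.!?])\s+", original_text)
        if s.strip()
    ]
    # Lower-cased word tokens for each sentence, so matching ignores case.
    tokenized = [re.findall(r"[a-z0-9']+", s.lower()) for s in sentences]

    # Count content words only (everything not on the stop list).
    counts = Counter(
        word for words in tokenized for word in words if word not in stop_words
    )
    # most_common() sorts by descending count; ties keep first-seen order.
    ranked_words = [word for word, _ in counts.most_common()]

    # For each word, select the index of the first sentence that contains it.
    chosen = set()
    for word in ranked_words:
        if len(chosen) >= max_summary_size:
            break
        for index, words in enumerate(tokenized):
            if word in words:
                chosen.add(index)
                break

    # Rebuild the summary in the original sentence order.
    return " ".join(sentences[i] for i in sorted(chosen))


text = (
    "Classifier4J is a java package for working with text. "
    "Classifier4J includes a summariser. "
    "A Summariser allows the summary of text. "
    "A Summariser is really cool. "
    "I don't think there are any other java summarisers."
)
print(summarize(text, 1))  # Classifier4J includes a summariser.
```

<p>On the sample text, the most frequent content word is "summariser" (three occurrences), and the first sentence containing it is the second one, which matches the result quoted above. Unlike the pseudocode, this sketch checks the size limit before adding a sentence, so a limit of 0 returns an empty string.</p>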
