Fast Text Preprocessing
<p>My project works with text in general, and I have found that preprocessing can be very slow, so I would like to ask whether you know how to optimize my code. The flow is:</p>
<p>get HTML page -> (convert to plain text -> stemming -> remove stop words) -> further text processing</p>
<p>The steps in brackets are the preprocessing steps. The application runs in about 10.265 s, of which preprocessing takes 9.18 s! This is the time for preprocessing 50 HTML pages (excluding downloading).</p>
<p>I use the HtmlAgilityPack library to convert HTML into plain text. This is quite fast: it takes 2.5 ms to convert one document, so it's relatively OK.</p>
<p>The problem comes later. Stemming one document takes up to 120 ms. Unfortunately, the HTML pages are in Polish, and no stemmer for Polish written in C# exists. I know of only two free ones, both written in Java: Stempel and Morfologik. I precompiled stempel.jar to stempel.dll with the help of IKVM, so there is nothing more I can do about that step.</p>
<p>Eliminating stop words also takes a lot of time (~70 ms per document). It is done like this:</p>
<pre><code>result = Regex.Replace(text.ToLower(), @"(([-]|[.]|[-.]|[0-9])?[0-9]*([.]|[,])*[0-9]+)|(\b\w{1,2}\b)|([^\w])", " ");
while (stopwords.MoveNext())
{
    string stopword = stopwords.Current.ToString();
    result = Regex.Replace(result, "(\\b"+stopword+"\\b)", " ");
}
return result;
</code></pre>
<p>First I remove all numbers, special characters, and 1- and 2-letter words. Then, in a loop, I remove the stop words, one <code>Regex.Replace</code> call per stop word. There are about 270 stop words.</p>
<p>Is it possible to make this faster?</p>
<p><strong>Edit:</strong></p>
<p>What I want to do is remove everything that is not a word longer than 2 letters. So I want to get rid of all special characters (including '.', ',', '?', '!', etc.), numbers, and stop words. I need only pure words that I can use for data mining.</p>
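For reference, the goal stated in the edit (keep only words longer than 2 letters that are not stop words) can be sketched as a single pass over the text instead of roughly 270 separate `Regex.Replace` calls. This is only a minimal sketch, not a measured fix: the class name `StopWordFilter`, the method `Clean`, and the choice of a `HashSet<string>` for the stop words are assumptions made for illustration. It also treats only Unicode letters as word characters, so unlike the `\w`-based code in the question it drops underscores and any digits glued to words.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper; the names here are illustrative, not from the question.
public static class StopWordFilter
{
    // Runs of three or more Unicode letters. Numbers, punctuation and
    // 1- or 2-letter words never match, so they are dropped implicitly.
    private static readonly Regex Word =
        new Regex(@"\p{L}{3,}", RegexOptions.Compiled);

    public static string Clean(string text, ISet<string> stopwords)
    {
        var sb = new StringBuilder(text.Length);
        foreach (Match m in Word.Matches(text.ToLowerInvariant()))
        {
            // Constant-time set lookup replaces one regex pass per stop word.
            if (stopwords.Contains(m.Value)) continue;
            if (sb.Length > 0) sb.Append(' ');
            sb.Append(m.Value);
        }
        return sb.ToString();
    }
}
```

A usage example under the same assumptions, with a made-up two-word stop list:

```csharp
var stop = new HashSet<string> { "oraz", "jest" };
Console.WriteLine(StopWordFilter.Clean("To jest 12.5 test, oraz przykład!", stop));
// prints "test przykład"
```

The stop-word set is built once and reused for every document, so the per-document cost is one regex scan plus one hash lookup per word.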