<p><strong>Iteratively replacing words is going to be the biggest bottleneck in your implementation.</strong> On each iteration, you have to scan the entire string for the stopword, then the replace operation has to allocate a new string and populate it with the text post-replacement. That's not going to be fast.</p>
<p><strong>A much more efficient approach is to tokenize the string and perform replacement in a streaming fashion.</strong> Divide the input into individual words separated by whatever whitespace or separator characters are appropriate. You can do this incrementally, so you don't need to allocate any additional memory to do so. For each word (token), you can now perform a lookup in a hashset of stopwords: if you find a match, you replace it as you stream out the final text to a separate <code>StringBuilder</code>. If the token is not a stopword, just stream it out to the <code>StringBuilder</code> unmodified. This approach should have O(n) performance, as it only scans the string once and uses a <code>HashSet</code> to perform stopword lookup.</p>
<p>Below is one approach that I would expect to perform better. While it isn't fully streaming (it uses <code>String.Split()</code>, which allocates an array of additional strings), it does all of the processing in a single pass. Refining the code to avoid allocating additional strings is probably not going to provide much of an improvement, since you still need to extract substrings to compare against your stopwords.</p>
<p>The code below returns the words of the input, excluding all stopwords and any words two letters or shorter. It uses case-insensitive comparison on the stopwords as well.</p>
<pre><code>using System;
using System.Collections.Generic;
using System.Linq;

public IEnumerable&lt;string&gt; SplitIntoWords( string input, IEnumerable&lt;string&gt; stopwords )
{
    // use case-insensitive comparison when matching stopwords
    var comparer = StringComparer.InvariantCultureIgnoreCase;
    var stopwordsSet = new HashSet&lt;string&gt;( stopwords, comparer );
    var splitOn = new char[] { ' ', '\t', '\r', '\n' };

    // if your splitting is more complicated, you could use Regex instead...
    // if this becomes a bottleneck, you could loop over the string using
    // string.IndexOf() - but you would still need to allocate an extra string
    // to perform comparison, so it's unclear if that would be better or not
    var words = input.Split( splitOn, StringSplitOptions.RemoveEmptyEntries );

    // return all words longer than 2 letters that are not stopwords...
    return words.Where( w =&gt; !stopwordsSet.Contains( w ) &amp;&amp; w.Length &gt; 2 );
}
</code></pre>
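<p>For comparison, here is a minimal sketch of the streaming variant described earlier, where tokens are written to a <code>StringBuilder</code> as they are found. The class name <code>StopwordFilter</code> and method name <code>RemoveStopwords</code> are hypothetical, and the sketch assumes that tokens are delimited by whitespace, that stopwords are dropped rather than replaced, and that runs of whitespace may be collapsed to a single space in the output.</p>

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class StopwordFilter
{
    // Hypothetical helper (not part of the original answer): scans the input
    // once and appends every non-stopword token to a StringBuilder.
    public static string RemoveStopwords( string input, IEnumerable<string> stopwords )
    {
        var stopwordsSet = new HashSet<string>( stopwords, StringComparer.InvariantCultureIgnoreCase );
        var output = new StringBuilder( input.Length );
        int i = 0;

        while( i < input.Length )
        {
            // skip any whitespace before the next token
            while( i < input.Length && char.IsWhiteSpace( input[i] ) ) i++;

            // advance to the end of the token
            int start = i;
            while( i < input.Length && !char.IsWhiteSpace( input[i] ) ) i++;
            if( i == start ) break; // only trailing whitespace was left

            // a substring is still allocated here so the set can compare it
            string token = input.Substring( start, i - start );
            if( stopwordsSet.Contains( token ) ) continue;

            // assumption: tokens are re-joined with a single space
            if( output.Length > 0 ) output.Append( ' ' );
            output.Append( token );
        }

        return output.ToString();
    }
}
```

<p>As noted above, each token still costs one substring allocation, so whether this beats the <code>String.Split()</code> version in practice is something to measure rather than assume.</p>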
 
