<p>Yes, I imagine what you're doing there is extremely slow, for a couple of reasons. I think you need to process your stopwords regex <em>before</em> you build up your string of a billion words from your corpus.</p>

<p>I have no idea what a .regex file is, but I'm going to presume it contains a legal Perl regular expression, something that you can compile using no more than:</p>

<pre><code>$stopword_string = `cat foo.regex`;
$stopword_rx     = qr/$stopword_string/;
</code></pre>

<p>That probably presumes that there's a <code>(?x)</code> at the start.</p>

<p>But if your stopword file is a list of lines, you will need to do something more like this:</p>

<pre><code>chomp(@stopwords = `cat foo.regex`);

# if each stopword is an independent regex:
$stopword_string = join "|" =&gt; @stopwords;

# else if each stopword is a literal
$stopword_string = join "|" =&gt; map {quotemeta} @stopwords;

# now compile it (maybe add some qr//OPTS)
$stopword_rx     = qr/\b(?:$stopword_string)\b/;
</code></pre>

<h2>WARNING</h2>

<p>Be <em>very</em> careful with <code>\b</code>: it's only going to do what you think it does above if every stopword begins and ends with an alphanumunder (a <code>\w</code> character). Otherwise, it will be asserting something you probably don't mean. If that could be a possibility, you will need to be more specific. The leading <code>\b</code> would need to become <code>(?:(?&lt;=\A)|(?&lt;=\s))</code>, and the trailing <code>\b</code> would need to become <code>(?=\s|\z)</code>. That's what most people <em>think</em> <code>\b</code> means, but it really doesn't.</p>

<p>Having done that, you should apply the stopword regex to the corpus as you're reading it in. The best way to do this is <strong>not</strong> to put stuff into your string in the first place that you'll just need to take out later.</p>

<p>So instead of doing</p>

<pre><code>$corpus_text = `cat some-giant-file`;
$corpus_text =~ s/$stopword_rx//g;
</code></pre>

<p>do this:</p>

<pre><code>my $corpus_path = "/some/path/goes/here";
open(my $corpus_fh, "&lt; :encoding(UTF-8)", $corpus_path)
    || die "$0: couldn't open $corpus_path: $!";
my $corpus_text = q##;
while (&lt;$corpus_fh&gt;) {
    chomp;              # or not
    s/$stopword_rx//g;  # strip the stopwords from this line only
    $corpus_text .= $_;
}
close($corpus_fh)
    || die "$0: couldn't close $corpus_path: $!";
</code></pre>

<p>That will be much faster than putting stuff in there that you just have to weed out again later.</p>

<p>My use of <code>cat</code> above is just a shortcut. I don't expect you to actually call a program, least of all <code>cat</code>, just to read in a single file, unprocessed and unmolested. ☺</p>
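<p>The <code>\b</code> warning above is easy to check with a small experiment. The following is only a sketch under assumptions: the stopword list (<code>the</code>, <code>c++</code>) and the two sample strings are made up for illustration, and each stopword is treated as a literal. It compiles the same alternation twice, once with <code>\b</code> and once with the whitespace-based lookarounds, and prints which one matches.</p>

```perl
use strict;
use warnings;

# Hypothetical stopword list: "c++" ends in non-word characters.
my @stopwords   = ('the', 'c++');
my $alternation = join "|" => map { quotemeta } @stopwords;

# Version 1: plain \b on both sides.
my $rx_b  = qr/\b(?:$alternation)\b/;

# Version 2: "preceded by start-of-string or whitespace,
# followed by whitespace or end-of-string".
my $rx_ws = qr/(?:(?<=\A)|(?<=\s))(?:$alternation)(?=\s|\z)/;

# Illustrative inputs: a standalone "c++" and "c++" glued to a word character.
for my $text ('I like c++ a lot', 'c++x') {
    printf "%-18s \\b: %-8s  ws: %s\n", $text,
        ($text =~ $rx_b  ? 'match' : 'no match'),
        ($text =~ $rx_ws ? 'match' : 'no match');
}
```

<p>With these inputs the <code>\b</code> version misses the standalone <code>c++</code>, because there is no word boundary between <code>+</code> and a space, yet it matches inside <code>c++x</code>. The lookaround version does the opposite, which is usually what is wanted for whitespace-delimited tokens.</p>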