StackOverflow2013

plurals

PO
primarykey
Id
4065298
data
AcceptedAnswerId
0
AnswerCount
0
ClosedDate
CommentCount
1
CommunityOwnedDate
CreationDate
2010-10-31T21:43:45.437
FavoriteCount
0
LastActivityDate
2010-10-31T22:26:08.180
LastEditDate
2010-10-31T22:26:08.180
LastEditorUserId
471272
OwnerUserId
471272
ParentId
4065188
PostTypeId
2
Score
5
ViewCount
0
LastEditorDisplayName
text
Body
Yes, I imagine what you're doing there is extremely slow, for a couple of reasons. I think you need to process your stopwords regex before you build up your string of a billion words from your corpus.
I have no idea what a .regex file is, but I'm going to presume it contains a legal Perl regular expression, something that you can compile using no more than:
<pre><code>$stopword_string = `cat foo.regex`;
$stopword_rx     = qr/$stopword_string/;
</code></pre>
That probably presumes that there's a <code>(?x)</code> at the start.
But if your stopword file is a list of lines, you will need to do something more like this:
<pre><code>chomp(@stopwords = `cat foo.regex`);

# if each stopword is an independent regex:
$stopword_string = join "|" => @stopwords;

# else if each stopword is a literal
$stopword_string = join "|" => map {quotemeta} @stopwords;

# now compile it (maybe add some qr//OPTS)
$stopword_rx = qr/\b(?:$stopword_string)\b/;
</code></pre>
<h2>WARNING</h2>
Be very careful with <code>\b</code>: it's only going to do what you think it does above if the first and last character of every stopword is an alphanumunder (a <code>\w</code> character). Otherwise, it will be asserting something you probably don't mean. If that could be a possibility, you will need to be more specific. The leading <code>\b</code> would need to become <code>(?:(?<=\A)|(?<=\s))</code>, and the trailing <code>\b</code> would need to become <code>(?=\s|\z)</code>. That's what most people think <code>\b</code> means, but it really doesn't.
Having done that, you should apply the stopword regex to the corpus as you're reading it in. The best way to do this is to never put into your string the stuff that you'll just need to take out later.
So instead of doing
<pre><code>$corpus_text = `cat some-giant-file`;
$corpus_text =~ s/$stopword_rx//g;
</code></pre>
do this:
<pre><code>my $corpus_path = "/some/path/goes/here";
open(my $corpus_fh, "< :encoding(UTF-8)", $corpus_path)
    || die "$0: couldn't open $corpus_path: $!";
my $corpus_text = q##;
while (my $line = <$corpus_fh>) {
    chomp $line;                 # or not
    $line =~ s/$stopword_rx//g;  # strip the stopwords, keep the rest of the line
    $corpus_text .= $line;
}
close($corpus_fh)
    || die "$0: couldn't close $corpus_path: $!";
</code></pre>
That will be much faster than putting stuff in there that you just have to weed out again later.
My use of <code>cat</code> above is just a shortcut. I don't expect you to actually call a program, least of all <code>cat</code>, just to read in a single file, unprocessed and unmolested. ☺
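To make the <code>\b</code> warning concrete, here is a minimal sketch. The two-word stopword list, the sample line, and the helper name <code>strip_stopwords</code> are all made up for illustration; only the two patterns come from the text above. The stopword <code>c++</code> ends in a non-word character, so the <code>\b</code> version should fail to remove it, while the lookaround version should remove both words.

```perl
use strict;
use warnings;

# Hypothetical stopword list: "c++" ends in non-word characters.
my @stopwords   = ('the', 'c++');
my $alternation = join '|', map { quotemeta } @stopwords;

# \b on both sides: only reliable when every stopword starts and ends with \w.
my $rx_b = qr/\b(?:$alternation)\b/;

# Whitespace-or-edge lookarounds, as described in the warning.
my $rx_ws = qr/(?:(?<=\A)|(?<=\s))(?:$alternation)(?=\s|\z)/;

# Hypothetical helper: return a copy of $text with every match of $rx removed.
sub strip_stopwords {
    my ($text, $rx) = @_;
    (my $copy = $text) =~ s/$rx//g;
    return $copy;
}

my $line = 'the c++ compiler';
print '[', strip_stopwords($line, $rx_b),  "]\n";   # "c++" survives here
print '[', strip_stopwords($line, $rx_ws), "]\n";   # both stopwords removed
```

Neither version tidies up the whitespace left behind; collapsing runs of spaces would be a separate step if it matters for your corpus.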
Tags
Title
singulars
PostAcceptedAnswerId
1. This table or related slice is empty.
PostParentId
1. PO How can I remove stop words from a large text file?
 singulars
 PostTypePostTypeId
 PT Question
PostTypePostTypeId
1. PT Answer
UserLastEditorUserId
1. US tchrist
UserOwnerUserId
1. US tchrist
plurals
PostLinksPostIdRelatedPostId
1. This table or related slice is empty.
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. This table or related slice is empty.
PostsParentIdCreationDate
1. This table or related slice is empty.
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
2. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
3. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
CommentsPostId
1. CO Tip for those reading: nobody sane should use `$arg = \`cat file\``; they should use `$arg = File::Slurp::slurp($file)` or similar.
 singulars
 PostPostId
 PO
 UserUserId
 US Kent Fredric
