<p><strong>N-gram Language Models</strong></p>
<p>You could try training one <strong><a href="http://en.wikipedia.org/wiki/N-gram" rel="nofollow noreferrer">n-gram language model</a></strong> on the autogenerated spam pages and another on a collection of other, non-spam webpages.</p>
<p>You could then simply score new pages with both language models to see whether the text looks more like the spam webpages or regular web content.</p>
<p><strong>Better Scoring through Bayes Law</strong></p>
<p>When you score a text with the spam language model, you get an estimate of the probability of finding that text on a spam web page, <code>P(Text|Spam)</code>. The notation is read as the probability of <code>Text</code> given <code>Spam</code> (page). The score from the non-spam language model is an estimate of the probability of finding the text on a non-spam web page, <code>P(Text|Non-Spam)</code>.</p>
<p>However, the term you probably really want is <code>P(Spam|Text)</code> or, equivalently, <code>P(Non-Spam|Text)</code>. That is, you want to know <strong>the probability that a page is <code>Spam</code> or <code>Non-Spam</code> given the text that appears on it</strong>.</p>
<p>To get either of these, you'll need to use <a href="http://en.wikipedia.org/wiki/Bayes%27_theorem" rel="nofollow noreferrer"><strong>Bayes Law</strong></a>, which states</p>
<pre><code>         P(B|A)P(A)
P(A|B) = ----------
            P(B)
</code></pre>
<p>Using Bayes Law, we have</p>
<pre><code>P(Spam|Text) = P(Text|Spam)P(Spam)/P(Text)
</code></pre>
<p>and</p>
<pre><code>P(Non-Spam|Text) = P(Text|Non-Spam)P(Non-Spam)/P(Text)
</code></pre>
<p><code>P(Spam)</code> is your <strong>prior belief</strong> that a page selected at random from the web is a spam page. You can estimate this quantity by counting how many spam web pages there are in some sample, or you can even use it as a parameter that you manually <strong>tune to trade off <a href="http://en.wikipedia.org/wiki/Precision_and_recall" rel="nofollow noreferrer">precision and recall</a></strong>. For example, giving this parameter a high value will result in fewer spam pages being mistakenly classified as non-spam, while giving it a low value will result in fewer non-spam pages being accidentally classified as spam.</p>
<p>The term <code>P(Text)</code> is the overall probability of finding <code>Text</code> on any webpage. If we ignore that <code>P(Text|Spam)</code> and <code>P(Text|Non-Spam)</code> were determined using different models, it can be calculated as <code>P(Text) = P(Text|Spam)P(Spam) + P(Text|Non-Spam)P(Non-Spam)</code>. This sums out the binary variable <code>Spam</code>/<code>Non-Spam</code>.</p>
<p><strong>Classification Only</strong></p>
<p>However, if you're not going to use the probabilities for anything else, you don't need to calculate <code>P(Text)</code>. Rather, you can just compare the numerators <code>P(Text|Spam)P(Spam)</code> and <code>P(Text|Non-Spam)P(Non-Spam)</code>. If the first one is bigger, the page is most likely a spam page, while if the second one is bigger the page is most likely non-spam. This works because the equations above for both <code>P(Spam|Text)</code> and <code>P(Non-Spam|Text)</code> are normalized by the <strong>same</strong> <code>P(Text)</code> value.</p>
<p><strong>Tools</strong></p>
<p>In terms of software toolkits you could use for something like this, <a href="http://www-speech.sri.com/projects/srilm/download.html" rel="nofollow noreferrer">SRILM</a> would be a good place to start, and it's free for non-commercial use. If you want to use something commercially and you don't want to pay for a license, you could use <a href="http://sourceforge.net/projects/irstlm/" rel="nofollow noreferrer">IRSTLM</a>, which is distributed under the LGPL.</p>
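To make the two-model idea concrete, here is a minimal toy sketch (not SRILM or IRSTLM) of a word-bigram language model with add-one smoothing. The class name, the tiny corpora, and the tokenization are all illustrative assumptions; a real system would train on crawled pages and use a proper smoothing scheme such as Kneser-Ney.

```python
import math
from collections import Counter

class BigramLM:
    """Toy word-bigram language model with add-one (Laplace) smoothing.
    Illustrative only; real toolkits use far better smoothing."""

    def __init__(self, documents):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for doc in documents:
            # Pad each document with sentence-boundary markers.
            tokens = ["<s>"] + doc.lower().split() + ["</s>"]
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.vocab_size = len(self.unigrams)

    def log_prob(self, text):
        """Return log P(text | this model), in nats."""
        tokens = ["<s>"] + text.lower().split() + ["</s>"]
        logp = 0.0
        for prev, word in zip(tokens, tokens[1:]):
            # Add-one smoothing so unseen bigrams don't get probability zero.
            num = self.bigrams[(prev, word)] + 1
            den = self.unigrams[prev] + self.vocab_size
            logp += math.log(num / den)
        return logp

# Hypothetical miniature corpora standing in for real crawled pages.
spam_docs = ["buy cheap pills online now", "cheap pills buy now cheap"]
ham_docs = ["the weather was nice this weekend",
            "we discussed the results of the study"]

spam_lm = BigramLM(spam_docs)
ham_lm = BigramLM(ham_docs)

page = "buy cheap pills now"
print(spam_lm.log_prob(page))  # log P(Text|Spam)
print(ham_lm.log_prob(page))   # log P(Text|Non-Spam)
```

Working in log probabilities avoids numerical underflow, since the raw probabilities of longer texts become vanishingly small.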
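The classification-only shortcut above can be sketched in a few lines: compare the two Bayes-rule numerators in log space, so `P(Text)` never has to be computed. The function name, the log-likelihood values, and the prior of 0.2 are illustrative assumptions, not outputs of any particular model.

```python
import math

def classify(log_p_text_given_spam, log_p_text_given_ham, p_spam=0.2):
    """Compare the Bayes-rule numerators in log space:
    log P(Text|Spam) + log P(Spam)  vs  log P(Text|Non-Spam) + log P(Non-Spam).
    The shared denominator P(Text) cancels out of the comparison."""
    spam_score = log_p_text_given_spam + math.log(p_spam)
    ham_score = log_p_text_given_ham + math.log(1.0 - p_spam)
    return "spam" if spam_score > ham_score else "non-spam"

# Illustrative scores; in practice these come from the two language models.
print(classify(-7.9, -13.0, p_spam=0.2))   # likelihood strongly favors spam
print(classify(-10.0, -9.5, p_spam=0.05))  # a low prior tips a close call to non-spam
```

The second call shows the precision/recall trade-off described above: lowering `p_spam` makes the classifier more reluctant to call a borderline page spam.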