StackOverflow2013

Note that there are some explanatory texts on larger screens.

plurals

PO
primarykey
Id
1587064
data
AcceptedAnswerId
0
AnswerCount
0
ClosedDate
CommentCount
0
CommunityOwnedDate
CreationDate
2009-10-19T05:14:25.520
FavoriteCount
0
LastActivityDate
2009-10-19T05:55:22.100
LastEditDate
2009-10-19T05:55:22.100
LastEditorUserId
166686
OwnerUserId
166686
ParentId
1587020
PostTypeId
2
Score
4
ViewCount
0
LastEditorDisplayName
text
Body
The ability of identifying similar documents/pages, whether web pages or more general forms of text or even of codes, has many practical applications. This topics is well represented in scholarly papers and also in less specialized forums. In spite of this relative wealth of documentation, it can be difficult to find the information and techniques relevant to a particular case. By describing the specific problem at hand and associated requirements, it may be possible to provide you more guidance. In the meantime the following provides a few general ideas. Many different functions may be used to measure, in some fashion, the similarity of pages. Selecting one (or possibly several) of these functions depends on various factors, including the amount of time and/or space one can allot the problem and also to the level of tolerance desired for noise. Some of the simpler metrics are: <ul> <li>length of the longest common sequence of words</li> <li>number of common words</li> <li>number of common sequences of words of more than n words</li> <li>number of common words for the top n most frequent words within each document.</li> <li>length of the document</li> </ul> Some of the metrics above work better when normalized (for example to avoid favoring long pages which, through their sheer size have more chances of having similar words with other pages) More complicated and/or computationally expensive measurements are: <ul> <li>Edit distance (which is in fact a generic term as there are many ways to measure the Edit distance. In general, the idea is to measure how many [editing] operations it would take to convert one text to the other.)</li> <li>Algorithms derived from the Ratcliff/Obershelp algorithm (but counting words rather than letters)</li> <li>Linear algebra-based measurements</li> <li>Statistical methods such as Bayesian fitlers </li> </ul> In general, we can distinguish measurements/algorithms where most of the calculation can be done once for each document, followed by a extra pass aimed at comparing or combining these measurements (with relatively little extra computation), as opposed to the algorithms that require to deal with the documents to be compared in pairs. Before choosing one (or indeed several such measures, along with some weighing coefficients), it is important to consider additional factors, beyond the similarity measurement per-se. for example, it may be beneficial to... <ul> <li>normalize the text in some fashion (in the case of web pages, in particular, similar page contents, or similar paragraphs are made to look less similar because of all the "decorum" associated with the page: headers, footers, advertisement panels, different markup etc.)</li> <li>exploit markup (ex: giving more weight to similarities found in the title or in tables, than similarities found in plain text.</li> <li>identify and eliminate domain-related (or even generally known) expressions. For example two completely different documents may appear similar is they have in common two "boiler plate" paragraphs pertaining to some legal disclaimer or some general purpose description, not truly associated with the essence of each cocument's content.</li> </ul>
Tags
Title
singulars
PostAcceptedAnswerId
1. This table or related slice is empty.
PostParentId
1. POhow to get the similar texts from a lot of pages?
 singulars
 PostTypePostTypeId
 PTQuestion
PostTypePostTypeId
1. PTAnswer
UserLastEditorUserId
1. USmjv
UserOwnerUserId
1. USmjv
plurals
PostLinksPostIdRelatedPostId
1. This table or related slice is empty.
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. This table or related slice is empty.
PostsParentIdCreationDate
1. This table or related slice is empty.
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
2. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
3. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
CommentsPostId
1. This table or related slice is empty.

Querying!

Guidance

A row detail

Detail views are divided into sections. All the information in the data section comes from columns in the selected row. The other sections display data from other, related rows.

Related data can be related in a to-one or a to-many fashion. Captions of data related in a to-many fashion link to a list view showing a filtered view of the table.

Try moving around until you find a non-empty to-many entry and click on the label to get to one. You can move back to the root by clicking on the database name in the header.