StackOverflow2013


plurals

POCategorizing data based on the data's signature
primarykey
Id
4770974
data
AcceptedAnswerId
4771982
AnswerCount
2
ClosedDate
CommentCount
0
CommunityOwnedDate
CreationDate
2011-01-22T22:14:14.240
FavoriteCount
2
LastActivityDate
2011-01-23T15:03:59.157
LastEditDate
LastEditorUserId
0
OwnerUserId
21317
ParentId
0
PostTypeId
1
Score
2
ViewCount
188
LastEditorDisplayName
text
Body
Let us say I have some large collection of rows of data, where each element in the row is a (key, value) pair: <pre><code>1) [(bird, "eagle"), (fish, "cod"), ... , (soda, "coke")]
2) [(bird, "lark"), (fish, "bass"), ..., (soda, "pepsi")]
n) ....
n+1) [(bird, "robin"), (fish, "flounder"), ..., (soda, "fanta")]
</code></pre> I would like to be able to run some computation that determines, for a new row, which existing row is "most similar" to it. The most direct way I can think of to find the "most similar" row is to compare the new row against every other row. This is obviously computationally very expensive. I am looking for a solution of the following form. <ul> <li>A function that takes a row and generates some derived integer for that row. The returned integer would be a sort of "signature" of the row. The important property of this signature is that two very "similar" rows would generate very close integers, and two very "different" rows would generate distant integers. Obviously, identical rows would generate the same signature.</li> <li>I could then take these generated signatures, each paired with the index of the row it points to, and sort them all by signature. I would keep this data structure so that I can do fast lookups. Call it database B.</li> <li>When I have a new row and wish to know which existing row in database B is most similar, I would: <ol> <li>Generate a signature for the new row</li> <li>Binary search through the sorted list of (signature, index) pairs in database B for the closest match</li> <li>Return the closest matching row in database B (which could be a perfect match).</li> </ol></li> </ul> I know there is a lot of hand waving in this question. My problem is that I do not actually know what function would generate this signature. I have looked at Levenshtein distance, but that represents the cost of transforming one item into another, not a signature of a single item. I could also try lossy compression: two rows might be "bucketable" if they compress to the same thing. I am looking for other ideas on how to do this. Thank you.
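The lookup half of the scheme described above (sort the (signature, index) pairs, then binary search for the nearest signature) can be sketched as follows. This is a minimal illustration, not an answer to the open question: `toy_signature` is a hypothetical stand-in that simply sums the lengths of the values and does not preserve similarity in any meaningful way. The names `build_index` and `closest_row` are likewise assumptions made for this sketch.

```python
from bisect import bisect_left


def toy_signature(row):
    # Hypothetical placeholder: sum of the value lengths.
    # It is NOT a similarity-preserving signature; it only makes the sketch runnable.
    return sum(len(value) for _key, value in row)


def build_index(rows, signature=toy_signature):
    # "Database B": (signature, row index) pairs, sorted by signature.
    return sorted((signature(row), i) for i, row in enumerate(rows))


def closest_row(index, new_row, signature=toy_signature):
    # Return the index of the stored row whose signature is nearest to new_row's.
    if not index:
        raise ValueError("index is empty")
    sig = signature(new_row)
    # First entry whose signature is >= sig (row indices are never negative).
    pos = bisect_left(index, (sig, -1))
    # The nearest signature is either just below or at the insertion point.
    candidates = index[max(pos - 1, 0):pos + 1]
    _best_sig, best_i = min(candidates, key=lambda entry: abs(entry[0] - sig))
    return best_i


rows = [
    [("bird", "eagle"), ("fish", "cod"), ("soda", "coke")],
    [("bird", "lark"), ("fish", "bass"), ("soda", "pepsi")],
    [("bird", "robin"), ("fish", "flounder"), ("soda", "fanta")],
]
database_b = build_index(rows)
new_row = [("bird", "hawk"), ("fish", "trout"), ("soda", "sprite")]
match = closest_row(database_b, new_row)
```

Building the index costs one sort, and each lookup is a single binary search plus a comparison of at most two neighbours. Whether the returned row is actually "most similar" depends entirely on the signature function that replaces the placeholder.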
Tags
<database><hash><indexing><categorization>
Title
Categorizing data based on the data's signature
singulars
PostAcceptedAnswerId
1. PO
 singulars
 PostTypePostTypeId
 PTAnswer
PostParentId
1. This table or related slice is empty.
PostTypePostTypeId
1. PTQuestion
UserLastEditorUserId
1. This table or related slice is empty.
UserOwnerUserId
1. USStephen Cagle
plurals
PostLinksPostIdRelatedPostId
1. This table or related slice is empty.
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. This table or related slice is empty.
PostsParentIdCreationDate
1. PO
 singulars
 PostTypePostTypeId
 PTAnswer
2. PO
 singulars
 PostTypePostTypeId
 PTAnswer
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 POCategorizing data based on the data's signature
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
2. VO
 singulars
 PostPostId
 POCategorizing data based on the data's signature
 UserUserId
 USStephen Cagle
 VoteTypeVoteTypeId
 VTFavorite
3. VO
 singulars
 PostPostId
 POCategorizing data based on the data's signature
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
CommentsPostId
1. This table or related slice is empty.
