StackOverflow2013


plurals

PO
primarykey
Id
4284899
data
AcceptedAnswerId
0
AnswerCount
0
ClosedDate
CommentCount
0
CommunityOwnedDate
CreationDate
2010-11-26T11:33:56.130
FavoriteCount
0
LastActivityDate
2010-11-26T14:11:17.923
LastEditDate
2010-11-26T14:11:17.923
LastEditorUserId
484814
OwnerUserId
484814
ParentId
4275954
PostTypeId
2
Score
2
ViewCount
0
LastEditorDisplayName
text
Body
The problem is two-fold, and both are on your side. When you figure out how to deal with that, writing the code into a program (Java or SQL) will be easy. I'll name them first and then identify the solutions. <ol> <li>For some unknown reason, you have assumed that collecting product descriptions from multiple sites will not collect the same product.</li> <li>You are used to the common and nonsensical <code>Id</code> column, which is fine when you are working with spreadsheets prototyping functionality; but it is nowhere near what is required for a database or Development-level functionality. Your users (or boss) have naturally expected database capability from the database, and you did not provide any. (And no, it does not require fuzzy string logic or magic of any kind.)</li> </ol> Solution This is a condensed version of the <a href="http://www.idef.com/IDEF1x.htm" rel="nofollow">IDEF1X</a> Standard for modelling Relational Databases; the portion re Identifiers. <ol> <li>You need to think in database terms, and think about the database tables you need to perform your function, which means you are not allowed to use an auto-increment <code>Id</code> column. That column gives a spreadsheet a <code>RowId</code>, but it does not imply anything about the content of the table, or the columns that identify a product.</li> <li>And you cannot simply rip data off another website, you need to think about what your website requires for products. What does your company understand a product to be, and how does it identify a product ?</li> <li>Identify all the columns and datatypes for the columns.</li> <li>Identify which columns are mandatory and which are optional.</li> <li>Identify which are strong Identifiers. Eg. <code>Manufacturer</code> and <code>Model</code>; the short <code>Product Name</code>, not the long <code>Description</code> (or maybe, for your company, the long description is an Identifier). 
Work with your users, and work that out.</li> <li>You will find you actually have a small cluster of tables around <code>Product</code>, such as <code>Manufacturer</code>, <code>ProductType</code>, perhaps <code>Vendor</code>, etc.</li> <li>Organise those tables, and Normalise them, so that you are not duplicating data.</li> <li>Make sure you treat those Identifiers with a bit of respect. Choose which will be unique. Those are Candidate Keys. You need at least one per table, and there will be more than one in <code>Product</code>. All the Identifiers that will be searched on will need to be indexed (Unique or not). Note that Unique Indices cannot be Nullable, so you cannot choose an optional column.</li> <li>What makes a single Unique Identifier for <code>Product</code> may not be a single column. That's ok, we can evaluate multiple columns for keys in databases; they are called Compound Keys.</li> <li>Take the best, most stable (one which will not change) Unique Identifier, one of the Candidate Keys, and make that the Primary Key. </li> <li>If, and only if, the Unique Identifier, the Primary Key, which may be a Compound Key, is very long, and therefore unsuitable for a Primary Key, which is migrated to the child tables, then add a Surrogate Key. That will be the <code>Id</code> column. Note that that is an additional column and additional Index. It is not a substitute for the Identifiers of <code>Product</code>, the Candidate Keys; they cannot be removed.</li> </ol> So far we have a Product database on your company's side of the web, that is meaningful to it. Now we are in a position to evaluate products from the other side of the web; and when we do, we have a framework on our side that is strong, against which we can measure the rubbish that we get from the other side of the web. 
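The key rules in the Solution list can be sketched in a few lines. This is only an illustration under stated assumptions, not the schema the question actually needs: it uses Python's built-in `sqlite3` module to run the SQL, and the column `ManufacturerCode`, the choice to make `ProductName` unique, and the helpers `build_catalog` and `add_product` are all hypothetical names chosen for the example. The compound key (`ManufacturerCode`, `Model`) is taken as the Primary Key, and no auto-increment `Id` is used.

```python
import sqlite3

# Hypothetical schema: the real Identifiers must come from your users.
SCHEMA = """
CREATE TABLE Manufacturer (
    ManufacturerCode TEXT NOT NULL PRIMARY KEY,
    Name             TEXT NOT NULL UNIQUE
);
CREATE TABLE Product (
    ManufacturerCode TEXT NOT NULL
        REFERENCES Manufacturer (ManufacturerCode),
    Model            TEXT NOT NULL,
    ProductName      TEXT NOT NULL,   -- mandatory, assumed to be an Identifier
    Description      TEXT,            -- optional, so never part of a key
    PRIMARY KEY (ManufacturerCode, Model),   -- Compound Key
    UNIQUE (ProductName)                     -- a second Candidate Key
);
"""


def build_catalog():
    """Create an in-memory database holding the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default
    conn.executescript(SCHEMA)
    return conn


def add_product(conn, manufacturer_code, model, product_name, description=None):
    """Insert a product; return False if any key constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO Product "
                "(ManufacturerCode, Model, ProductName, Description) "
                "VALUES (?, ?, ?, ?)",
                (manufacturer_code, model, product_name, description),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

With keys like these, the database itself refuses a second row for the same `Manufacturer` and `Model`, which is the capability a bare `Id` column cannot give you.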
Feeds <ol> <li>You need a <code>WebSite</code> table to manage the feeds.</li> <li>There will be an Associative table (many-to-many) between <code>Product</code> and <code>WebSite</code>. Let's call it <code>ProductSite</code>. It will contain only our <code>ProductId</code>, and the <code>WebSiteCode</code>. It may contain <code>Price</code>. The contents are valid for a single feed cycle.</li> <li>Load each feed into a staging database or schema, an incoming <code>ProductIn</code> table, maybe one per source website. This is just the flat file from the external source. Add a column <code>IsValid</code> and set the Default to true.</li> <li>Then write some SQL that compares that <code>ProductIn</code> table, with its loose and floppy contents, with our <code>Product</code> table with its strong Identifiers. <ul> <li>The way I would do it is, several waves of separate checks, each marking the rows that fail by setting <code>IsValid</code> to false. At the end Insert the <code>IsValid</code> rows into our <code>ProductSite</code>. </li> <li>You might be lucky, and get away with an optimistic approach. That is, as long as you find a match on a few important columns, the match is valid. (reverse the Default and update of the <code>IsValid</code> boolean).</li> <li>This is the proc that will require some back-and-forth work, until it settles down. That is why you need to work with your users re the Identifiers. The goal is to exclude no external products, but your starting point will exclude many. That will include going back to our <code>Product</code> table and improving the content (values in the rows) of the Identifiers, and other relevant columns that you use to identify matching rows.</li> </ul></li> <li>Repeat for each WebSite.</li> <li>Now populate our website from our <code>Product</code> table, using information that we are confident about, and show which sites have the product for sale from <code>ProductSite</code>.</li> </ol>
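The staging-and-waves approach in the Feeds list might look roughly like the following. It is a hedged sketch, again using Python's built-in `sqlite3` module, and several things in it are assumptions rather than part of the answer: the feed is taken to supply only `Manufacturer`, `Model` and `Price`; our `Product` rows are assumed to hold upper-case, trimmed Identifiers; `ProductId` is assumed to be a Surrogate Key sitting beside the compound Candidate Key; and `RowNo`, `WAVES` and `load_feed` are hypothetical names.

```python
import sqlite3

SCHEMA = """
CREATE TABLE Product (
    ProductId        INTEGER PRIMARY KEY,   -- Surrogate Key (assumed)
    ManufacturerCode TEXT NOT NULL,
    Model            TEXT NOT NULL,
    UNIQUE (ManufacturerCode, Model)        -- the real Identifier
);
CREATE TABLE WebSite (
    WebSiteCode TEXT NOT NULL PRIMARY KEY
);
CREATE TABLE ProductSite (
    ProductId   INTEGER NOT NULL REFERENCES Product (ProductId),
    WebSiteCode TEXT    NOT NULL REFERENCES WebSite (WebSiteCode),
    Price       REAL,
    PRIMARY KEY (ProductId, WebSiteCode)
);
CREATE TABLE ProductIn (                    -- staging: the flat feed
    RowNo        INTEGER PRIMARY KEY,
    Manufacturer TEXT,
    Model        TEXT,
    Price        REAL,
    IsValid      INTEGER NOT NULL DEFAULT 1
);
"""

# Each wave is a separate check that marks failing rows IsValid = 0.
WAVES = [
    # Wave 1: an Identifier is missing or blank.
    """UPDATE ProductIn SET IsValid = 0
       WHERE Manufacturer IS NULL OR TRIM(Manufacturer) = ''
          OR Model IS NULL OR TRIM(Model) = ''""",
    # Wave 2: no matching row in our Product table.
    """UPDATE ProductIn SET IsValid = 0
       WHERE IsValid = 1 AND NOT EXISTS (
           SELECT 1 FROM Product p
           WHERE p.ManufacturerCode = UPPER(TRIM(ProductIn.Manufacturer))
             AND p.Model = UPPER(TRIM(ProductIn.Model)))""",
    # Wave 3: the feed repeats a product; keep only the first row.
    """UPDATE ProductIn SET IsValid = 0
       WHERE IsValid = 1 AND EXISTS (
           SELECT 1 FROM ProductIn d
           WHERE d.IsValid = 1 AND d.RowNo < ProductIn.RowNo
             AND UPPER(TRIM(d.Manufacturer)) = UPPER(TRIM(ProductIn.Manufacturer))
             AND UPPER(TRIM(d.Model)) = UPPER(TRIM(ProductIn.Model)))""",
]


def load_feed(conn, web_site_code, feed_rows):
    """Stage one feed cycle, run the waves, refresh ProductSite for the site.

    Returns (accepted, rejected) row counts.
    """
    with conn:  # one transaction per feed cycle
        conn.execute("DELETE FROM ProductIn")
        conn.executemany(
            "INSERT INTO ProductIn (Manufacturer, Model, Price) VALUES (?, ?, ?)",
            feed_rows,
        )
        for wave in WAVES:
            conn.execute(wave)
        # ProductSite contents are valid for a single feed cycle only.
        conn.execute(
            "DELETE FROM ProductSite WHERE WebSiteCode = ?", (web_site_code,)
        )
        conn.execute(
            """INSERT INTO ProductSite (ProductId, WebSiteCode, Price)
               SELECT p.ProductId, ?, i.Price
               FROM ProductIn i
               JOIN Product p
                 ON p.ManufacturerCode = UPPER(TRIM(i.Manufacturer))
                AND p.Model = UPPER(TRIM(i.Model))
               WHERE i.IsValid = 1""",
            (web_site_code,),
        )
        accepted = conn.execute(
            "SELECT COUNT(*) FROM ProductIn WHERE IsValid = 1"
        ).fetchone()[0]
        rejected = conn.execute(
            "SELECT COUNT(*) FROM ProductIn WHERE IsValid = 0"
        ).fetchone()[0]
    return accepted, rejected
```

The rejected rows stay in `ProductIn` after the run, so they can be reviewed with the users when deciding whether to tighten a wave or improve the Identifier values in `Product`.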
Tags
Title
singulars
PostAcceptedAnswerId
1. This table or related slice is empty.
PostParentId
1. POHow to identify duplicate items gathered from multiple feeds and link to them in a Database
 singulars
 PostTypePostTypeId
 PTQuestion
PostTypePostTypeId
1. PTAnswer
UserLastEditorUserId
1. USPerformanceDBA
UserOwnerUserId
1. USPerformanceDBA
plurals
PostLinksPostIdRelatedPostId
1. This table or related slice is empty.
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. POHow to identify duplicate items gathered from multiple feeds and link to them in a Database
 singulars
 PostTypePostTypeId
 PTQuestion
PostsParentIdCreationDate
1. This table or related slice is empty.
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
2. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTAcceptedByOriginator
CommentsPostId
1. This table or related slice is empty.
