StackOverflow2013


plurals

POSafely normalizing data via SQL query
primarykey
Id
987893
data
AcceptedAnswerId
988189
AnswerCount
8
ClosedDate
CommentCount
2
CommunityOwnedDate
CreationDate
2009-06-12T17:18:06.497
FavoriteCount
2
LastActivityDate
2016-01-05T14:48:53.397
LastEditDate
2009-06-13T10:22:57.250
LastEditorUserId
41619
OwnerUserId
41619
ParentId
0
PostTypeId
1
Score
4
ViewCount
9502
LastEditorDisplayName
text
Body
Suppose I have a table of customers:
<pre><code>CREATE TABLE customers (
    customer_number INTEGER,
    customer_name VARCHAR(...),
    customer_address VARCHAR(...)
)
</code></pre>
This table does not have a primary key. However, <code>customer_name</code> and <code>customer_address</code> should be unique for any given <code>customer_number</code>. It is not uncommon for this table to contain many duplicate customers. To get around this duplication, the following query is used to isolate only the unique customers:
<pre><code>SELECT DISTINCT customer_number, customer_name, customer_address
FROM customers
</code></pre>
Fortunately, the table has traditionally contained accurate data. That is, there has never been a conflicting <code>customer_name</code> or <code>customer_address</code> for any <code>customer_number</code>. However, suppose conflicting data did make it into the table. I wish to write a query that will fail, rather than returning multiple rows for the <code>customer_number</code> in question. For example, I tried this query with no success:
<pre><code>SELECT customer_number, DISTINCT(customer_name, customer_address)
FROM customers
GROUP BY customer_number
</code></pre>
Is there a way to write such a query using standard SQL? If not, is there a solution in Oracle-specific SQL?
EDIT: The rationale behind the bizarre query:
Truth be told, this customers table does not actually exist (thank goodness). I created it hoping that it would be clear enough to demonstrate the needs of the query. However, people are (fortunately) catching on that the need for such a query is the least of my worries, based on that example. Therefore, I must now peel away some of the abstraction and hopefully restore my reputation for suggesting such an abomination of a table...
I receive a flat file containing invoices (one per line) from an external system.
I read this file, line-by-line, inserting its fields into this table:
<pre><code>CREATE TABLE unprocessed_invoices (
    invoice_number INTEGER,
    invoice_date DATE,
    ...
    -- other invoice columns
    ...
    customer_number INTEGER,
    customer_name VARCHAR(...),
    customer_address VARCHAR(...)
)
</code></pre>
As you can see, the data arriving from the external system is denormalized. That is, the external system includes both the invoice data and its associated customer data on the same line. Multiple invoices may share the same customer, so duplicate customer data is possible. The system cannot begin processing the invoices until all customers are guaranteed to be registered with the system. Therefore, the system must identify the unique customers and register them as necessary. This is why I wanted the query: I was working with denormalized data I had no control over.
<pre><code>SELECT customer_number, DISTINCT(customer_name, customer_address)
FROM unprocessed_invoices
GROUP BY customer_number
</code></pre>
Hopefully this helps clarify the original intent of the question.
EDIT: Examples of good/bad data
To clarify: <code>customer_name</code> and <code>customer_address</code> only have to be unique for a particular <code>customer_number</code>.
<pre><code> customer_number | customer_name | customer_address
----------------------------------------------------
               1 | 'Bob'         | '123 Street'
               1 | 'Bob'         | '123 Street'
               2 | 'Bob'         | '123 Street'
               2 | 'Bob'         | '123 Street'
               3 | 'Fred'        | '456 Avenue'
               3 | 'Fred'        | '789 Crescent'
</code></pre>
The first two rows are fine because it is the same <code>customer_name</code> and <code>customer_address</code> for <code>customer_number</code> 1. The middle two rows are fine because it is the same <code>customer_name</code> and <code>customer_address</code> for <code>customer_number</code> 2 (even though another <code>customer_number</code> has the same <code>customer_name</code> and <code>customer_address</code>).
The last two rows are not okay because there are two different <code>customer_address</code>es for <code>customer_number</code> 3. The query I am looking for would fail if run against all six of these rows. However, if only the first four rows actually existed, the query should return:
<pre><code> customer_number | customer_name | customer_address
----------------------------------------------------
               1 | 'Bob'         | '123 Street'
               2 | 'Bob'         | '123 Street'
</code></pre>
I hope this clarifies what I meant by "conflicting <code>customer_name</code> and <code>customer_address</code>". They have to be unique per <code>customer_number</code>.
I appreciate those who are explaining how to properly import data from external systems. In fact, I am already doing most of that. I purposely hid all the details of what I'm doing so that it would be easier to focus on the question at hand. This query is not meant to be the only form of validation. I just thought it would make a nice finishing touch (a last defense, so to speak). This question was simply designed to investigate just what was possible with SQL. :)
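One way to get the "fail rather than return conflicting rows" behaviour described above is to let a key constraint do the rejecting. The following is only a minimal sketch, not the asker's actual setup: it uses Python's built-in `sqlite3` module instead of Oracle, trims `unprocessed_invoices` down to the three customer columns, and introduces a hypothetical `registered_customers` table whose primary key is `customer_number`. The names `register_customers`, `GOOD_ROWS` and `BAD_ROWS` are likewise made up for illustration.

```python
import sqlite3

# Sample data from the question: the first four rows are consistent,
# the last two give customer_number 3 two different addresses.
GOOD_ROWS = [
    (1, "Bob", "123 Street"),
    (1, "Bob", "123 Street"),
    (2, "Bob", "123 Street"),
    (2, "Bob", "123 Street"),
]
BAD_ROWS = [
    (3, "Fred", "456 Avenue"),
    (3, "Fred", "789 Crescent"),
]


def register_customers(rows):
    """Normalize the given rows; raises sqlite3.IntegrityError on conflicting data."""
    conn = sqlite3.connect(":memory:")
    try:
        # Staging table, reduced to the customer columns (an assumption for brevity).
        conn.execute(
            "CREATE TABLE unprocessed_invoices ("
            " customer_number INTEGER, customer_name TEXT, customer_address TEXT)"
        )
        conn.executemany("INSERT INTO unprocessed_invoices VALUES (?, ?, ?)", rows)
        # Hypothetical target table: the primary key is what rejects conflicts.
        conn.execute(
            "CREATE TABLE registered_customers ("
            " customer_number INTEGER PRIMARY KEY,"
            " customer_name TEXT, customer_address TEXT)"
        )
        # Exact duplicates collapse under DISTINCT; two different rows for the
        # same customer_number survive it and violate the primary key.
        conn.execute(
            "INSERT INTO registered_customers"
            " SELECT DISTINCT customer_number, customer_name, customer_address"
            " FROM unprocessed_invoices"
        )
        return conn.execute(
            "SELECT customer_number, customer_name, customer_address"
            " FROM registered_customers ORDER BY customer_number"
        ).fetchall()
    finally:
        conn.close()
```

Calling `register_customers(GOOD_ROWS)` returns one row per customer, while `register_customers(GOOD_ROWS + BAD_ROWS)` raises `sqlite3.IntegrityError`. The same idea should carry over to any DBMS that enforces primary keys, though the error raised will differ.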
Tags
<sql><denormalization>
Title
Safely normalizing data via SQL query
singulars
PostAcceptedAnswerId
1. PO
 singulars
 PostTypePostTypeId
 PTAnswer
PostParentId
1. This table or related slice is empty.
PostTypePostTypeId
1. PTQuestion
UserLastEditorUserId
1. USAdam Paynter
UserOwnerUserId
1. USAdam Paynter
plurals
PostLinksPostIdRelatedPostId
1. This table or related slice is empty.
PostLinksRelatedPostIdPostId
1. PL
 singulars
 LinkTypeLinkTypeId
 LTLinked
PostsAcceptedAnswerId
1. This table or related slice is empty.
PostsParentIdCreationDate
1. PO
 singulars
 PostTypePostTypeId
 PTAnswer
2. PO
 singulars
 PostTypePostTypeId
 PTAnswer
3. PO
 singulars
 PostTypePostTypeId
 PTAnswer
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 POSafely normalizing data via SQL query
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
2. VO
 singulars
 PostPostId
 POSafely normalizing data via SQL query
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
3. VO
 singulars
 PostPostId
 POSafely normalizing data via SQL query
 UserUserId
 USIain Samuel McLean Elder
 VoteTypeVoteTypeId
 VTFavorite
CommentsPostId
1. COWhat do you mean by "fail, rather than returning multiple rows"? Usually when I think of a SQL query failing, that means I got no rows, or a product of the tables that I'm joining. I thought you were looking for a select * from (select count(*) as cnt, customer_number, customer_name, customer_address From customers group by customer_number, customer_name, customer_address) where cnt > 1 type of query.
 singulars
 PostPostId
 POSafely normalizing data via SQL query
 UserUserId
 USCharlie
2. COBy "fail", I meant the DBMS should return an error rather than results (like when you query a non-existent table). I understand that I can use the "SELECT COUNT(*) ... GROUP BY ... HAVING ..." query to identify the presence of conflicting data, however I was simply curious to see if a query similar to the one I described actually existed. Thanks for the comment, though! :)
 singulars
 PostPostId
 POSafely normalizing data via SQL query
 UserUserId
 USAdam Paynter
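The "SELECT COUNT(*) ... GROUP BY ... HAVING ..." check mentioned in the comments could look roughly like this. It is a sketch under assumptions: it runs on Python's built-in `sqlite3` rather than Oracle, the `VARCHAR(...)` columns are simplified to `TEXT`, and `CONFLICT_QUERY` and `conflicting_customer_numbers` are hypothetical names. It reports the offending `customer_number` values instead of making the DBMS raise an error.

```python
import sqlite3

# Collapse exact duplicates first, then flag any customer_number that still
# has more than one distinct (name, address) combination.
CONFLICT_QUERY = """
SELECT customer_number
FROM (SELECT DISTINCT customer_number, customer_name, customer_address
      FROM customers) d
GROUP BY customer_number
HAVING COUNT(*) > 1
ORDER BY customer_number
"""


def conflicting_customer_numbers(rows):
    """Return the customer numbers that carry conflicting name/address data."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE customers ("
            " customer_number INTEGER, customer_name TEXT, customer_address TEXT)"
        )
        conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)
        return [row[0] for row in conn.execute(CONFLICT_QUERY)]
    finally:
        conn.close()
```

An empty result means the plain `SELECT DISTINCT` is safe to use. Note that `DISTINCT` treats NULLs as equal to each other, so a NULL address next to a real one would be reported as a conflict.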
