Friday, December 7, 2007

When is something too DISTINCT?

Lets say you have a database product table with the following columns: id (numeric key), name (text), type (text), distributer(foreign key).

Lets also say you need to find all of the product types for a given category. Oh, and you need a candidate product_id for each product type (doesn't matter which one, but it must be a key of a record who has that type).

What is the best way to write a SQL query to find this information? You can interpret "best" to mean "most efficient", "most readable" or whatever you want.


As an explicit example:

Given the following data:
ID - NAME - TYPE - CATEGORY
1 - Super Truck - Toy - 100
2 - Cool Doll - Doll - 100
3 - Finger Paint - Drawing - 100
4 - Lame Block - Toy - 222
5 - Stinky Doll - Doll - 100

The query should return the following:
ID - TYPE
1 - TOY
(2 or 5) - Doll
3 - Drawing

I will post my "hack" later.

3 comments:

mrLee said...

I'm not sure if I would call this response the most anything, but the choices seem resonable. I worked off the assumption that the Products.ID field was a primary key. Even if you mean for your Products.type field to be a foreign key, you don't need to join it to an implied Product_Types table to produce the query. Here, your foreign key example is descriptive enough that one need not rejoin to the Product_Types to be able to describe each type. So my method doesn't even depend on the field Product.type even being a foreing key.
-----------------------------
If you wrote it like this:

SELECT Products.ID, Products.type FROM Products WHERE (Products.category="100") GROUP BY Products.ID, Products.type ORDER BY Products.type;

it would keep the ID number in the output, so it would still give a separate line like your example output for both ID=2 and ID=5, both of type=Doll

type ID
Doll 2
Doll 5
Drawing 3
Toy 1
-------------------------------------------------------
and if you wanted to label the output on each line, to remind yourself which category you selected (here category="100"), you could include that too by including the category field in both your SELECT and your GROUP BY clauses

SELECT Products.category, Products.type, Products.ID FROM Products WHERE (Products.category="100") GROUP BY Products.category, Products.type, Products.ID;

category type ID
100 Doll 2
100 Doll 5
100 Drawing 3
100 Toy 1
-------------------------------------------------------
I don't see why you would want both dolls to list separately if all you really wanted to know is which types of products occur in a given category, so I would think to write

SELECT Products.category, Products.type FROM Products WHERE (Products.category="100")
GROUP BY Products.category, Products.type;

which will produce

category type
100 Doll
100 Drawing
100 Toy
-------------------------------------------------------
or perhaps count how many products of each type (in this example this equates to counting records in the Products table), so you don't lose the fact that there were two doll-type products in this category, by writing

SELECT Products.category, Products.type, Count(*) AS [# of Products] FROM Products
WHERE (Products.category="100") GROUP BY Products.category, Products.type;

to get

category type # of Products
100 Doll 2
100 Drawing 1
100 Toy 1

mrLee said...

of course, I missed your clue entirely in the question verbage, and I had forgotten about the DISTINCT keyword. All my SQL knowledge comes from years of practical use building queries on small PC dbases, not on large dbase apps like ORACLE, so I usually used the grid query design tools and didn't even need to know the language. It was only afterwards that I took an interest in the SQL code that the grid environment built. The DISTINCT keyword is not even available in MS Access unless one goest thru the unusual query properties dialogue box available by right clicking in an unexpected area of the grid environment.

Nonetheless, one could generate the simple result (no counts) with cleaner code using

SELECT DISTINCT Products.category, Products.type FROM Products WHERE (Products.category="100")

No idea what statement is actually more efficient, but my gut says this one is. The GROUP BY clause must invoke a more involved routine to stack each distinct version on the grouped fields, while keeping each individual occurence until later, in prep for the user wanting to count or do some other aggregate function on each group. (e.g. both types of dolls, so it is prepared to average or count something about each record group). Of course, when I omit any aggregate function result as a field, it can then collapse each group to the same result the the DISTINCT keyword causes, so I suspect this last post is the most efficient approach

Benjamin P Lee said...

well, you can be happy that your thought process mirrored my own except that you lost site that I need the "distinct" rows AND a single candidate ID

Therefore, the result of the perfect query on my original data would have been:

ID - TYPE
1 - TOY
(2 xor 5) - Doll (2 or 5 but not both)
3 - Drawing

I have read in several places (none quoting real proof) that the DISTINCT keyword is equivelant to a GROUP BY with all columns selected.

The problem with both of these aproaches is that if you include the ID in your distinction clause, EVERY SINGLE row is then distinct b/c the ID is a unique primary key!

This is what bothered me. What I wanted was the ability to say that I wanted columns X, Y and Z and Oh, By the way .... I don't want any rows with the same Y as another row and you (the DB) can pick whichever row you want to return.

I can see that this "could" be a hard request or at the very least one that is not often needed or is not easily implented with other internal RDMS systems ... but it doesn't seem too outrages. Does it?

My final solution b/c it worked and I didn't want to keep thinking about it and was similar to the the following:

SELECT TYPE, MIN(ID)
FROM PRODUCTS
WHERE CATEGORY = "100"
GROUP BY TYPE


This worked for me b/c all I wanted was the unique field and another column which happened to be a numeric field. But what if you wanted all unique types returned as a candidate row with other data, or if your other data wasn't a number.

If anyone can think of a better solution that doesn't result in a very complicated query (given the simplicity of the business requirement) and/or returns more data than needed and requires post processing ... I would love to hear it.