informatique -solutions series: SQL SELECT

Showing posts with label SQL SELECT. Show all posts

Wednesday, December 1, 2010

The Really Cool NTILE() Window Function

If you regularly code queries and have never been
introduced to the windowing functions, then
you are in for a treat. I've been meaning to write
about these for over a year, and now it's time to get
down to it.

Support in Major Servers

SQL Server calls these functions
>Ranking Functions.

PostgreSQL supports a wider range of functions
than MS SQL Server, having put them in at
8.4, and PostgreSQL and calls them
>Window Functions.

Oracle's support is broader (by a reading of the docs)
than SQL Server or PostgreSQL, and they call them
>Analytic Functions.

I try to stay away from MySQL, but I did a quick Google on
all three terms and came up with a few forum posts asking
when and if they will be supported.

The NTILE() Function

In this post we are going to look at NTILE, a cool function
that allows you to segment query results into groups and
put numbers onto them. The name is easy to remember because
it can create any -tile, a percentile, a decile, or anything
else. In short, an n-tile. But it is much easier to
understand with an example, so let's go right to it.

Finding percentiles

Consider a table of completed sales, perhaps on an eCommerce site.
The Sales Manager would like them divided up into quartiles,
four equally divided groups, and she wants the average and
maximum sale in each quartile. Let's say the company is not
exactly hopping, and there are only twelve sales, which is good
because we can list them all for the example. If we already
had the quartiles provided then the query would be easy, so if
we were lucky enough to be starting with this:


 CUSTTYPE | AMOUNT  | QUARTILE
----------+---------+----------
 RETAIL   |   78.00 |   1
 RETAIL   |  234.00 |   1
 DEALER   |  249.00 |   1
 DEALER   |  278.00 |   2
 RETAIL   |  392.00 |   2
 RETAIL   |  498.00 |   2
 DEALER   |  500.00 |   3
 RETAIL   |  738.00 |   3
 DEALER   | 1250.00 |   3
 RETAIL   | 2029.00 |   4
 RETAIL   | 2393.00 |   4
 RETAIL   | 3933.00 |   4

The query would be child's play if we already
had the quartile:


Select quartile
     , avg(amount) as avgAmount
     , max(amount) as maxAmount
  FROM ORDERS
 GROUP BY quartile
 ORDER BY quartile

The Problem is We Do Not Have Quartile

The problem of course is that we do not usually
have handy columns like QUARTILE provided, but
we can generate the QUARTILE column during the
query by using NTILE.


Select quartile
     , avg(amount) as avgAmount
     , max(amount) as maxAmount
  FROM (
        -- The subquery is necessary
        -- to process all rows and add the quartile column
        SELECT amount
             , ntile(4) over (order by amount) as quartile
          FROM ORDERS
       ) x
 GROUP BY quartile
 ORDER BY quartile

This query will give us what the Sales Manager wants.

Dissecting the Function and The OVER Clause

The NTILE() function takes a single argument, which tells
the server how many groups to divide the data into. If
there are not an exact number of rows in each group, the
server decides which groups will be missing one row. So
in an exact case all of your groups have the same count of
rows, but when it does not divide evenly, one or more of them
will be one row short.

If you pass 100 to NTILE(), you get a percentile. If you
pass 10, you get a decile, and so forth.

The magic is in the OVER() function. This supports two clauses,
and the example shows one, the ORDER BY. Quite simply, the
ORDER BY clause tells the server how to line up the rows when
adding the NTILE values. The clause is very flexible, and has
nothing to do with your query's overall ORDER BY clause.

The Second Clause: PARTITION

Now we will pretend the Sales Manager is not satisfied, and
wants separate numbers for the two Customer Types. We could
do this if the NTILE() function would create two sets
of quartiles, one for each Customer Type, like so:


 CUSTTYPE | AMOUNT  | QUARTILE
----------+---------+----------
 DEALER   |  249.00 |   1
 DEALER   |  278.00 |   2
 DEALER   |  500.00 |   3
 DEALER   | 1250.00 |   4
 RETAIL   |   78.00 |   1
 RETAIL   |  234.00 |   1
 RETAIL   |  392.00 |   2
 RETAIL   |  498.00 |   2
 RETAIL   |  738.00 |   3
 RETAIL   | 2029.00 |   3
 RETAIL   | 2393.00 |   4
 RETAIL   | 3933.00 |   4

We can do this by using the PARTITION BY clause,
which tells the server to break the rows into
groups and apply the NTILE() numbering separately
within each group. The new query would be this:


Select custtype
     , quartile
     , avg(amount) as avgAmount
     , max(amount) as maxAmount
  FROM (
        -- The subquery is necessary
        -- to process all rows and add the quartile column
        SELECT amount
             , ntile(4) over (partition by custtype
                                 order by amount) as quartile
          FROM ORDERS
       ) x
 GROUP BY custtype,quartile
 ORDER BY custtype,quartile

Bonus Points: The Median

Now once again the Sales Manager, who is never satisified,
comes down and says that the average is no good, she
needs the max and the median sale value within each quartile.
To keep it simple, she does not need this broken out
by customer type, it can be applied to the entire set.

This is a case where we can use NTILE() twice. The first
time we will break all sales up into four groups, to get
the quartiles, and then we will break up each quartile into
two groups to get the median. The code looks like this:


Select quartile
     , max(case when bitile=1 then amount else 0 end) as medAmount
     , max(amount) as maxAmount
  FROM (
        -- The second pass adds the
        -- 2-tile value we will use to find medians
        SELECT quartile
             , amount
             , ntile(2) over (partition by quartile
                                  order by amount) as bitile
          FROM (
                -- The subquery is necessary
                -- to process all rows and add the quartile column
                SELECT amount
                     , ntile(4) over (order by amount) as quartile
                  FROM ORDERS
               ) x1
       ) x2
 GROUP BY quartile
 ORDER BY quartile

The magic here is that we know we've divided the data
evenly into four sets, so the median will be the maximum
value half way through each set. In other words, it will be the
maximum value when the value of bitile=1 for each quartile.

One More Note About Oracle

Once you get down the basics of the OVER clause, Oracle
looks really good, because they support the clause over
the largest range of functions, at least going by the
respective doc pages for each platform.

Saturday, November 6, 2010

Recursive Queries with Common Table Expressions

This week The Database Programmer returns after almost 18 months
with an entry on using Common Table Expressions (CTEs) to do
recursive queries. Relational databases were plagued from their
inception with a lack of meaningful treatment for recursive
operations, and CTEs have finally plugged that hole.

Common Table Expressions appeared in SQL Server 2005, and in
PostgreSQL 8.4, and are also available in Oracle. As for
mySQL, since I don't use it, I did a quick Google search and
looked at the Docs for 5.5, and couldn't really find
anything. I generally tend to assume mySQL cannot do the
tough stuff.

But First, A Rant

There have always been plenty of people who claimed SQL was
a bloated and clumsy language to work with. Most of the time
I tend to agree, but I find the advantages of relational/SQL
system to be so large that I'm willing to pay that price.

But with Commom Table Expressions (CTEs) I just can't help
drifting into conspiracy theories involving the enemies of
SQL infiltrating the committees and deliberately suggesting
the most twisted, bloated, complicated way they could think
of to do what is really a very basic operation. In other
words, I am profoundly unimpressed with the syntax of CTEs,
but as long as they are here and they work, we'll go along.

The Basic Example

Your basic recursive table contains a foreign key to itself,
so that some rows in the table are children of some other
row in the table. This recursion can nest to any depth,
and the chart below shows a very simple example:


Primary_key   |   Parent_Key  |  Notes  
--------------+---------------+---------------------------
     A        |     null      |   top level row, no parent
     B        |      A        |   first level child of A
     C        |      B        |   child of B, grandchild
              |               |   of A
     D        |      C        |   child of C, grandchild 
              |               |   of B, great-grandchild 
              |               |   of A
     X        |     null      |   top level row, no parent
     Y        |      X        |   child of X
     Z        |      Y        |   child of Y, grandchild 
              |               |   of X

What we want is a query that can return a given row
and all of its children out to any level, including
helpful information about the structure of the recursion,
something like this:


Primary_key | Level | Top_Key | Immediate_Parent_key 
------------+-------+---------+-----------------------
     A      |   1   |  A      | null
     B      |   2   |  A      | A    
     C      |   3   |  A      | B   
     D      |   4   |  A      | C
     X      |   1   |  X      | null
     Y      |   2   |  X      | X   
     Z      |   3   |  X      | Y

And Another Rant

At this point the mind boggles at how long this blog entry
needs to be to explain this simple operation. But lets
get going anyway.

The First Step and Last Step

A Common Table Expression begins with the "WITH" clause
and ends with a standard SQL Select:


;WITH myCTEName (primary_key,level,top_key,immediate_parent_key)
as (
  ....we'll get to this below....
)
select * from myCTEName

The basic idea is that we are going to define a CTE with
a name and a list of columns, and then SELECT out of it.
Without that final SELECT statement the CTE does not actually
do anything. The SELECT can also be arbitrarily complex, doing
aggregrations, WHERE clauses and anything else you might need.

The first thing to notice is the leading semi-colon. This is
a trick adopted by MS SQL Server users. SQL Server does not
require statements to be terminated with a semi-colon, but a
SQL Server CTE requires the previous statement to have
been terminated with a semi-colon (nice huh?). So SQL Server
programmers adopted the strategy of starting the CTE with
a semi-colon, which keeps the syntactical requirement with
the CTE, where it belongs.

A given CTE sort of has a name. That is, you have to name
it something, but think of it as a table alias in a SQL SELECT,
such as "Select * from myTable a JOIN otherTable b...", it
exists only during the execution of the statement.

The columns listed in the parantheses can have any names
(at least in SQL Server). But these column names are what
you will refer to in the final SQL SELECT statement.

Coding The Inside of the CTE, Step 1

Now we code the inside of the CTE in two steps.
The first step is called the "anchor", and it is a
straightforward query to find the top-level rows:


;WITH myCTEName (primary_key,level,top_key,immediate_parent_key)
as (
    select primary_key   as primary_key
         , 1             as level
         , primary_key   as top_key
         , null          as immediate_parent_key
      from myRecursiveTable
     where Parent_key is null
)
select * from myCTEName

This should be self-explanatory, we are querying only for
rows that have no parent (WHERE Parent_key is null) and we
are hardcoding the "level" column to 1, and we are also
hardcoding the "immediate_parent_key" column to null.

This query alone would return two of the rows from
our desired output:


Primary_key | Level | Top_Key | Immediate_Parent_key 
------------+-------+---------+-----------------------
     A      |   1   |  A      | null
     X      |   1   |  X      | null

Coding The Inside of the CTE, Step 2

Now we are going to add the actual recursion. When I first
learned CTEs this was the hardest part to figure out, because it
turned out my hard-won set-oriented thinking was actually slowing me
down, I had to think like a procedural programmer when defining
the second half of the query.


;WITH myCTEName (primary_key,level,top_key,immediate_parent_key)
as (
    select primary_key,1,primary_key,null
      from myRecursiveTable
     where Parent_key is null
    UNION ALL
    select chd.primary_key,par.level+1,par.top_key,chd.parent_key
      FROM myCTEName        par
      JOIN myRecursiveTable chd ON chd.parent_key = par.primary_key
)
select * from myCTEName

Thinking step-wise, here is what is going on under the hood:

The server executes the "anchor" query, generating a
result set called "myCTEName" containing just the
top level rows.
The server then executes the second query. At this
point the result set "myCTEName" exists and can be
referenced, so that you can link children to their
parents. (That's why you see the JOIN)
Step 2 is repeated recursively, adding grand-children,
great-grand-children, and so on, until no more rows
are being added, at which point it stops, and...
The final result set is passed to the trailing
SELECT, which pulls results out of "myCTEName"
as if it were a table or view.

So when we code the 2nd part of the inside of the CTE, the
part after the UNION ALL, act as if the first query has
already run and produced a table/view called "myCTEName"
that can be referenced. Once you understand that, the
query is pretty easy to understand:

The "From myCTEName par" clause tells us we are pulling
from the previously generated set. I like to use the alias
"par" for "parent" to remind myself that the prior result is
the parent row.
We then join to the original source table and use the
alias "chd" to remind ourselves we are pulling child rows
from there. The "ON chd.parent_key = par.primary_key"
defines how children are joined to parents.
Our first column, "chd.primary_key", is the unique
key for the results.
Our second column, "par.level+1" gives us a nifty
automatically incremented "level" column.
Our third column, "par.top_key" ensures that all rows
contain a reference to their top-most parent.
Our final column, "chd.parent_key", makes sure each
row contains a reference to its immediate parent.

Finding Various Statistics

Once you have the inside of the CTE coded, the fun part
moves to the final SELECT, which is operating on the
complete set of results. You do not necessarily have to pull
the complete list. For instance, you may want to find out
the maximum nesting level for each parent, or the count
of children for each parent:


;WITH myCTEName (primary_key,level,top_key,immediate_parent_key)
as (
    select primary_key,1,primary_key,null
      from myRecursiveTable
     where Parent_key is null
    UNION ALL
    select chd.primary_key,par.level+1,par.top_key,chd.parent_key
      FROM myCTEName        par
      JOIN myRecursiveTable chd ON chd.parent_key = par.primary_key
)
select top_key
     , max(level) as maxNestingLevel
     , count(*)   as countRows
     , count(*)-1 as countChildren
  from myCTEName

Conclusion

Common Table Expressions give SQL-based databases the (very)
long-needed ability to execute recursive queries, albeit with
a rather complex syntax. Once you grasp the basics of how
to code them, there are many possible uses that go far beyond
the simple example shown here.