If you regularly code queries and have never been
introduced to the windowing functions, then
you are in for a treat. I've been meaning to write
about these for over a year, and now it's time to get
down to it.
Support in Major Servers
SQL Server calls these functions
>Ranking Functions.
PostgreSQL supports a wider range of functions
than MS SQL Server, having put them in at
8.4, and PostgreSQL and calls them
>Window Functions.
Oracle's support is broader (by a reading of the docs)
than SQL Server or PostgreSQL, and they call them
>Analytic Functions.
I try to stay away from MySQL, but I did a quick Google on
all three terms and came up with a few forum posts asking
when and if they will be supported.
The NTILE() Function
In this post we are going to look at NTILE, a cool function
that allows you to segment query results into groups and
put numbers onto them. The name is easy to remember because
it can create any -tile, a percentile, a decile, or anything
else. In short, an n-tile. But it is much easier to
understand with an example, so let's go right to it.
Finding percentiles
Consider a table of completed sales, perhaps on an eCommerce site.
The Sales Manager would like them divided up into quartiles,
four equally divided groups, and she wants the average and
maximum sale in each quartile. Let's say the company is not
exactly hopping, and there are only twelve sales, which is good
because we can list them all for the example. If we already
had the quartiles provided then the query would be easy, so if
we were lucky enough to be starting with this:
CUSTTYPE | AMOUNT | QUARTILE
----------+---------+----------
RETAIL | 78.00 | 1
RETAIL | 234.00 | 1
DEALER | 249.00 | 1
DEALER | 278.00 | 2
RETAIL | 392.00 | 2
RETAIL | 498.00 | 2
DEALER | 500.00 | 3
RETAIL | 738.00 | 3
DEALER | 1250.00 | 3
RETAIL | 2029.00 | 4
RETAIL | 2393.00 | 4
RETAIL | 3933.00 | 4
The query would be child's play if we already
had the quartile:
Select quartile
, avg(amount) as avgAmount
, max(amount) as maxAmount
FROM ORDERS
GROUP BY quartile
ORDER BY quartile
The Problem is We Do Not Have Quartile
The problem of course is that we do not usually
have handy columns like QUARTILE provided, but
we can generate the QUARTILE column during the
query by using NTILE.
Select quartile
, avg(amount) as avgAmount
, max(amount) as maxAmount
FROM (
-- The subquery is necessary
-- to process all rows and add the quartile column
SELECT amount
, ntile(4) over (order by amount) as quartile
FROM ORDERS
) x
GROUP BY quartile
ORDER BY quartile
This query will give us what the Sales Manager wants.
Dissecting the Function and The OVER Clause
The NTILE() function takes a single argument, which tells
the server how many groups to divide the data into. If
there are not an exact number of rows in each group, the
server decides which groups will be missing one row. So
in an exact case all of your groups have the same count of
rows, but when it does not divide evenly, one or more of them
will be one row short.
If you pass 100 to NTILE(), you get a percentile. If you
pass 10, you get a decile, and so forth.
The magic is in the OVER() function. This supports two clauses,
and the example shows one, the ORDER BY. Quite simply, the
ORDER BY clause tells the server how to line up the rows when
adding the NTILE values. The clause is very flexible, and has
nothing to do with your query's overall ORDER BY clause.
The Second Clause: PARTITION
Now we will pretend the Sales Manager is not satisfied, and
wants separate numbers for the two Customer Types. We could
do this if the NTILE() function would create two sets
of quartiles, one for each Customer Type, like so:
CUSTTYPE | AMOUNT | QUARTILE
----------+---------+----------
DEALER | 249.00 | 1
DEALER | 278.00 | 2
DEALER | 500.00 | 3
DEALER | 1250.00 | 4
RETAIL | 78.00 | 1
RETAIL | 234.00 | 1
RETAIL | 392.00 | 2
RETAIL | 498.00 | 2
RETAIL | 738.00 | 3
RETAIL | 2029.00 | 3
RETAIL | 2393.00 | 4
RETAIL | 3933.00 | 4
We can do this by using the PARTITION BY clause,
which tells the server to break the rows into
groups and apply the NTILE() numbering separately
within each group. The new query would be this:
Select custtype
, quartile
, avg(amount) as avgAmount
, max(amount) as maxAmount
FROM (
-- The subquery is necessary
-- to process all rows and add the quartile column
SELECT amount
, ntile(4) over (partition by custtype
order by amount) as quartile
FROM ORDERS
) x
GROUP BY custtype,quartile
ORDER BY custtype,quartile
Bonus Points: The Median
Now once again the Sales Manager, who is never satisified,
comes down and says that the average is no good, she
needs the max and the median sale value within each quartile.
To keep it simple, she does not need this broken out
by customer type, it can be applied to the entire set.
This is a case where we can use NTILE() twice. The first
time we will break all sales up into four groups, to get
the quartiles, and then we will break up each quartile into
two groups to get the median. The code looks like this:
Select quartile
, max(case when bitile=1 then amount else 0 end) as medAmount
, max(amount) as maxAmount
FROM (
-- The second pass adds the
-- 2-tile value we will use to find medians
SELECT quartile
, amount
, ntile(2) over (partition by quartile
order by amount) as bitile
FROM (
-- The subquery is necessary
-- to process all rows and add the quartile column
SELECT amount
, ntile(4) over (order by amount) as quartile
FROM ORDERS
) x1
) x2
GROUP BY quartile
ORDER BY quartile
The magic here is that we know we've divided the data
evenly into four sets, so the median will be the maximum
value half way through each set. In other words, it will be the
maximum value when the value of bitile=1 for each quartile.
One More Note About Oracle
Once you get down the basics of the OVER clause, Oracle
looks really good, because they support the clause over
the largest range of functions, at least going by the
respective doc pages for each platform.
No comments:
Post a Comment