Wednesday, December 22, 2010

A Working Definition of Business Logic, with Implications for CRUD Code

Update: the Second Post of this series is now available.



Update: the Third Post of this series is now available.



The Wikipedia entry on "Business Logic" has a
wonderfully honest opening sentence stating
that "Business logic, or domain logic, is a
non-technical term..." (emphasis mine). If this
is true, that the term
is non-technical, or if you like, non-rigorous,
then most of us spend the better part of our efforts
working on something that does not even have a definition.
Kind of scary.



Is it possible to come up with a decent working
definition of business logic? It is certainly
worth a try. This post is the first in a four
part series. The second post is about
a more rigorous definition of Business Logic.



This blog has two tables of contents, the
Complete Table of Contents and the list
of
Database Skills.



The Method



In this essay we will pursue a method of finding
operations that we can define as business logic
with a minimum of controversy, and identify those
that can likely be excluded with a minimum of
controversy. This may leave a bit of gray area
that can be taken up in a later post.



An Easy Exclusion: Presentation Requirements



If we define Presentation Requirements
as all requirements about "how it looks"
as opposed to "what it is", then we can rule
these out. But if we want to be rigorous
we have to be clear: Presentation Requirements
means things like branding, skinning,
accessibility, any and all formatting, and
anything else that is about the appearance
and not about the actual values fetched from
somewhere.



Tables as the Foundation Layer



Database veterans are likely to agree that your
table schema constitutes the foundation layer of all
business rules. The schema, being the tables,
columns, and keys, determines what must be
provided
and what must be excluded.
If these are not business logic, I guess I don't
know what is.
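

To make the point concrete, here is a minimal sketch using
hypothetical customers and orders tables: every constraint in
the DDL below is a business rule that the server will enforce
no matter which program does the writing.

-- Schema-as-business-rules, hypothetical tables. The NOT NULL,
-- foreign key, and CHECK constraints are business rules that
-- every program writing to these tables must obey.
CREATE TABLE customers (
    custid    integer      PRIMARY KEY,
    custname  varchar(80)  NOT NULL
);

CREATE TABLE orders (
    orderid   integer  PRIMARY KEY,
    custid    integer  NOT NULL REFERENCES customers (custid),
    orderdate date     NOT NULL,
    qty       integer  NOT NULL CHECK (qty > 0)
);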



What about CouchDB and MongoDB and others that do
not require a predefined schema? These systems
give up the advantages of a fixed schema for
scalability and simplicity. I would argue here
that the schema has not disappeared, it has simply
moved into the code that writes documents to the
database. Unless the programmer wants a nightmare
of chaos and confusion, he will enforce some document
structure in the code, and so I still think it safe
to say that even for these databases there is a
schema somewhere that governs what must be
stored and what must be excluded.



So we have at least a foundation for a rigorous
definition of business rules: the schema, be it
enforced by the database itself or by the code,
forms the bottom layer of the business logic.



Processes are Business Logic



The next easy addition to our definition of
business logic would be processes, where
a process can be defined loosely as anything
involving multiple statements, can run without
user interaction, may depend on parameter
tables, and may take longer
than a user is willing to wait, requiring background
processing.



I am sure we can all agree this is business logic,
but as long as we are trying to be rigorous, we
might say it is business logic because:



  • It must be coded
  • The algorithm(s) must be inferred from the requirements
  • It is entirely independent of Presentation Requirements


Calculations are Business Logic



We also should be able to agree that calculated
values like an order total, and the total after
tax and freight, are business logic. These are
things we must code in order to take user-supplied
values and complete the picture.



The reasons are the same as for processes, they
must be coded, the formulas must often be inferred
from requirements (or forced out of The Explainer
at gunpoint), and the formulas are completely
independent of Presentation Requirements.



The Score So Far



So far we have excluded "mere" Presentation
Requirements, and included three entries I hope
so far are non-controversial:



  • Schema
  • Processes
  • Calculations


These are three things that some programmer must
design and code: the schema, either in a
conventional relational database or in application
code; the processes, which definitely must be
coded; and the calculations, which also have to be
coded.



What Have We Left Out?



Plenty. At very least security and notifications.
But let's put those off for another day and see
how we might handle what we have so far.



For the Schema, I have already mentioned that you
can either put it into a Relational database or
manage it in application code when using a "NoSQL"
database. More than that will have to wait for
2011, when I am hoping to run a series detailing
different ways to implement schemas. I'm kind of
excited to play around with CouchDB or MongoDB.



For processes, I have a separate post that examines the implications
of the stored procedure route, the embedded SQL route,
and the ORM route.



This leaves calculations. Let us now see how
we might handle calculations.



Mixing CRUD and Processes



But before we get to CRUD, I should state that
if your CRUD code involves processes,
seek professional help immediately.
Mixing processes into CRUD is an extremely
common design error, and it can be
devastating. It can be recognized
when somebody says, "Yes, but when the salesman
closes the sale we have to pick this up and move
it over there, and then we have to...."



Alas, this post is running long already and so
I cannot go into exactly how to solve these, but
the solution will always be one of these:



  • Spawning a background job to run the process
    asynchronously. Easy because you don't have to
    recode much, but highly suspicious.
  • Examining why it seems necessary to do so
    much work on what ought to be a single INSERT
    into a sales table, with perhaps a few extra
    rows with some details. Much the better solution,
    but often very hard to see without a second pair
    of eyes to help you out.


So now we can move on to pure CRUD operations.



Let The Arguments Begin: Outbound CRUD



Outbound CRUD is any application code that
grabs data from the database and passes it
up to the Presentation layer.



A fully normalized database
will, in appropriate cases, require business logic
of the calculations variety; otherwise the
display is not
complete and meaningful to the user.
There is really no
getting around it in those cases.



However, a database Denormalized With
Calculated Values
requires no business logic
for outbound CRUD: it only has to pick up what
is asked for and pass it up. This is the route
I prefer myself.



Deciding whether or not to include denormalized
calculated values has heavy implications for
the architecture of your system, but before we
see why, we have to look at inbound CRUD.



Inbound CRUD



Inbound CRUD, in terms of business logic, is
the mirror image of outbound. If your
database is fully normalized, inbound CRUD
should be free of business logic, since it
is simply taking requests and pushing them to
the database. However, if you are denormalizing
by adding derived values, then it has to be
done on the way in, so inbound CRUD code must
contain business logic code of the calculations
variety.



Now let us examine how the normalization
question affects system architecture and
application code.



Broken Symmetry



As stated above, denormalizing by including
derived values forces calculated business
logic on the inbound path, but frees your
outbound path to be the "fast lane".
The opposite decision, not storing calculated
values, allows the inbound path to be the "fast lane"
and forces the calculations into the outbound
path.



The important conclusion is: if you have business
logic of the calculation variety in both lanes
then you may have some inconsistent practices,
and there may be some gain involved in sorting
those out.



But the two paths are not perfectly symmetric.
Even a fully normalized database will often,
sooner or later, commit those calculated values
to columns. This usually happens when some
definition of finality is met. Therefore, since
the inbound path is more likely to contain calculations
in any case, the two options are not really
balanced. This is one reason why I prefer
to store the calculated values and get them right
on the way in.



One Final Option



When people ask me if I prefer to put business
logic in the server, it is hard to answer without
a lot of information about context. But when
calculations are involved the answer is yes.



The reason is that calculations are incredibly
easy to fit into patterns. The patterns themselves
(almost) all follow foreign keys, since the foreign
key is the only way to correctly relate data between
tables. So you have the "FETCH" pattern, where a
price is fetched from the items table to the cart,
the "EXTEND" pattern, where qty * price = extended_Price,
and various "AGGREGATE" patterns, where totals are
summed up to the invoice. There are others, but it
is surprising how many calculations fall into these
patterns.



Because these patterns are so easy to identify, it
is actually conceivable to code triggers by hand
to do them, but being an incurable toolmaker, I
prefer to have a code generator put them together
out of a data dictionary. More on that around the
first of the year.
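

For illustration, here is a minimal hand-coded trigger of the
kind just described, covering the FETCH and EXTEND patterns.
It assumes hypothetical items(sku, price) and
cart_lines(cart_id, sku, qty, price, extended_price) tables
and uses PostgreSQL syntax; an AGGREGATE trigger summing the
lines up to the cart header would follow the same shape.

-- FETCH: copy the price from the items (master) table.
-- EXTEND: qty * price = extended_price.
CREATE OR REPLACE FUNCTION cart_lines_calc() RETURNS trigger AS $$
BEGIN
    SELECT price INTO NEW.price
      FROM items
     WHERE sku = NEW.sku;

    NEW.extended_price := NEW.qty * NEW.price;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER cart_lines_calc_bi
    BEFORE INSERT OR UPDATE ON cart_lines
    FOR EACH ROW EXECUTE PROCEDURE cart_lines_calc();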



Updates



Update 1: I realize I never made it quite clear that this is part
1, as the discussion so far seems reasonable but is
hardly rigorous (yet). Part 2 will be on the way after I've
fattened up for the holidays.



Update 2: It is well worth following the link Mr. Koppelaars has put in the comments:
http://thehelsinkideclaration.blogspot.com/2009/03/window-on-data-applications.html

Sunday, December 19, 2010

User-Submitted Analysis Topic: Email

Reader Dean Thrasher
of Infovark has submitted
a schema for review and analysis as part of my
User-Submitted Analysis Request series.
Today we are going to take a first look
at what he has. Mr. Thrasher and I both hope that any and all readers
will benefit from the exercise of publicly reviewing the schema.



This particular analysis request is a great start to the series,
because it has to do with email. Everybody uses email so we all
understand at a very basic level what data will be handled.

Brief Introduction to User-Submitted Schemas



Mr. Thrasher and I have exchanged a couple of emails, but we have
avoided any in-depth discussion. Instead, we want to carry out the
conversation on the public blog. So I am not aiming to provide
any "from on high" perfect analysis, instead this essay will contain
a lot of questions and suggestions, and we will then move into the
comments to go forward.



Disclosure: None. We are not paying each other anything, nor have
I received any merchandise that would normally carry a licensing fee.



Today's essay is the very first in the
User-Submitted Analysis Requests series. If you would like to see an analysis
of your schema, follow that link and contact me.



This blog has a Complete Table of Contents and a list
of Database Skills.



Brief Description and Starting Point



To get us started, I am going to quote the Infovark Product Page, and then we will see what we want to zoom in on:

Infovark automatically collects and catalogs your files and email. It consolidates your digital life into a personal wiki based on what it finds. Once you set Infovark to work, it will monitor your computer and keep your web site up-to-date



So we know even before we see anything technical that we are going to have
tables of contacts, emails, phones, addresses, appointments and many other
things pulled in from email systems, plus the value-add provided by the
product.



The Schema As-Is



We are going to start by looking at how the details of a CONTACT
are stored. The schema models contacts with a group of
cross references, aka many-to-many relationships, like so:




CONTACTS +----- CONTACTS-X-EMAILS -------- EMAILADDRESSES
         |
         +----- CONTACTS-X-PHONES -------- PHONES
         |
         +----- CONTACTS-X-ADDRESSES ----- ADDRESSES
         |
         +----- CONTACTS-X-WEBADDRESSES--- WEBADDRESSES


The first thing we have to note is that there is nothing wrong
with this at all. It is fully normalized and so it will be
very easy to make sure that database writes will not produce
anomalies or bad data.



But, not surprisingly, Mr. Thrasher notes this makes for complicated
SELECTS, so we want to ask whether
it is perhaps over-normalized: are there complications
in there that do not need to be there?



Email as A Property of Contact



If I were to
follow my own advice, I would first want to identify the
master tables. Master tables generally represent real things in the
world: people, places, products, services, events.



So my first question is this: is an email address a free-standing
entity in its own right that deserves a master table? Or is it
instead a property of the CONTACT? I am going to suggest that an
email address is a property of a CONTACT, and, since a CONTACT
may have more than one email address, they should be stored in
a child table of the CONTACTS, more like this:




CONTACTS +----- CONTACTS-X-EMAILS -------- EMAILADDRESSES  (replaced below)
         |
         +----- CONTACTEMAILADDRESSES
         |
         +----- CONTACTS-X-PHONES -------- PHONES
         |
         +----- CONTACTS-X-ADDRESSES ----- ADDRESSES
         |
         +----- CONTACTS-X-WEBADDRESSES--- WEBADDRESSES


Whether or not we make this switch depends not on
technical arguments about keys or data types, but on
whether this accurately models reality. If in fact
email addresses are simply properties of contacts, then
this is the simplest way to do it. Going further, the
code that imports and reads the data will be easier to
code, debug and maintain for two reasons: one, because
it is simpler, but more importantly, two, because it
accurately models reality and therefore will be easier
to think about.



If this proves to be the right way to go, it may be
a one-off improvement, or it may repeat itself for
Phones, Addresses, and Web Addresses, but we will take
that up in the next post in the series.



I am going to proceed as if this change is correct, and
ask then how it will ripple through the rest of the
system.



Some Specifics on the Email Addresses Table



The EMAILADDRESSES table currently has these columns:



-- SQL Flavor is Firebird
CREATE TABLE EMAILADDRESS (
  ID INTEGER NOT NULL,
  USERNAME VARCHAR(64) NOT NULL COLLATE UNICODE_CI,
  HOSTNAME VARCHAR(255) NOT NULL COLLATE UNICODE_CI,
  DISPLAYNAME VARCHAR(255) NOT NULL
);

ALTER TABLE EMAILADDRESS
  ADD CONSTRAINT PK_EMAILADDRESS
  PRIMARY KEY (ID);

CREATE TRIGGER BI_EMAILADDRESS FOR EMAILADDRESS
ACTIVE BEFORE INSERT POSITION 0
AS
BEGIN
  IF (NEW.ID IS NULL) THEN
    NEW.ID = GEN_ID(GEN_EMAILADDRESS_ID,1);
END^


Suggestion: The first thing I notice is that the
complete email itself is not actually stored. So we need to
ask Mr. Thrasher what the thinking behind that was. My first
instinct is to store that, because it is the original natural
value of interest.



Suggestion: The columns USERNAME and HOSTNAME I could
go either way on. If they are needed for querying and statistics,
it is better to put them in. While this violates 3rd Normal Form and
so puts us at risk, the values are supplied AFAIK by a batch import,
and so there is only one codepath populating them, and we are likely safe.
However, if we DO NOT need to query these values for statistics,
and they are only there for convenience at display time, I would
likely remove them and generate them on-the-fly in application
code. There are some other good reasons to do this that will
come up a little further along.



Suggestion: Unless I missed something in the schema
sent over, we need a unique constraint on the combination of
CONTACTID and USERNAME and HOSTNAME. Or, if we remove
USERNAME and HOSTNAME in favor of the original EMAILADDRESS,
we need a unique constraint on CONTACTID + EMAILADDRESS.
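

In Firebird terms that suggestion would look something like
the sketch below, assuming the child-table layout with a
CONTACTID column and the full EMAILADDRESS stored whole:

-- Sketch only: CONTACTID and EMAILADDRESS are the columns the
-- suggested child-table layout would add.
ALTER TABLE EMAILADDRESS
  ADD CONSTRAINT UQ_EMAILADDRESS
  UNIQUE (CONTACTID, EMAILADDRESS);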



Before We Get To Transactions



We are about to go into Part 2, which is about the other
tables that reference EMAILADDRESSES, but before we do
let's look at what the two tables would be if we made all
changes suggested so far:




CONTACTS                 EMAILADDRESSES
------------             --------------------
                         ID (surrogate key)
CONTACT_ID --------->    CONTACT_ID
other columns...         EMAILADDRESS
                         LABEL
                         USERNAME    (possibly removed)
                         HOSTNAME    (possibly removed)
                         DISPLAYNAME


You may notice the LABEL column showed up out of nowhere.
That column was previously in the cross-reference. When
the cross-reference went away it landed in EMAILADDRESSES.
That column LABEL holds values like "work", "home" and
so on. It is supplied from whatever system we pull
emails from, and so we have no constraints on it or
rules about it.



Changing Emails And Transactions



Now we move on from the basic storage of EMAIL addresses
to the other tables that reference those addresses.
These are things like the emails themselves, with their
lists of people sent to/from, and meetings, and presumably
other types of transactions as well.



When we look at transactions, which will reference
contacts and email addresses, we also have to consider
the fact that a CONTACT may change their email address
over time. Consider a person working for Vendor A, who
moves over to Vendor B. For some of the transactions
they will have been at Vendor A, and then going forward
they are all at Vendor B. This leads to this very
important question:



Do Transactions store details about the CONTACTS as
they were at the time of the transaction, or as they
are now?



In other words, if a CONTACT moves from one company to
another, and you look at a meeting with that person
from last year, should it link to where they are now?
Or should it be full of information about where they
were at the time?



The answer to this question is important because it
determines how to proceed on the two final points I
would like to raise:



  1. Should the various transactions have a foreign
    key back to EMAILADDRESSES, or should they simply
    link back to CONTACTS and contain the EMAILADDRESS
    itself?
  2. Do we need an integer surrogate key on the
    EMAILADDRESSES table, especially if we do not link
    back to it?


First Final Suggestion



So the first of the final two suggestions is: maybe
the transactions tables should just link back to CONTACTID
and contain a freestanding EMAILADDRESS. The first argument for
this is that it preserves the history as it was, and
if that is what we want, then this accomplishes it.
The second argument is that by putting the actual value
instead of an integer key back to some table, we
simplify coding by removing a join.



The arguments against embedding the email address might
be basically, "hey, if this is a kind of a data warehouse,
you are really supposed to be doing the snowflake thing and
you don't want to waste space on that value." To which I
respond that the engineer always has the choice of trading
space for speed. Putting the email in directly is a correct
recording of a fact, and takes more space, but eliminates
a very common JOIN from many queries, so Mr. Thrasher may
choose to make that call.



This also plays back to my question about whether we should
have USERNAME and HOSTNAME in the EMAILADDRESSES table.
If we start putting email addresses directly into tables,
we can also keep putting these other two columns in, which
trades again space for speed. We could also skip them and
code a parser in the application that generates them
on-the-fly as needed.



Second Final Suggestion



Now we go all of the way back to the child table
and ask a basic question: why is there an integer
surrogate key there? Integer surrogate keys are
useful in many situations, but contrary to what the
web generation learned, they are not some kind of
required approach in relational databases.



Consider: we need a unique constraint on CONTACTID+EMAILADDRESS
anyway, so we have to justify why we would add a new
column that does not add value. The reflex answer tends
to be "because they join faster," but that ignores the fact
that if you use the natural key of CONTACTID+EMAILADDRESS,
and put these columns into child tables, you do not need
to join at all! If we use the surrogate key and embed
it in child tables, then getting the CONTACT information
forces two joins: through EMAILADDRESSES to CONTACTS. But if
we use the natural key of CONTACTID + EMAILADDRESS we
already have the contact id, which saves a JOIN when we
are after CONTACTS details, and, unless we want to know
something like LABEL, we do not have to JOIN back to
EMAILADDRESSES at all.
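

A quick sketch of that claim, using a hypothetical
meeting_attendees transaction table that carries the natural
key (the contact column names are also assumptions):

-- Contact details are one JOIN away, and EMAILADDRESSES is
-- not touched at all.
SELECT m.meetingid,
       m.emailaddress,    -- the address as it was at the time
       c.firstname,       -- assumed columns on CONTACTS
       c.lastname
  FROM meeting_attendees m
  JOIN contacts c ON c.contactid = m.contactid
 WHERE m.meetingid = 1234;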

Conclusion



Well that's it. As promised, we have a few suggestions
and a lot of questions for Mr. Thrasher. Check back in the
coming days to see how the various questions work themselves
out in the comments.

Friday, December 17, 2010

Critical Analysis of an Algorithm: Sproc, Embedded SQL, and ORM

This is a follow-up to yesterday's
historical perspective
on ORM.
In this essay we examine a particular class of
business logic and ask what happens if we go server-side,
embedded SQL, or ORM.



This blog has two tables of contents, the
Complete Table of Contents and the list
of
Database Skills.



Processes



We are going to look at a process. The term is not
well defined, but my own working definition is any
operation that has as many of the following properties
as I seem to think are important at the time I make
the decision:



  • Is categorically not CRUD: requires much more
    than displaying data to user or sending a single-row
    operation to the server
  • Involves reading or writing many rows
  • Involves reading or writing from multiple tables
  • Involves multiple passes of the same data
  • Involves no user interaction while executing
  • If not coded to be idempotent can cause huge
    headaches
  • Will likely take longer than a user is willing
    to wait (these days 3-10 seconds) and so runs in the
    background.
  • Depends upon rules tables that control its behavior


A Particular Process: Magazine Regulation



I have a more complete description of this problem
here, so this is going to be very
short. The system handles magazine deliveries to
stores. The shop running the system has thousands of stores
and thousands of magazines. Every store has a
default quantity of the particular magazines they
carry.
For one particular magazine, NewsTime,
there are 1000 stores that get an average default
quantity of 50, requiring 50,000 magazines each week.



Here is the twist. You never get exactly 50,000, no
matter what you tell the distributor. Some weeks you
get 45,000, others you get 55,000, with any variation
in between. So the Regulation
Process
adjusts the defaults for each store
until the delivery amounts equal the on-hand total that
was delivered on the truck.



The Naive or Simple Algorithm



In the first pass, we are going to consider an
unrealistically simple version of Magazine Regulation,
where we have too many magazines and must up the
quantities until we're giving out the entire amount on-hand.



Assume a table has already been populated that has
the default quantities for each store, where the relevant
columns for the moment would be these:




StoreId | MagazineId | QTY_DEFAULT | QTY_REGULATED
--------+------------+-------------+--------------
      1 |         50 |          75 |             0
      2 |         50 |          23 |             0
      4 |         50 |          48 |             0
     10 |         50 |          19 |             0
     17 |         50 |         110 |             0
     21 |         50 |          82 |             0


We are told only that the increases must be evenly
distributed, we can't just ship 5000 extra magazines to
a single store. That makes sense. A simple algorithm to do this would be:



  1. Give each store one additional magazine until you
    run out of magazines or run out of stores.
  2. Repeat step 1 until you run out of magazines.


The Pseudo-Code



Let's mock something up in pseudo-code that
shows the structure of the solution:




magsToDeliver = get total magazines...
magsDefault   = get total of defaults...

-- Outer loop implements rule 2:
-- "repeat until you run out of magazines"
while magsToDeliver > magsDefault {

    -- Inner loop implements rule 1:
    -- "increase each store by 1 until you
    --  run out of stores or run out of magazines"
    for each store getting this magazine {
        if magsToDeliver <= magsDefault break

        -- If you want to go nuts, and even allow
        -- the accidental simultaneous execution
        -- of two routines doing the same thing,
        -- put these lines here instead
        magsToDeliver = get total magazines...
        magsDefault   = get total of defaults...

        -- This is the actual job getting done
        qty_regulate  += 1
        magsToDeliver -= 1
    }
}


The Three Methods



Let's characterize what happens with our three
choices of stored procedure, embedded SQL, or
ORM.



Stored Procedure.
Likely the fastest
solution, considering all of that row-by-row
going on. If that were in app code (ORM or not) we would
be making two
round trips to the server per iteration.



The really crazy thing about the stored procedure
solution is that it is utterly neutral to the
ORM question. The entire ORM good-bad debate
dissolves because there is no OOP code involved.
So this could be the magic solution that ORM lovers
and haters could both agree upon.



App Code With Embedded SQL (no ORM). Just about
everybody hates this idea these days, but it should
be here for completeness, and because there are some
advantages. The top of the pseudo-code requires two
aggregate pulls, and if you are not afraid of SQL you
can pull down the result in one pass, instead of
iterating on the client. Further, the innermost operation
can be coded in SQL as an "UPDATE deliveries from (SELECT
TOP 1 deliverid From deliveries...)" so that you get only
one round trip per iteration, where ORM will cost two.


Any Kind of ORM. By this I mean the code will
contain no SQL, and the innermost loop will likely
instantiate some "delivery" objects, one after another,
increment their qty_regulated property by one, and flush them out.
This is twice as expensive as embedded SQL because you
have to fetch the row from the database and then
write it back, where the embedded SQL can issue a single
command that locates and updates the row in a single
statement.



Some may argue that I misunderstand ORM
here, in that the library may be smart enough to allow
the write without the read, and without forcing you
to embed SQL. It would have to be something like
A) instantiate empty object with key, B) assign value
as an expression, like "+=1", C) save.
I welcome any such examples and will
update the post accordingly if any are forthcoming.
I am assuming that no ORM tool I have seen can do this
and would be happy to be corrected.



If the ORM forces us to calculate the initial sum
of QTY_DEFAULT by fetching each row as an object and summing
them in the app, we get an extra complete set of round
trips. Bummer. But if we say, "Hey, my ORM tool lets
me embed SQL in *emergencies*" then perhaps we can embed
a query with an aggregate and skip that cost. But
oops, we've got some embedded SQL. Leaky abstraction.




The Score So Far



So how does it look so far? All methods have
the same number of reads and writes to disk,
so we are scoring
them on round trips. If "X" is the number
of extra magazines to be distributed, and "Y" is
the number of stores getting the magazine, we have
for round trips:



  • Stored Procedure: 1
  • Embedded SQL: X + 1 (the first pull plus one trip per
    extra copy of the magazine)
  • ORM, Hypothetical: X + 1 (if the ORM tool can figure out how
    to do an update without first reading the row to the app)
  • ORM, Best Case: 2X + 1 (if the first pull can be an aggregate
    without embedding SQL, and two round trips per iteration)
  • ORM, Worst Case: 2X + Y (if the first pull must aggregate
    app-side and there are two round trips per iteration)



Update: if you want a laugh, check out the image on the
Wikipedia page for "Business Logic", which depicts
aggregation occurring on the client side.



This gives us the shape of the engineering decision.
With all options reading and updating the same number of
rows, it all comes down to round trips. As soon as you
go client side your round trips go way up, and if your
ORM tool does not support Update without Select, then
it doubles from there.



Now multiply this across the entire application,
every single action in code, every bit of "business
logic" with any kind of loop that iterates over
rows.



It Gets Worse/Better: Considering SQL Possibilities



If you happen to know much about modern SQL,
you may be aware of the amazingly cool SQL RANK()
function. If this function is used in the sproc
or embedded SQL approaches, you can execute the
algorithm with only one loop, in a maximum of
N=CEILING((Delivered-Regulated)/Stores)
iterations. This will go much faster than the
row-by-row, and now those two options are pulling
even further ahead of the naive row-by-row methods
encouraged by an ORM tool.
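

A rough sketch of one such batch pass, again assuming the
hypothetical deliveries table, PostgreSQL-style syntax, and a
bind parameter :extra_this_pass for the number of copies still
to hand out in this pass:

-- Give one extra copy to the first :extra_this_pass stores in
-- rank order. Repeating this at most CEILING(extra/stores)
-- times replaces the one-trip-per-copy inner loop.
UPDATE deliveries d
   SET qty_regulated = d.qty_regulated + 1
  FROM (SELECT storeid,
               RANK() OVER (ORDER BY storeid) AS rn
          FROM deliveries
         WHERE magazineid = 50) ranked
 WHERE d.magazineid = 50
   AND d.storeid = ranked.storeid
   AND ranked.rn <= :extra_this_pass;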



This ability of SQL will become extremely
important, as we are about to blow apart the
simplicity of the algorithm.



We Now Return You To The Real World



I have never been paid good money to write an
algorithm as simple as the one described above.
This is because mature businesses have always
refined these simple methods for years or decades,
and so the real situation is always more complex.



In a real magazine regulation algorithm, the rules
tend to be more like this:



  1. Apply all of these rules whether you are
    increasing or decreasing the amounts to deliver
  2. Stop applying the rules when delivery amounts
    have been balanced to what we have on hand, no matter
    where you are in the process
  3. Always increase/decrease any particular store
    by exactly one on each iteration, no matter which rule
    you are working on
  4. Never decrease any store below 2
  5. Decrease any store whose past 3 issues sold less
    than 60% by 1, unless this would project their sales
    of this issue above 60%, and prioritize
    by prior sales % ascending.
  6. If the previous step completes, and we are
    short of magazines decrease each store
    by 1 by order of previous sales percents
    ascending. Repeat until we are in balance.
  7. If all stores are reduced to 2 and we are
    still short, abort and write error to log.
  8. If after the decreases we have excess magazines,
    increase any store whose past 3 issues sold more than
    70% by 1, unless this would reduce their projected
    sales of this issue below 70%, and prioritize by
    prior sales % descending (so the stores with the most
    sales are handled first in case we don't get to all of them)
  9. If the previous step completes, and we are
    still in excess, increase each store by 1 in order
    of previous sales percents descending. Repeat until
    we are in balance.


This can also get doubled again if we must implement one
set of rules when we start out with a too few magazines,
and another set of rules when we start out with too many.



Well, it's not that hard. It actually comes down to
having four outer loops in succession. The percent-based
reduction, then the by 1 reduction, then the percent-based
increase, then the by 1 increase.



But technically the more important matters are these:



  • We now have to
    grab the sales % for every store for this magazine
    on their past 3 issues and keep it handy throughout
    the routine.

  • The rules stated above contain constants
    like 70%, 60%. These would properly be in some
    parameter table to allow the user to control them,
    so those have to be fetched.

  • The loop through the stores is now much different,
    as we are filtering on prior sales percent for
    the percent-based passes, and filtering and ordering
    on prior sales percent for the by 1 passes.


Revisiting the Three Approaches



Now let's see how our three approaches would change.



The Improved Stored Procedure. If we change the
sproc to use RANK() and make batch updates, we would
pull the prior sales percents into a temp table and
apply a covering index to cut the reads from that table in
half. Our main loop would then simply join to this
temp table and use it for both filtering and ordering.
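

A sketch of that prep step, assuming a hypothetical
sales_history table holding per-issue sell-through
percentages (PostgreSQL-style temp table syntax):

-- Pull the prior sales percents once into a temp table, then
-- give it a covering index so the main loop's filtering and
-- ordering reads can be satisfied from the index alone.
-- sales_history is a hypothetical source table.
CREATE TEMP TABLE store_sales_pct AS
SELECT storeid,
       AVG(pct_sold) AS prior_pct
  FROM sales_history
 WHERE magazineid = 50
 GROUP BY storeid;

CREATE INDEX idx_store_sales_pct
    ON store_sales_pct (prior_pct, storeid);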



The Embedded SQL. If we also changed the
embedded SQL so it was making batch updates with
RANK(), we would also generate a temp table. This
option remains the same as the sproc except for where
we put the SQL. However, it now has far fewer
round trips, and starts to look much more like the
sproc in terms of performance.



The ORM Approach. The idea here would be to
get those prior sales percents down into an ordered
collection and then use them as
the basis for the loops. The thorny part is that they
must be aggregated and sorted. If we want to avoid
all embedded SQL, then the aggregation
can be done client-side if we don't mind pulling down
3 times as many rows as are required. The sorting we
can pull off if we put the objects into a
collection such as an associative array, keyed by
the sales percent; then we can use [language of choice]'s
built-in sorting (hopefully), and we have escaped the
dread evil of embedded SQL.



So we end up where we were, only more so. The sproc
remains the fastest, and if we know how to code set-oriented
nifty stuff with RANK() then the embedded SQL will run
in almost the exact same time. The ORM most likely
requires even more round trips and expensive
app-side operations that are performed much more efficiently
in the db server, unless we are willing to break the
abstraction and embed a bit of SQL.



But in the end, if all of that cost of the ORM kicks
a 3 second routine up to 7 seconds, which is still well
below what any user would notice, and you avoid
embedded SQL, and it lets you keep your paradigm,
who am I to judge?



Conclusions



I offer none. There are so many conventions in play
regarding where to put your code, what tool you are
already using, and so forth, that it is really up to
the reader to draw conclusions. I only hope there is
enough information here to do so.

Thursday, December 16, 2010

Historical Perspective of ORM and Alternatives

A couple of years ago I broke my basic rule of sticking
to practical how-to and general programming philosophy
and wrote Why I Do Not Use ORM. It sure got a lot of hits,
and is read every day
by people searching such things as "orm bad" or "why use orm".
But I have never been
satisfied with that post, and so I decided to take
another stab from another angle. There are legitimate
problems that led to ORM, and those problems need to
be looked at even if we cannot quite agree on what they
are or if ORM is the answer.



UPDATE: In response to comments below and on reddit.com,
I have a new post that gives a detailed
analysis of an algorithm implemented as a sproc, in app
code with embedded SQL, and in ORM.





Here then, is one man's short history of commercial
database application programming, from long before
the ORM system, right up to the present.



This blog has two tables of contents, the
Topical Table of Contents and the list
of
Database Skills.



The Way Back Machine



When I began my career the world was a different place.
No Web, no Java, and Object Orientation had not yet
entered the mainstream. My first
application was written on a timeshare system (a microVAX)
and writing LAN applications made me a good living for
awhile before I graduated to client/server.



In those days there were three things a programmer
(We were not "software engineers" yet, just
programmers) had to know. Every programmer I knew
wanted to master all of these skills. They were:



  • How to design a database schema for correctness
    and efficiency.
  • How to code an application that could process
    data from the database, correctly and efficiently.
  • How to make a good UI, which came down to
    hotkeys and stuffing the screen with as much info
    as possible.


In this essay we are going to look at those first two.



My own experience may be somewhat peculiar in that I
have never worked on a team where the programmers
were separated from the database. (OK, one exception, in
my current assignment there is an iron curtain between
the two, but happily it is not my problem from where
I sit).
Coders made tables, and "tablers" wrote
code. So this focus on being a good developer by
developing both skills may be rare, enjoyed by those who
have the same ecumenical background that I enjoyed.



Some Changes That Did Not Matter



Things changed rapidly, but most of those changes
did not really affect application development.



When Windows 95 came out, being "almost as good as
a Mac", we recoded our DOS apps into Windows apps
without too much trouble and life went on as before.



Laser printers replaced dot-matrix for most office use,
CPUs kept getting faster (and Windows kept getting
slower), each year there were more colors on the
screen, disks got bigger and RAM got cheaper.



Only the internet and the new stateless programming
required any real adjustment, but it was easy for a
database guy because good practice had always been to
keep your transactions as short as possible. The stateless
thing just kind of tuned that to a sharp edge.



Finally, with the internet, the RDBMS finally lost its
place as sole king of the datastore realm, but those new
datastores will have to wait for another day, lest we
get bogged down.



Enter Object Orientation



Arguably nothing changed programming more than
Object Orientation. Certainly not Windows 95, faster
graphics or any of those other Moore's Law consequences.
I would go so far as to say that even
the explosion of the web just produced more programming,
and of different kinds of apps, and even that did not
come close to the impact of Object Orientation.
Disagree if you like, but as it came in, it was
new, it was strange, it was beautiful, and we were
in love.



Now here is something you may not believe. The biggest
question for those of us already successfully developing
large applications was: What is it good for? What does
it give me that I do not already have? Sure its
beautiful, but what does it do?



User interfaces were for me the easiest first place to
see the benefits. When the widgets became classes and objects,
and we employed encapsulation, inheritance and
composition, the world
changed and I don't know anybody who ever looked back.



OOP, Data, and Data Structures



But in the matter of processing data, things were not
so clear cut. The biggest reason may have been that
all languages back then had specialized data structures
that were highly tuned to handling relational data.
These worked so well that nobody at first envisioned
anything like ActiveRecord because
we just did not need it.



With these structures you could write applications
that ran processes involving dozens of tables, lasting
hours, and never wonder, "Gosh, how do I map this data
to my language of choice?" You chose the language you
were using precisely because it knew how to handle
data!


I would like to throw in just one example to show how
OOP was not relevant to getting work done back then.
I was once asked to optimize something called "ERP
Allocation" that ran once/day, but was taking 26 hours
at the largest customer site, obviously a big problem.
It turned out there was a call to the database inside of
a tightly nested loop, and when I moved the query outside
of the loop the results were dramatic. The programmers
got the idea and they took over from there. The main
point of course is that it was all about how to
efficiently use a database. The language was OOP, and
the code was in a class, but that had nothing to do
with the problem or the solution. Going further,
coding a process so data intensive as this one
using ActiveRecord
was prima facie absurd to anybody who knew about data
and code.



Java and the Languages of The Internet



But the web had another impact that was far
more important than just switching to stateless
programming. This was the introduction
of an entirely new family of languages that took
over the application space, listed here in no
particular order: Perl, PHP,
Python, Ruby, and the king of them all: Java.



All of these languages have one thing in common
that positively jumps out at a veteran: a complete
lack of data structures specialized for handling
relational data.

So as these languages exploded in popularity
with their dismal offerings in data handling,
the need to provide something better in that
area became rapidly clear.



Java has a special role to play because it was
pure OOP from the ground up. Even the whitespace
is an object! The impact of Java is very important
here because Object Orientation was now the One True
Faith, and languages with a more
flexible approach were gradually demoted
to mere 'scripting' languages. (
Of course proponents will quickly point out that 1/12 of the
world's population is now using a single application
written in one of those 'scripting' languages).



So the explosion of languages without decent
data handling abilities, coupled with a rise in
OOP-uber-alles thinking led us quite naturally to:



The First Premise of ORM: The Design Mismatch



The first premise of ORM is that there is a design
mismatch between OOP and Relational, which must be resolved
before any meaningful work can be done.



This view is easy to sympathize with, even if you
disagree, when you consider the points raised in the
above sections, that the languages in play lack any real
specialized data structures, and that a certain
exclusive truthiness to OOP has arisen that is blind
to entire classes of solutions.



So we must grant the ORM crowd their first
premise, in modified form. It is not that there
is a design mismatch, it is that there is something
missing, something that was in older systems that
is just not there in the newer languages. Granting
that this missing feature is an actual mismatch
requires a belief in the Exclusive Truth of OOP,
which I do not grant. OOP is like the computer
itself, of which Commander Spock said, "Computers
make excellent servants, but I have no wish to be
servant to a computer."



But anyway, getting back to the story, the race
was on to replace what had been lost, and to do it
in an OOPy way.



The Second Premise of ORM: Persistence



Fast forward and we soon have an entire family
of tools known as Object-Relational-Mappers,
or ORM. With them came an old idea: persistence.



The idea has always been around that databases
exist to persist the work of the programmer.
I thought that myself when I was, oh, about 25 or
so. I learned fast that my view of reality was,
*cough*, lacking,
and that in fact there are two things
that are truly real for a developer:



  • The users, who create the paycheck, and
  • The data, which those users seemed to think
    was supposed to be correct 100% of the time.


From this perspective, the application code suddenly
becomes a go-between, the necessary appliance that
gets data from the db to the user (who creates the
paycheck), and takes instructions back from the user
and puts them in the database (correctly, thank you,
and don't make the user wait). No matter how
beautiful the code was, the user would only ever see
the screen (or page nowadays) and you only heard about
it if it was wrong. Nobody cares about my code, nobody
cares about yours.



However, in the ORM world the idea of a database as the
persistence layer now sits on a throne reserved for
axiomatic truth. Those who disagree with me on this
may say that I have the mistaken perspective of an outsider,
to which I could say only that it is this very idea that
keeps me an outsider.



But we should not paint the world with a broad brush.
Chris Wong writes an excellent blog where he occasionally
details how to respect the database while using Hibernate, in
this post
and this post.



An Alternative World View



There are plenty of alternatives to ORM, but I would
contend that they begin with a different world view.
Good business recognizes the infinite value of the
users as the generators of the Almighty Paycheck, and
the database as the permanent record of a job well
done.



This worldview forces us into a humble position with
respect to our own application code, which is that it
is little more than a waiter, carrying orders to the
kitchen and food back to the patrons. When we see it
this way, the goal becomes to write code that can
efficiently get data back and forth. A small handful
of library routines can trap SQL injection, validate
types, and ship data off to the database. Another
set can generate HTML, or, can simply pass JSON
data up to those nifty browser client libraries
like ExtJS (now
"Sencha" for some reason)
.



This covers a huge amount of what an application
does, if you do not have much in the way of
business logic.



But How Do You Handle Business Logic?



I have an entire essay on this about half-written,
but in short, it comes down to understanding what
business logic really is. Update: This post is now available at http://database-programmer.blogspot.com/2011/01/business-logic-from-working-definition.html



The tables themselves are the bottom layer of
business logic. The table design itself implements
the foundation for all of the business rules.
This is why it is so important to get it right.
The tables are organized using normalization to
have a place for everything and everything in its
place, and after that the application code mostly
writes itself.



The application code then falls into two areas:
value-add and no value-add. There is no value-add
when the application simply ships data off to the
user or executes a user request to update the
database. Those kinds of things should be handled
with the lightest possible library that gets the
job done.



But the value-add stuff is different, where a
user's request requires lookups, possibly computations
and so forth. The problem here is that a naive
analysis of requirements (particularly the
transliteration error; scroll down to "The
Customer Does Not Design Tables")
will tend to generate many cases of perceived need for
value-add where a simpler design can reduce these
cases to no value-add. But even when the database has
been simplified to pristine perfection, there are jobs
that require loops, multiple passes and so forth,
which must be made idempotent and robust, which
will always require some extra coding. But if you know
what you are doing, these always turn out to be the
ERP Allocation example given above: they are a lot more
about the data than the classes.



Another huge factor is where you come down on the
normalization debate, particularly on the inclusion of
derived values. If you keep derived values out of the database,
which is technically correct from a limited perspective,
then suddenly the value-add code is much more important
because without it your data is incomplete. If
you elect to put derived values into your database, then
value-add code is only required when writing to the
database, so huge abstractions meant to handle any
read/write situation are unnecessary. (And of course,
it is extremely important to
keep denormalized values correct.)



And the Rest of It



This essay hardly covers the entirety of
making code and data work together. You still have
to synchronize schema changes to code, and I still
think a data dictionary is the best D-R-Y way to
do that.



I hope this essay shows something of why many programmers
are so down on ORM, but much more importantly that there
are coherent philosophies out there that begin with a
different worldview and deliver what we were all doing
before ORM and what we will all still be doing after
ORM: delivering data back and forth between user and
database.

Saturday, December 11, 2010

The Cost of Round Trips To The Server

A database is not much without the applications
that connect to it, and one of the most important
factors that affects the application's performance
is how it retrieves data from queries. In this essay
we are going to see the effect of round trips
on application performance.



This blog has two tables of contents, the
Complete Table of Contents and the list
of
Database Skills.



Pulling 15,000 Rows



The test will pull 15,000 rows from a
table. We do it three different ways and see
which is faster and by how much.



Getting a Lot of Rows



The script below creates a table and puts 1 million
rows into it. We want far more rows in the table than
we will actually pull so that we can pull fresh rows
on every pass through the test. It is deliberately crafted to spread
out the adjacent values of the integer primary key.
This is because, inasmuch as we can control what is
going on, we want
every single row to be on a different page, so that
in all tests the cost of retrieving the row is roughly
the same and we are measuring only the effect of our
retrieval methods.



The script can be run without modification in pgAdmin3,
and with slight mods on MS SQL Server.



create table test000 (
    intpk int primary key
   ,filler char(40)
)


-- BLOCK 1, first 5000 rows
-- pgAdmin3: run as pgScript
-- All others: modify as required
--
declare @x,@y;
set @x = 1;
set @y = string(40,40,1);
while @x <= 5000 begin
    insert into test000 (intpk,filler)
    values ((@x-1)*200 +1,'@y');

    set @x = @x + 1;
end

-- BLOCK 2, put 5000 rows aside
--
select * into test000_temp from test000

-- BLOCK 3, Insert the 5000 rows 199 more
--          times to get 1 million altogether
-- pgAdmin3: run as pgScript
-- All others: modify as required
--
declare @x;
set @x = 1;
while @x <= 199 begin
    insert into test000 (intpk,filler)
    select intpk+@x,filler from test000_temp;

    set @x = @x + 1;
end


Test 1: The Naive Code



The simplest code is a straight loop that
pulls 15,000 consecutive rows by sending
an explicit query for each one.



# Make a database connection
$dbConn = pg_connect("dbname=roundTrips user=postgres");

# Program 1, Individual explicit fetches
$x1 = rand(0,199)*5000 + 1;
$x2 = $x1 + 14999;
echo "\nTest 1, using $x1 to $x2";
$timeBegin = microtime(true);
while ($x1++ <= $x2) {
    $dbResult = pg_exec("select * from test000 where intpk=$x1");
    $row = pg_fetch_array($dbResult);
}
$elapsed = microtime(true)-$timeBegin;
echo "\nTest 1, elapsed time: ".$elapsed;
echo "\n";


Test 2: Prepared Statements



The next command asks the server to prepare a
statement, but it still makes 15,000 round trips,
executing the prepared statement with a new parameter
each time. The code looks like this:



# Make a database connection
$dbConn = pg_connect("dbname=roundTrips user=postgres");

# Program 2, Individual fetches with prepared statements
$x1 = rand(0,199)*5000 + 1;
$x2 = $x1 + 14999;
echo "\nTest 2, using $x1 to $x2";
$timeBegin = microtime(true);
$dbResult = pg_prepare("test000","select * from test000 where intpk=$1");
while ($x1++ <= $x2) {
    $pqResult = pg_execute("test000",array($x1));
    $row = pg_fetch_all($pqResult);
}
$elapsed = microtime(true)-$timeBegin;
echo "\nTest 2, elapsed time: ".$elapsed;
echo "\n";


Test 3: A single round trip



This time we issue a single command to retrieve
15,000 rows, then we pull them all down in one
shot.



# Make a database connection
$dbConn = pg_connect("dbname=roundTrips user=postgres");

# Program 3, One fetch, pull all rows
$timeBegin = microtime(true);
$x1 = rand(0,199)*5000 + 1;
$x2 = $x1 + 14999;
echo "\nTest 3, using $x1 to $x2";
$dbResult = pg_exec(
    "select * from test000 where intpk between $x1 and $x2"
);
$allRows = pg_fetch_all($dbResult);
$elapsed = microtime(true)-$timeBegin;
echo "\nTest 3, elapsed time: ".$elapsed;
echo "\n";


Results



I ran this five times in a row, and this is what I got:










Naive 15,000       Prepared 15,000      One Round Trip
--------------     ----------------     ----------------
~1.800 seconds     ~1.150 seconds       ~0.045 seconds


Compared to the naive example, the set-oriented
fetch of all 15,000 rows in a single shot ran
40 times faster. This is what set-oriented
code does for an application.



While the prepared statement option ran faster than
the naive option, the
set-oriented example still ran 25 times faster
than the repeated prepared statements.

I also re-arranged the order of the tests, and
the results were the same.



Does Server or Language Matter?



So this test was done using PHP against PostgreSQL,
will other servers and client languages get different
results? Given the same hardware, a different client
language or server is going to have a different spread
but the shape will be the same. Fetching all rows in
a single shot beats the living frack out of round trips
inside of loops in any client language against any
server.



Putting It Into Use



The most obvious conclusion is that any query
returning more than 1 row should return all rows
as a set. The advantage is so stark with large
row counts that it is worthwhile making this the
default for our applications, unless we can find
a very good reason not to. So what would the
objections be?



One objection might go something like, "Ken, I
see the numbers, but I know my app very well and
we never pull more than 10-20 rows in a pop. I
cannot imagine how it would matter at 10-20 rows,
and I do not want to recode." This makes sense,
so I ran a few more
tests with 20 and 100 rows, and found that, on
my hardware, you need about 100 rows to see a
difference. At 20 rows all three are neck-in-neck
and at 100 the set is pulling 4 times faster than
the prepared statement and 6 times faster than the
naive statement. So the conclusion is not an
absolute after all, some judgment is in order.



Another thing to consider is how many simultaneous
reads and writes might be going on at any given
time. If your system is known to have
simultaneous transactions running regularly, then the
complete fetch may be a good idea even if you do some
tests for best-guess row count and the tests are inconclusive.
The reason is that the test is a single user case,
but multiple simultaneous users put a strain on
the database, even when they are not accessing the same
tables. In this case we want the application to
play the "good citizen" and get in and out as quickly
as possible to reduce strain on the server, which will
improve the performance of the entire application, not
just the portions optimized for complete fetches.



Another objection might be, "Well, my code needs to
pull from multiple tables, so I cannot really do this.
When we do -PROCESS-X- we go row by row and need to pull
from multiple tables for each row." In this case
you *definitely* need to go set oriented and pull all
associated quantities down in a query with a JOIN or two.
Consider this, if on your particular hardware the ratio
of naive row-by-row to single fetch is 10, and you must
pull from 2 other tables for each row, that means you are
really running 30 times slower (ratio is 10 x 3 reads)
than you could be.
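

As a sketch of what "set oriented with a JOIN or two" means
here, with hypothetical orders, customers, and items tables:
one round trip brings down everything the per-row loop was
fetching three queries at a time.

-- One query with JOINs replaces a loop that, for every order
-- row, made two more trips for the customer and the item.
SELECT o.orderid,
       o.qty,
       c.custname,
       i.description,
       o.qty * i.price AS extended_price
  FROM orders o
  JOIN customers c ON c.custid = o.custid
  JOIN items     i ON i.sku    = o.sku
 WHERE o.orderdate = '2010-12-01';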



A Final Note About PHP, Data Structures, and Frameworks



Back when dinosaurs ruled the Earth and there was
no internet (outside of Universities, etc),
the languages we used had specialized data structures
that were tuned to database use. Compared to those
older systems the newer languages born on the
internet are more or less starving for such a
data structure.



PHP gets by fairly well because its associative
array can be used as a passive (non object-oriented)
data structure that comes pretty close to what we had
before.



I bring this up because the choice of a language and
its support for a "fetch all" operation obviously
impacts how well the conclusions of this post can
be implemented. If your mapping tool has an iterator
that absolves you of all knowledge of what is going
on under the hood, it may be worthwhile to see if it
is doing a complete fetch or a row-by-row.

Thursday, December 9, 2010

Submit Analysis Request to the Database Programmer


I generally do not reveal too many details about systems I design for customers
or employers. This leaves me sometimes in a bind for example material. I either
have to simplify it beyond what I would like, or make something up that I have
not actually put into Production.



On top of that, one of the key themes of this blog is that table design
is a crucial skill, and if the examples I give do not match what you
are doing, they may be hard to make use of.



So I would like to invite analysis requests. Go over to the
Contact the Author page and drop me an email and tell me
about the system you are trying to design or optimize.



There are no rules on the type of system.



The most interesting mini-projects would be those where advice you
have been given elsewhere (or here for that matter) does not seem
to fit.



I will do my best to reply, even if I have to say no, so that
nobody is left wondering.



Remember, this blog is one of those hobby/professional things:
good for all of us, but nobody is getting paid, so if you are in
a terrible hurry this might not be the best option.

Tuesday, December 7, 2010

PASS Summit Final Thoughts





I've got to figure out how to get to the PASS Summit every year!  It was AWESOME!





It was great to meet more of the tweeps I follow and reconnect with those that I've met before.

It was also my first time in Seattle. The weather was better than expected. I had a great time walking around the city when time allowed, and I had the opportunity to try the local restaurants and visit the Space Needle. Everything was great, especially the coffee at Top Pot Doughnuts.





Every time slot had several sessions I wanted to attend. You can't get to all the sessions you would like, so buying the conference DVDs is a must.





Participating in my local PASS user group, SQL Saturdays, and Twitter made being a FIRST TIMER at this year's PASS Summit a non-issue. I did not feel like a FIRST TIMER. I knew a lot of people because of participating locally.  I just blended right in, had fun, and learned a ton.





Being a FIRST TIMER, I didn't realize I could download the slide decks before the DVDs arrive. A prompt from a co-worker made me aware of this benefit. I have downloaded and reviewed over a dozen decks so far. Great benefit. Still tons more to go through.





So, the most significant takeaways for me:

  1. Meeting Ola Hallengren and Paul Randal. Talking to them about Ola's Maintenance Solution was invaluable.

  2. The changes to the MCM program have me thinking about making it a 2011 New Year's resolution. I'll have a post on that later.

  3. The Separation of Duties Framework that Lara Rubbelke and Il-Sung Lee debuted has me thinking about how to use it back in the real world.

Suggestions for next year:

  1. Any possibility of getting Youtube videos on the conference DVDs for the Top 10 highest rated sessions?

  2. Put Twitter handles on the conference badges.

  3. Select one of my speaker abstracts.  ;-)

I've told my manager that we should be sending a representative every year. No excuses.

I had a great time at my First PASS Summit! 



Friday, December 3, 2010

PASS Summit Day 3

The final day of the PASS Summit started off with a great keynote speech by Dr. David DeWitt talking about SQL Query Optimization.

Dr. DeWitt is currently a technical fellow at Microsoft leading the Jim Gray Systems Lab. He said "Rocket science is easier than query optimization." So, if you fail at query optimization, they send you to build rockets. This was one of those presentations that will need to be watched multiple times before non-Ph.D.s like me assimilate everything. He did a great job explaining an unbelievably complex topic in terms most people can understand. It'll just take me a few passes to totally understand. I hope.





DBA Mythbusters

A fun interactive session with Paul Randal from SQLSkills.com with Buck Woody playing a supporting role. Buck said the interaction with Paul was not rehearsed. Too bad video doesn't make it on the conference DVDs.





The PowerShell Cookbook for the DBA by Joe Webb

Joe did an excellent overview of how a DBA can use PowerShell to automate repetitive tasks. You can download the scripts from the session here.





Consolidating data collection with SQLDIAG and analysing it all with SQLNexus by Chris Bolton

I went to this session because I need to get more comfortable with these tools.

Chris did a very thorough walkthrough of how to use SQLDIAG and SQLNexus.

I need to read his book next.  





Business Intelligence in SharePoint 2010 by Brian Knight

A very well-rehearsed, entertaining, and educational session. The presenters were supposedly doing tequila shots the entire 90 minutes, but they were still standing at the end of the presentation, so I think something else was in the bottle.





My hope for this series of blog posts is to highlight just a few sessions that I found worthwhile.  I think the PASS Summit conference DVDs are a compulsory purchase when attending.  There are too many excellent sessions to attend in three days. I don't think you get the full value of the conference unless you buy the DVDs.











A Case When Table Design is Easy and Predictable

Good table design is a great foundation for a successful
application stack. Table design patterns basically resolve
into master tables and transaction tables. When we know
a thing or two about the master tables (or entities if you
prefer), we can infer a great deal about the transactions.



This blog has two tables of contents, the
Topical Table of Contents and the list
of
Database Skills.



A Time Billing System



Imagine we have been asked to recode the company's
time-billing system. Because this is for the company
we work for, we have some inside knowledge about how
things work. We know that:



  • There are, of course, customers.
  • ...and employees who record time
  • Each hour we record goes against a Work Order
  • There are different kinds of jobs, like
    project management, programming, programming
    management, and others.


Knowing only this, is it
possible to anticipate what the system will look like?
A safe answer is "no", on the claim that we will
undoubtedly learn more, but this safe answer happens
to be wrong. We can in fact anticipate
the overall shape of the system, and new information
will shift details, but it will not change the shape.



We can anticipate the nature of the transactions
if we determine the upper bound of complexity
and the combinatorial completeness of the
system.



The Upper Bound of Complexity



We can safely assume that the big number to get
right is going to be the billing rate. Our employer
assumes we will get everything else right, but the
billing rate is going to have them chewing their fingernails
until they know we understand it and have coded it
correctly.



The cool thing is that we already have enough information
to establish an upper bound on the complexity of
the system by looking at the master tables, where a master table
is generally one that lists details about real things
like people, places, things, or activities.
So far we know (or think we know) about three master tables:



  • Customers
  • Employees
  • Services


Now we define the upper bound of complexity as:



The upper bound of complexity
occurs when the billing rate is determined by all three
master entities.



In plain English, calculating a billing rate can be as
complicated as looking up a rate specific to a customer,
a service, and an employee, but it cannot be more
complex than that, because there are no other entities
with which to work.
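
To make the three master entities concrete, here is a minimal
sketch; the table and column names are hypothetical, and real
tables would of course carry many more columns (names, addresses,
and so on). Whether each master carries its own default or
negotiated rate is one of the questions the combinations below
will answer.

-- Hypothetical sketch of the three master tables
CREATE TABLE customers (
    customer varchar(20) PRIMARY KEY,
    rate     numeric(8,2)   -- negotiated rate, if the business uses one
);

CREATE TABLE employees (
    employee varchar(20) PRIMARY KEY,
    rate     numeric(8,2)   -- default billing rate, if any
);

CREATE TABLE services (
    service  varchar(20) PRIMARY KEY,
    rate     numeric(8,2)   -- default rate for the service
);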



Combinatorially Complete



We can also anticipate all possible calculations for
the billing rate by working through the complete set
of combinations of master entities. This would look
like the list below. Note that we are not trying to
figure out right now which of these is likely to occur,
we just want to get them listed out:



  • Each service has a default rate
  • Each customer has a negotiated rate
  • Each employee bills out at a default rate
  • The combination customer-service may have a rate
  • The combination customer-employee may have a rate
  • The combination customer-service-employee may have
    a rate (this is the upper bound of complexity: all
    three master entities determine the rate).


Unless we live in a super-simple world where only the first
item in the list is present, we will end up dealing with
several if not all of the combinations listed above.



Each of these combinations then becomes a table, and
we know the billing rate will be determined by a
resolution.
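
Still using hypothetical names, each combination can be a plain
cross-reference table keyed on the entities that determine the
rate (foreign keys back to the master tables are omitted from the
sketch for brevity):

-- Hypothetical cross-reference tables, one per combination
CREATE TABLE rates_customer_service (
    customer varchar(20),
    service  varchar(20),
    rate     numeric(8,2),
    PRIMARY KEY (customer, service)
);

CREATE TABLE rates_customer_employee (
    customer varchar(20),
    employee varchar(20),
    rate     numeric(8,2),
    PRIMARY KEY (customer, employee)
);

CREATE TABLE rates_customer_service_employee (
    customer varchar(20),
    service  varchar(20),
    employee varchar(20),
    rate     numeric(8,2),
    PRIMARY KEY (customer, service, employee)
);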

New Information



Now comes the big day and we interview with somebody
we'll call "The Explainer" who is going to officially
explain the billing system. Can he break what we
already know? No. At most he can:



  • Make us aware of new master entities, perhaps
    there are "projects" and "contracts" that get their
    own billing arrangements.
  • Dispel our notions about some of the combinations
    by saying, "Oh we never give a customer a default
    rate, the default rates come out of the services."


Going in Cold



What about the case where we know absolutely nothing
about an assignment when we go in to begin the interviews?
We can do a good job of thinking on our feet if we draw
"The Explainer" towards the master entities. As we gain
confidence that we know what the master entities are,
we can ask questions to probe Combinatorial Completeness
and the Upper Bound of Complexity.



One caveat: This method works for transactions between
master entities. When "The Explainer" starts describing
something that cannot be recognized as an interaction
between master entities, do not try to stuff the problem
into this box; it may not fit.



What About the Application?



At this point, we can also anticipate a lot of
what the application will look like. We will need
maintenance screens for all of the master entities,
and a really slick UI will allow for very easy editing
of those various cross-reference combination tables.
Once that much is done, we are almost finished,
but not quite.



There will be some billing process that pulls
the time entries, finds the correct billing rate for
each one, and permanently records the invoices. If
we use a resolution, this task is
child's play to code, debug, and maintain.
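
One way to sketch such a resolution in plain SQL, assuming the
hypothetical tables above plus a time_entries transaction table
and an invoice_lines table to receive the output: the most
specific rate wins and we fall back toward the single-entity
defaults. The precedence among the fallbacks is itself a business
decision; the order shown here is only an assumption.

-- Resolve each unbilled time entry to a rate, most specific first
INSERT INTO invoice_lines (customer, employee, service, hours, bill_rate)
SELECT t.customer
, t.employee
, t.service
, t.hours
, COALESCE(cse.rate   -- customer-service-employee rate
         , ce.rate    -- customer-employee rate
         , cs.rate    -- customer-service rate
         , c.rate     -- customer's negotiated rate
         , e.rate     -- employee's default rate
         , s.rate)    -- service's default rate
FROM time_entries t
JOIN customers c ON c.customer = t.customer
JOIN employees e ON e.employee = t.employee
JOIN services  s ON s.service  = t.service
LEFT JOIN rates_customer_service cs
       ON cs.customer = t.customer AND cs.service = t.service
LEFT JOIN rates_customer_employee ce
       ON ce.customer = t.customer AND ce.employee = t.employee
LEFT JOIN rates_customer_service_employee cse
       ON cse.customer = t.customer
      AND cse.service  = t.service
      AND cse.employee = t.employee
WHERE t.invoiced = 'N'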



Then of course there is the presentation, the actual
bill. Depending on the company, these may be delivered
as hardcopy or in email. That will of course have to
be coded up.



Conclusion



There are two conclusions. First, as originally stated,
many transactions can be anticipated when you know what
the master entities are.



But secondly, and every bit as important, once the table
design is sound, the application pretty much writes itself.
On a personal note, this is probably why I do not find
application coding as exciting as I once did. Once I
realized that the real challenge and satisfaction was in
working out the tables, the coding of the app became a
bit of a drudge; it requires no judgment as far as
business rules are concerned.

Wednesday, December 1, 2010

PASS Summit Day 2

Day 2 started off with Andy Warren's recommendation of Top Pot Doughnuts.  

I normally exercise in the morning, so the walk down to Top Pot was a nice equivalent.

Top Pot sits right in front of pylon 43 of the Space Needle monorail on 5th Street downtown.

They have the mother of all glazed donuts and excellent coffee.  Coffee so smooth I could drink it black. That's significant. I don't like black coffee. 





Day 2 was to be BI day for me at the PASS Summit but it quickly turned into the search for Ola Hallengren.  





I attended or wanted to attend the following sessions: 





Automate SQL Server Administration with PowerShell by Allen White

I've read a bunch of Allen's blog posts, so I wanted to see him in person. He did a broad overview of how to automate DBA tasks using PowerShell. If you're a DBA wanting to learn PowerShell, Allen should be on your list of bloggers to read and one of your follows on Twitter. I did see all of this presentation and introduced myself to Allen afterwards.





50 Surprising Features of SQL Server Business Intelligence by Amir Netz, Donald Farmer

I wanted to see this session after hearing Amir Netz speak during the morning keynote, but I really wanted to meet Ola Hallengren at the Summit. So, I went searching for him. The power of Twitter helped me get a description and a picture of him.





Data Warehouse Design and Architecture Best Practices by Erik Veerman

I sat for the beginning of this presentation, but then I heard Ola was in the Experts Area in the Expo hall, so I bailed out. I did find Ola there, and we spoke for over an hour about his Maintenance Solution script. We had a great conversation, and I'm all the more convinced that we need to be using his script at my employer.





Business Intelligence End to End by Donald Farmer

This session was cancelled. I'm hoping it ends up on the DVDs somehow.





Another excellent day at the Summit.  It was great to meet Ola Hallengren in person! 

The Really Cool NTILE() Window Function

If you regularly code queries and have never been
introduced to the windowing functions, then
you are in for a treat. I've been meaning to write
about these for over a year, and now it's time to get
down to it.



Support in Major Servers



SQL Server calls these functions
>Ranking Functions.



PostgreSQL supports a wider range of these functions
than MS SQL Server, having added them in
version 8.4, and calls them
>Window Functions.



Oracle's support is broader (by a reading of the docs)
than SQL Server or PostgreSQL, and they call them
>Analytic Functions.



I try to stay away from MySQL, but I did a quick Google on
all three terms and came up with a few forum posts asking
when and if they will be supported.



The NTILE() Function



In this post we are going to look at NTILE, a cool function
that allows you to segment query results into groups and
put numbers onto them. The name is easy to remember because
it can create any "-tile": a percentile, a decile, or anything
else; in short, an n-tile. But it is much easier to
understand with an example, so let's go right to it.



Finding percentiles



Consider a table of completed sales, perhaps on an eCommerce site.
The Sales Manager would like them divided up into quartiles,
four equally divided groups, and she wants the average and
maximum sale in each quartile. Let's say the company is not
exactly hopping, and there are only twelve sales, which is good
because we can list them all for the example. If we already
had the quartiles provided, the query would be easy. Imagine
we were lucky enough to be starting with this:




 CUSTTYPE |  AMOUNT | QUARTILE
----------+---------+----------
 RETAIL   |   78.00 |        1
 RETAIL   |  234.00 |        1
 DEALER   |  249.00 |        1
 DEALER   |  278.00 |        2
 RETAIL   |  392.00 |        2
 RETAIL   |  498.00 |        2
 DEALER   |  500.00 |        3
 RETAIL   |  738.00 |        3
 DEALER   | 1250.00 |        3
 RETAIL   | 2029.00 |        4
 RETAIL   | 2393.00 |        4
 RETAIL   | 3933.00 |        4


The query would be child's play if we already
had the quartile column:




Select quartile
, avg(amount) as avgAmount
, max(amount) as maxAmount
FROM ORDERS
GROUP BY quartile
ORDER BY quartile


The Problem is We Do Not Have Quartile



The problem of course is that we do not usually
have handy columns like QUARTILE provided, but
we can generate the QUARTILE column during the
query by using NTILE.




Select quartile
, avg(amount) as avgAmount
, max(amount) as maxAmount
FROM (
    -- The subquery is necessary
    -- to process all rows and add the quartile column
    SELECT amount
    , ntile(4) over (order by amount) as quartile
    FROM ORDERS
) x
GROUP BY quartile
ORDER BY quartile


This query will give us what the Sales Manager wants.



Dissecting the Function and The OVER Clause



The NTILE() function takes a single argument, which tells
the server how many groups to divide the data into. When
the rows divide evenly, every group gets the same count of
rows; when they do not, the server decides which groups
will come up one row short.
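
For example, with the twelve-row ORDERS table from this post,
NTILE(4) gives exactly three rows per quartile, but NTILE(5)
cannot divide twelve rows evenly, so (on SQL Server and
PostgreSQL, at least) the first two groups get three rows and the
remaining three get two. A quick way to see the distribution:

Select quintile
, count(*) as rowsInGroup
FROM (
    SELECT amount
    , ntile(5) over (order by amount) as quintile
    FROM ORDERS
) x
GROUP BY quintile
ORDER BY quintile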



If you pass 100 to NTILE(), you get a percentile. If you
pass 10, you get a decile, and so forth.



The magic is in the OVER() clause. It supports two clauses of
its own, and the example shows one, the ORDER BY. Quite simply, the
ORDER BY clause tells the server how to line up the rows when
assigning the NTILE values. It is very flexible, and it has
nothing to do with your query's overall ORDER BY clause.


The Second Clause: PARTITION



Now we will pretend the Sales Manager is not satisfied, and
wants separate numbers for the two Customer Types. We could
do this if the NTILE() function would create two sets
of quartiles, one for each Customer Type, like so:




 CUSTTYPE |  AMOUNT | QUARTILE
----------+---------+----------
 DEALER   |  249.00 |        1
 DEALER   |  278.00 |        2
 DEALER   |  500.00 |        3
 DEALER   | 1250.00 |        4
 RETAIL   |   78.00 |        1
 RETAIL   |  234.00 |        1
 RETAIL   |  392.00 |        2
 RETAIL   |  498.00 |        2
 RETAIL   |  738.00 |        3
 RETAIL   | 2029.00 |        3
 RETAIL   | 2393.00 |        4
 RETAIL   | 3933.00 |        4


We can do this by using the PARTITION BY clause,
which tells the server to break the rows into
groups and apply the NTILE() numbering separately
within each group. The new query would be this:




Select custtype
, quartile
, avg(amount) as avgAmount
, max(amount) as maxAmount
FROM (
    -- The subquery is necessary
    -- to process all rows and add the quartile column
    SELECT amount
    , ntile(4) over (partition by custtype
                     order by amount) as quartile
    FROM ORDERS
) x
GROUP BY custtype, quartile
ORDER BY custtype, quartile


Bonus Points: The Median



Now once again the Sales Manager, who is never satisfied,
comes down and says that the average is no good; she
needs the max and the median sale value within each quartile.
To keep it simple, she does not need this broken out
by customer type; it can be applied to the entire set.



This is a case where we can use NTILE() twice. The first
time we will break all sales up into four groups, to get
the quartiles, and then we will break up each quartile into
two groups to get the median. The code looks like this:




Select quartile
, max(case when bitile=1 then amount else 0 end) as medAmount
, max(amount) as maxAmount
FROM (
    -- The second pass adds the
    -- 2-tile value we will use to find medians
    SELECT quartile
    , amount
    , ntile(2) over (partition by quartile
                     order by amount) as bitile
    FROM (
        -- The subquery is necessary
        -- to process all rows and add the quartile column
        SELECT amount
        , ntile(4) over (order by amount) as quartile
        FROM ORDERS
    ) x1
) x2
GROUP BY quartile
ORDER BY quartile


The magic here is that we know we've divided the data
evenly into four sets, so the median will be the maximum
value halfway through each set. In other words, it will be the
maximum value where bitile=1 within each quartile.
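
To see it with the twelve sales above: quartile 1 holds 78.00,
234.00, and 249.00, and the inner NTILE(2) puts 78.00 and 234.00
into bitile 1, so MAX(amount) where bitile=1 gives 234.00, the
median of that quartile. Note that when a quartile holds an even
number of rows, this approach yields the lower of the two middle
values rather than their average.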



One More Note About Oracle



Once you have the basics of the OVER clause down, Oracle
looks really good, because it supports the clause over
the largest range of functions, at least going by the
respective doc pages for each platform.