Tuesday, November 30, 2010

Loops Without Cursors




Sometimes you need to process a table row-by-row,
and the established approach is to use cursors,
which are verbose, slow, and painful to code
and use.



The Cursor Example



Here is the basic minimum syntax required to
loop through a table and get something done.
The SQL flavor is MS SQL Server, but it's not
much better in any other flavor.




-- I coded this off the top of my head, there
-- may be a minor syntax error or two


-- Most of this is pseudo-code, but take
-- note that it is ordered on column1

declare someCursorName cursor for
select column1, column2, column3
from anyTable
ORDER BY column1

-- Have to do this now
open someCursorName

-- Now you need to declare some variables
-- For the example I'm just making everything int

declare @column1 int
, @column2 int
, @column3 int

-- Gosh, we're actually about to start the loop! Finally!
fetch next from someCursorName into @column1, @column2, @column3
while @@fetch_status = 0 begin

    -- If you still remember what you actually wanted
    -- to do inside the loop, code it here:


    -- Repeat this line from the top here again:
    fetch next from someCursorName into @column1, @column2, @column3
end

-- Not done yet, these two lines are crucial
close someCursorName
deallocate someCursorName


Call me petty, but what I hate about that code is that I
have to refer to specific columns of interest 3 times (not
counting the declarations). You refer to them in the
cursor declaration and in the two FETCH commands. With
a little clever coding, we can vastly simplify this
and do it only once.



Using An Ordered Column



We can execute the same loop without the cursor if
one of the columns is ordered and unique. Let us say
that column1 is the primary key, and is an auto-incremented
integer. So it is ordered and unique. The code now
collapses down to:



-- I coded this off the top of my head, there
-- may be a minor syntax error or two


-- We can't get around declaring the vars, so do that
declare @column1 int
, @column2 int
, @column3 int

-- If you know a safe value for initialization, you
-- can use the code below. If this is not 100%
-- safe, you must query for the value or it must
-- be supplied from some other source

set @column1 = -1

-- BONUS POINTS: Can this become an infinite loop?
while 1 = 1 begin

    -- Now we code the query and exit condition
    select TOP 1
           @column1 = column1
         , @column2 = column2
         , @column3 = column3
      from anyTable
     WHERE column1 > @column1 -- this is what advances the loop
     ORDER BY column1

    if @@rowcount = 0 begin
        break
    end

    -- Put the actions here

end


Final Notes



The only requirement for this approach is
that you have a unique ordered column.
This usually means a unique key or primary
key. If "column1" is not unique, the loop
will skip all but the first value in each
group.



Also, it is very nice if you know a safe
value to use as an initializer. Without that,
you must query for the minimum value that matches
the condition and then decrement it by one.



Finally, can this loop become infinite? No.
In the extremely unlikely situation
that rows are being added to the base table faster
than you are processing them, it could
go on for a very long time. But if that were
happening I'd say there was a separate problem to
look at.



It should probably go without saying, but if
this particular loop is going to run very
often, the table should be indexed on your
unique ordered column. If that column is a primary key
or already has a unique constraint, there is no need
to create an index explicitly, because one exists
as part of the key or constraint.

Saturday, November 27, 2010

Revisiting Normalization and Denormalization

In this blog I have written many articles on Normalization
and Denormalization, but I have never put all of the arguments
together in one place, so that is what I would like to do today.



There are links to related essays on normalization and denormalization at the bottom of this post.






The What and Why of Normalization



Normalization is the process of designing tables so that each fact is
stored in exactly one place. A "fact" in this case is any detail that
we have to keep track of, such as a product's description, a product's
price, an employee's social security number, and so forth.



The process is all about figuring out what tables you need and what
columns each table will have. If we are talking about an employee's
social security number, then we can guess right from the start that
will have a table of EMPLOYEES, and that one of the columns will be
SSN. As we get more details, we add more tables and columns.
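
Purely as an illustration (the table and column names here
are made up), the employee example might start out as a
table something like this:

-- Hypothetical sketch: each fact about an employee lives in
-- exactly one place
create table EMPLOYEES (
    employee_id  int          not null primary key
,   first_name   varchar(50)  not null
,   last_name    varchar(50)  not null
,   ssn          char(11)     not null unique  -- stored here and nowhere else
)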



The advantage of normalization comes when your application writes
data to the database. In the simplest terms, when the application
needs to store some fact, it only has to go to one place to do it.
Writing this kind of code is very easy. Easy to write, easy to debug,
easy to maintain and improve.



When the database is not normalized, you end up spending more time
writing more complicated application code that is harder to debug.
The chances of bad data in your production database go way up.
When a shop first experiences bad data in production, it starts to
become tempting to "lock down" access to the database, either by
forcing updates to go through stored procedures or by trying to
enforce access to certain tables through certain codepaths. These
are really the same strategy implemented in different tiers: both
try to prevent bugs by routing access through some bit of code
that "knows what to do." But if the database is normalized,
you do not need any magic code that "knows what to do."



So that, in brief, is what normalization is and why we do it.
Let's move on now to denormalization.



Denormalization is Harder to Talk About



Normalization is easy to explain because there is a clearly
stated end-goal: correct data. Moreover, there are well-defined
methods for reaching the goal, which we call the normal forms,
First Normal Form, Second Normal Form,
and higher forms. By contrast, denormalization is much harder
to talk about because there is no agreed-upon end goal. To make
matters worse, denormalization violates the original theory of
Relational Databases, so you still have plenty of people screaming
not to do it at all, making things even more confusing. What we have
now in our industry is different
shops denormalizing in different ways for different reasons.



The arguments that I have heard in my career boil down to two
basic groups. The first set of arguments centers around
calculated or derived values, and the second set centers
around programmer convenience.



Arguments for Derived Values



My own experience comes down heavily in favor of denormalizing
by storing derived values directly into the tables, with the
extremely significant caveat that you must have a way to ensure
that they are always correct. In this paradigm you maintain
strict normalization for facts supplied from the outside,
and then layer on additional facts that are calculated during
write operations and saved permanently.



Here is a very simple example.
A strictly normalized database happens to be missing data
that many programmers would automatically assume should be
stored. Believe it or not, a simple value in a shopping
cart like EXTENDED_PRICE is forbidden by 3rd normal form
because it is a non-key dependency, or, in plain
English, since it can be derived from other values (QTY * PRICE),
then it is redundant, and we no longer have each fact stored
in exactly one place. The value of EXTENDED_PRICE is only
correct if it always equals QTY * PRICE, and so there is now
a "fact" that is spread across three locations.
If you store EXTENDED_PRICE, but do not have a way to ensure
that it will always 100% of the time equal QTY * PRICE,
then you will get bad data.



So, given the risk of bad data, what is to be gained by
putting EXTENDED_PRICE into the cart? The answer is that
it adds value to the database and actually simplifies
application code. To see why, imagine a simple eCommerce
shopping cart that does not store any derived values.
Every single display of the cart to the user must go all
over the place to gather lots of details and recalculate
everything. This means re-calculating not just the
EXTENDED_PRICE, but adding in item level discounts, taking
account of possible tax exemptions for different items,
rolling
the totals to the cart, adding in tax, shipping, perhaps
a customer discount, a coupon, and who knows what else.
All of this just to display the cart, every time, no matter
what the purpose.



This situation leads to three problems:

  • A pitifully slow application (too many disk reads and lots
    of cycles spent recalculating the values).
  • Maddening bugs when an application update makes subtle changes
    to the calculations, so the customer's order no longer displays
    the same numbers as it did yesterday.
  • The frustrating requirement that the simplest of reports
    must route through application code to calculate these values
    instead of simply reading them off the disk, which leads to
    reporting systems that are orders of magnitude slower than they
    could be and horribly more complicated than they need to be
    because they can't just read straight from the tables.



Now let's look at how that same shopping cart would be used
if all of those calculated values were generated and saved
when the order is written. Building on your foundation of
normalized values (price, qty), you need only one body of code
that has to perform calculations. This magic body of code
takes the user-supplied values, adds in the calculations,
and commits the changes. All other subsequent operations
need only to read and display the data, making them faster,
simpler, and more robust.



So the obvious question is how to make sure the derived
values are correct. If they are correct, we gain the
benefits with no down side. If there is the smallest chance
of bad data, we will quickly pay back any benefit we gained
by chasing down the mistakes.



From a technical standpoint, what we really need is some
technology that will make sure the calculations cannot
be subverted: it must not be possible for a stray
bit of program code or SQL statement to
put the wrong value in for EXTENDED_PRICE. There are a
few generally accepted ways to do this:



  • Require all writes to go through a certain codepath.
    The only PRO here is that you keep the logic in the
    application code, and since most shops have more programmers
    than database people, this makes sense. The only CON is that
    it never works. One programmer working alone can maintain
    discipline, but a team cannot. All it takes is one programmer
    who did not know about the required codepath to screw it all
    up. Also, it makes your system inflexible, as it is no longer
    safe to write to the database except through a single application.
  • Require all writes to go through stored procedures.
    This is nominally better than the codepath solution because it
    is not subvertible, and you can allow different side apps and
    utilities to safely write to the database. But it makes a lot
    of work and tends to be very inflexible.
  • Put triggers onto tables that perform the calculations
    and throw errors if a SQL statement attempts to explicitly
    write to a derived column (see the sketch after this list).
    This makes the values completely
    non-subvertible, ensures they will always be correct, and allows
    access from any application or utility. The downside is that
    the triggers cannot be coded by hand except at extreme cost, and
    so must be generated from a data dictionary, which is fairly easy
    to do but tends to involve extreme psychological barriers. In
    these days of ORM many programmers mistakenly believe their
    class files define reality, but this is not true. Reality is
    defined by the users who one way or another create the paychecks, and
    by the database, which is the permanent record of facts. But
    a programmer who thinks his classes define reality simply cannot
    see this and will reject the trigger solution for any number of
    invalid reasons.
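
To make the trigger option concrete, here is a minimal
sketch in SQL Server syntax. The table and column names
(ORDER_LINES, qty, price, extended_price) are hypothetical,
and a generated trigger would normally do more, such as
raising an error on an explicit write to the derived column;
this simplified version just recomputes the value on every
write so it can never be wrong:

-- Simplified sketch of a trigger that maintains a derived column.
-- A real, dictionary-generated trigger would also reject explicit
-- writes to extended_price rather than silently recomputing it.
create trigger tr_ORDER_LINES_calc on ORDER_LINES
after insert, update
as
begin
    update ol
       set extended_price = ol.qty * ol.price
      from ORDER_LINES ol
      join inserted i on i.order_line_id = ol.order_line_id
end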


So denormalizing by putting in derived values can make a database
much more valuable, but it does require a clear, systematic
approach to generating the derived values. There is no technical
problem associated with ensuring the values are correct, because
of course the application has to do that somehow somewhere anyway;
the real barriers tend to be psychological and political.



Arguments For Programmer Convenience



The second set of arguments for denormalization tend to be
rather weak, and come down to something like this (you have to
picture the programmer whining like a child when he
says this), "I don't like
my data scattered around so many tables, can't we play some
other game instead?"



Many programmers, when they first learn about normalization
and build a normalized database,
discover that the data they need to build a screen is "scattered"
about in many tables, and that it is tedious and troublesome to
get it all together for presentation to the user. A simple
example might be a contacts list. The main table is CONTACTS,
and it contains not much more than first and last name. A second
table is a list of PHONES for each contact, and a third
table is a list of various mailing addresses. A fourth table
of EMAILS stores their email addresses. This makes four tables
just to store a simple contact! We programmers look at this and
something inside of us says, "That's just way too complicated,
can't I do something else instead?"



This is a case of programmer convenience clashing with correctness
of data. Nobody argues (at least not that I've heard) that they
do not want the data to be correct, they just wonder if it is possible
to simplify the tables so that they do not have to go out to so
many places to get what they need.



In this case, programmers argue that denormalization will make
for simpler code if they deliberately skip one or more steps
in the normalizing process.
(Technically I like to call the
result a "non-normalized" database instead of denormalized, but
most people call it denormalized, so we will go with that.)



The argument goes something like this: I know for a fact that
nobody in the contacts list will have more than 3 emails, so
I'm going to skip the EMAILS table and just put columns EMAIL1,
EMAIL2, and EMAIL3 into the main CONTACTS table. In this case,
the programmer has decided to skip 1st Normal Form and put a
repeating group into the CONTACTS table. This he argues
makes for simpler database retrieval and easier coding.



The result is painfully predictable. The simplification the
programmer sought at one stage becomes a raft of complications
later on. Here is an example that will appear trivial but really
gets to the heart of the matter: how do you count how many
emails a user has? A simple SELECT COUNT(*)...GROUP BY CONTACT
that would have worked before now requires more complicated SQL.
But isn't this trivial? Is it really that bad? If all you are
coding is a contacts list, probably not. But if you are doing a
real application with hundreds of tables and this "convenience"
has been used in dozens of cases, then it becomes a detail that
programmers need to know on a table-by-table basis, an exception
to how things ought to be that has to be accounted for by anybody
who touches the table. In any shop with more than 5 programmers,
whatever convenience the original programmer gained is quickly
lost in the need to document and communicate these exceptions.
And this is only a single trivial example. (A sketch of the two
COUNT queries appears below.)
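
Here is roughly what those two counts look like; the table
and column names are hypothetical:

-- Normalized: one row per email in a child table
select contact_id, count(*) as email_count
  from EMAILS
 group by contact_id

-- Denormalized: EMAIL1/EMAIL2/EMAIL3 columns on CONTACTS, so
-- every query that needs the count must repeat this expression
select contact_id
     , case when email1 is null then 0 else 1 end
     + case when email2 is null then 0 else 1 end
     + case when email3 is null then 0 else 1 end as email_count
  from CONTACTS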



Other examples come up when it turns out you need more than
three slots for phones. In the normalized case this never comes
up. Any user can have any number of phones, and the code to
display the phones is running through a loop, so it does not
need to be modified for the case of 1 phone, 2 phones, etc.
But in the "convenient" denormalized case you now must
modify the table structure and the code that displays the contacts,
making it quite inconvenient.



Then you have the case of how to define unused slots. If the
user has only one email, do we make EMAIL2 and EMAIL3 empty
or NULL? This may also seem like a silly point until you've sat
through a flamewar at the whiteboard and discovered just how
passionate some people are about NULL values. Avoiding that argument
can save your shop a lot of wasted time.



In short, programmer convenience should never lead to a shortcut
in skipping normalization steps because it introduces far
more complications than it can ever pay for.




Related Essays







The normalization essays on this blog are:


Saturday, November 20, 2010

Prepare Now For Possible Future Head Transplant

This is the Database Programmer blog, for anybody who wants
practical advice on database use.



There are links to other essays at the bottom of this post.






Planning For The Unlikely



We programmers love to plan for things that will hardly
ever happen, like coding the system's upgrade engine to
handle spontaneous human combustion, making sure the
SQL scrubbing layer can also launch a rocket into
space, and, well, trying to work out ahead of time
what to do if
we ever need a head transplant.



The boss comes over and says, "Can you toss a simple
plot onto the sales staff's home page that shows sales
by day? Use that 'jquicky' or whatever you call it.
Should take a couple of hours, right?" And three days
later we're working on the world's greatest plotting
system that can report everything except what the
boss actually asked for because we haven't gotten
around to that part of it yet.
(Really, can he expect
me to just bang this out without the required
Seven Holy Layers of Abstraction and Five Ritual
Forms of Parameterization, and the Just and Wholesome
Mobile Support, or the features Not Yet Required
but visible to the Far Seeing Wise and Learned Men?)



Abstraction Contraptions



So what I am getting at is that programmers of all
stripes are addicted to abstraction, it gives us
goosebumps and makes us feel warm and tingly, and
so we do it even when we do not need to. We
build abstraction contraptions.



When it comes to designing a database, this
unhealthy proclivity can seriously slow you
down, because of what I call:



Ken's Law



Everybody wants to be remembered for something. If
I could write my own epitaph, it might be:




Table-based datastores are optimally abstract



This law is not about database access, it is about
database design. It can be expressed informally as:




People Understand Tables Just Fine



Or more rigorously as:




Table-based datastores are optimally
abstract; they require no additional abstraction
when requirements are converted to design; they
cannot be reduced to a less abstract form.



Structured Atomic Values



I should point out that this essay deals with
structured atomic values, who live in the
Kingdom of The Relational Database. The
concepts discussed here do not apply to
free-text documents or images, or sound
files, or any other media.



No Additional Abstraction Required



My basic claim here is that you cannot create
an abstraction of data schemas that will pay
for itself. At best you will create a
description of a database where everything
has been given a different name, where tables have been
designated
'jingdabs' and columns have been designated
'floopies' and in the end all of your jingdab
floopies will become columns in tables. Oh,
and I suppose I should mention the Kwengars will
be foreign keys and the Slerzies will be primary
keys.



After that it goes downhill, because if we
generate an abstraction that is not a simple
one-to-one mapping, we actually obscure the
end goal. Consider an example so simple as to
border on the trivial or absurd.
Why would we ever use the terms
"One-to-Many" or "Many-to-Many" when the more
precise terms "child table" and "cross-reference
table" convey the same idea without the noise?
I said above that this would sound
trivial, and you can accuse me of nit-picking,
but this is one of those camel's nose things,
or perhaps a slippery slope. When technical folk
get together to design a system, we should call
things what they are, and not make up new words
to make ourselves sound smarter.




No de-Abstraction is Possible



The second half of Ken's law says that you
cannot de-Abstract a table schema into some
simpler form. This should be very easy to
understand, because relational databases
deal with atomic values, that is, values
which cannot themselves be decomposed. If
you cannot decompose something, then it cannot
be an abstraction of something more specific.



Going further, if the schema has been
normalized, then every fact is stored in
exactly one place, so no further simplification
is possible. If you cannot simplify it or
resolve it into something more specific, then
it is not an abstraction of something else.




But Does it Work?



I originally began to suspect the existence
of what I call in all humility "Ken's Law"
when I was sitting at a large conference table
with my boss, her boss, a couple of peers, and
3 or 4 reps from a Fortune 5 company. My job
was basically to be C3PO, human-cyborg relations.
Some people at the table protested loudly to
being non-technical, while others had technical
roles. But everybody at the table spent all
day discussing table design.



Later on, when at a different position, the
programmers received their instructions from
Project Managers. The best Project Managers worked
with their customers to figure out what they were
trying to keep track of, and handed us specs that
were basically table layouts. The customers loved
these guys because they felt they could "understand
what the project manager was talking about", and
the project managers, who swore they were not technical,
were respected because they handed us requirements
we could actually understand and implement.



Since that time I have further learned that it
is very easy for anybody who deals with non-technical
people to bring them directly to table design without
telling them you are doing it. All you have to do
is listen to what words they use and adapt your
conversation accordingly. If they say things like
"I need a screen that shows me orders by customer
types" they have told you there will be a table of
customer types. Talk to them in terms of screens.
If they say, "Our catalog has 3 different
price lists and four discount schemes" then you know that
there will be a PRICELIST table, a DISCOUNTS table, and
likely some cross-references and parent-child relationships
going on here.
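
As a purely hypothetical sketch, that one sentence from the
customer might translate into something like the tables below,
plus a cross-reference that assigns prices to items:

-- Hypothetical sketch only: names invented for illustration
create table PRICELISTS ( pricelist varchar(10) not null primary key )

create table DISCOUNTS  ( discount  varchar(10) not null primary key )

-- Cross-reference: which price an item carries on each price list
create table ITEM_PRICES (
    item       varchar(20)   not null
,   pricelist  varchar(10)   not null references PRICELISTS (pricelist)
,   price      numeric(10,2) not null
,   primary key (item, pricelist)
)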



So How Does ORM Come Into This?



One of the greatest abstraction contraptions of
this century (so far), is ORM, or Object-Relational
Mapping, which I do not use
precisely because it is an abstraction contraption.



To be clear, the mistake that ORM makes is not
at the design phase, but at the access phase.
The ORM meme complex instructs its victims
that it is ok to put
structured atomic values into a Relational
Database, but when it comes time to access and
use that data we will pretend we did not put it
into tables and we will pretend that the data is in
objects.
In this sense the term Object-Relational Mapping
is a complete misnomer, because
the point is not to map data to objects but to
create the illusion that the tables do not even
exist.



Perhaps ORM should stand for Obscuring Reality Machine.



Getting Back to That Head Transplant



So what about that weird title involving head
transplants? Obviously a head transplant is
impossible, making it also very unlikely, besides
being silly and nonsensical. It came to mind
as a kind of aggregate of all of the bizarre
and unrealistic ideas about abstracting data
designs that I have heard over the years.



One of these ideas is that it is possible and
beneficial to create a design that is abstract
so that it can be implemented in any model:
relational, hierarchical, or network. I'm not
saying such a thing is impossible, it is likely
just a small matter of programming,
but for heaven's sake, what's the point?



Conclusion



So don't waste time creating abstractions that
add steps, possibly obscure the goal, and
add no value. Don't plan for things that are
not likely to happen, and avoid abstraction
contraptions.




Related Essays







Other philosophy essays are:



Saturday, November 13, 2010

Database Skills

It seems strange to me that I've been working on this blog
for 3 years or so (with one very long break) and somehow never
got around to writing a simple list of skills that all
database experts need. So here it is!



Various Job Titles for Database People



There are three common job titles in the database area,
which are Database Administrator (DBA), Database Programmer,
and Database Architect. These titles tend to be somewhat
variable from shop-to-shop, but generally the "Architect"
term indicates the highest level of skill combined with
considerable management responsibilities. The "Programmer"
term is somewhere below that, but the "DBA" is extremely
variable. I have seen shops where a person was called a
DBA and filled a relatively constrained role closer to
IT or operations (routine tasks, no real programming) and
other shops where a person with the DBA title was basically
the Architect.



Because of this high variability in what titles mean, I am not
going to waste time categorizing skills as belonging to one
job title or another; I am simply going to list them all out.



The various levels of skills are these:



  • Before Hello World!: The basics of tables, columns, rows
  • The Hello World! Level: SQL Select
  • Just after Hello World!: Writing Data
  • Commands to create, modify and drop tables, or Data
    Definition Language (DDL)
  • Knowing how to use a Query Analyzer or optimization tool
  • Understanding Normalization
  • Understanding Denormalization
  • Understanding Primary Keys, Foreign Keys and Constraints
  • Understanding Transactions
  • Understanding ACID
  • Understanding Indexes as optimization tool
  • Views
  • Database Security
  • Upgrades and Installs
  • Efficient access of database from application
  • Bulk operations: loading or exporting large amounts
    of data
  • Understanding of Contexts and how they lead to
    different sets of Best Practices
  • Preventing performance degradation through
    various maintenance tasks
  • Deployment strategies: partitioning, tablespaces
  • Deployment strategies, failure protection, from
    simple backup to hot standbys
  • Server side coding: stored procedures and functions
  • Server side coding: triggers
  • Temporary tables


As long as that list is, it only covers those of us who
use database systems. There is an entire set of
skills for those who actually create and maintain these
systems, but that is not something that will be treated
in this blog.



Before Hello World!: Tables and Columns



If you have never so much as typed a single SQL command,
or seen a table diagram, or anything like that, then it is
worth a few minutes to go through the basics of what
a database does, which is to organize atomic values into
tables.



I am going to write an essay on this soon, even though it
may seem so obvious as to be completely unnecessary. But I
will do it because the most popular essay on this
blog is about using GROUP BY, which tells me newer programmers
are starving for useful tutorials at the beginner level.
So it seems to me, why not put something out there at the
very beginning of the beginning?




The Hello World! Level: SQL Select



If you are starting completely from scratch and want to know
about database programming, you want to start with the SQL
SELECT command. This is the (almost) only command used to
extract data from a database, and all of the possible ways to
combine, filter and extract data are expressed in the many
clauses of this command.
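
As a small made-up example (the table and columns do not
come from any particular system), even a modest SELECT can
exercise several of those clauses at once:

-- A made-up example showing the most common clauses
select customer_id
     , count(*)          as order_count
     , sum(order_total)  as total_sales
  from ORDERS
 where order_date >= '2010-01-01'
 group by customer_id
having sum(order_total) > 1000
 order by total_sales desc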






Just after Hello World!: Writing Data



When it comes time to change the data in a database
there are three commands, listed below. These commands
are based on the tables-and-rows nature of databases,
and allow you to add a row (or rows), change a row (or rows),
and delete a row (or rows).

  • The INSERT command
  • The UPDATE command
  • The DELETE command
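
Minimal examples of the three commands, using a hypothetical
CONTACTS table:

-- Add a row
insert into CONTACTS (first_name, last_name) values ('Pat', 'Smith')

-- Change a row (or rows)
update CONTACTS
   set last_name = 'Jones'
 where contact_id = 1

-- Delete a row (or rows)
delete from CONTACTS
 where contact_id = 1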


Commands to create, modify and drop tables, or Data
Definition Language (DDL)



The term "DDL" stands for "Data Definition Language"
and includes all of the commands used to build the
tables that will hold the data for the INSERT, UPDATE,
DELETE and SELECT statements. The basic list of
commands to be familiar with is:



  • Understanding Data Types (databases are strongly typed)
  • CREATE TABLE and ALTER TABLE
  • Commands to add and drop primary keys
  • Commands to add and drop foreign keys
  • Commands to add and drop constraints
  • Commands to add and drop indexes
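
Here is a small hypothetical sample of DDL showing a table
with a primary key, a child table with a foreign key, and an
ALTER TABLE:

-- Hypothetical sketch: a parent table and a child table
create table CONTACTS (
    contact_id  int          not null primary key
,   first_name  varchar(50)  not null
,   last_name   varchar(50)  not null
)

create table PHONES (
    phone_id    int          not null primary key
,   contact_id  int          not null references CONTACTS (contact_id)
,   phone       varchar(25)  not null
)

-- Adding a column later
alter table CONTACTS add middle_name varchar(50) null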


There are also numerous commands that are specific
to different products. Those will not be listed here
today, but who knows what the future may bring.




Knowing how to use a Query Analyzer or optimization tool



Database programmers, once they get started with the skills
listed above, tend to become more and more obsessed with
performance. Every major database has some type of tool
that lets you examine how the server is going to process
a SQL SELECT, and database programmers depend on these tools
to discover where they might alter tables or indexes or the
SELECT itself to make the queries go faster.
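
The exact commands vary by product. As hedged examples, SQL
Server can return the estimated plan as text, and PostgreSQL
uses EXPLAIN (the query itself is made up):

-- SQL Server: show the estimated plan instead of running the query
set showplan_all on
go
select * from ORDERS where customer_id = 42
go
set showplan_all off
go

-- PostgreSQL does the same job with EXPLAIN:
-- EXPLAIN SELECT * FROM orders WHERE customer_id = 42;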




Understanding Normalization



The term "normalization" refers to the process of analyzing
the data that your system is required to store, and organizing
it so that every fact is stored in exactly one place. Understanding
how to normalize data is an absolute requirement for the
database programmer who wants to design databases.



We speak of normalization in "forms" as in "first normal form",
"second normal form", and so on. It is a good idea to understand
The argument for normalization and then to pursue
at the very least:





Normalization is a fascinating topic to study, and it
extends all the way up to "Domain-key Normal Form", which is
considered the most complete normalization for a database.




Understanding Denormalization



Every database programmer, after fully understanding
normalization, realizes that there are severe practical
problems with a fully normalized database: such a database
solves many problems but generates problems of its own.
This has led programmer after programmer down the path
of denormalization, the deliberate re-introduction
of redundant values to improve the usability of the
database.



There is a surprising lack of material available on the
web regarding denormalization strategies. Most of what
you find is arguments and flame wars about whether or not
to do it, with little to nothing on how to actually do it.
For this reason, I provide my own essays on this blog on
the strategies and methods I have worked out over the years:



After reviewing The Argument For Denormalization
it is worthwhile to follow up with:





The arguments for and against denormalization are heavily
affected by the Pay me now or pay me later
design tradeoff.




Understanding Primary Keys, Foreign Keys and Constraints



One might argue that this list of skills belongs much higher
up the list, up there with the CREATE TABLE command. However,
I have found it useful to distinguish between simply knowing the
commands
to make a primary key and actually understanding
the tremendous power of keys.



In this author's opinion it is not truly possible to understand
how powerful and beneficial Primary keys and Foreign Keys are
for an entire application stack until you have learned the commands,
built some databases, and worked through the concepts of normalization
and denormalization. Only then can you revisit these humble
tools and realize how powerful they are.





Understanding Transactions



The word "transaction" has two meanings in common day-to-day
database talk. One meaning is very loose and refers to some
individual command or set of commands. You might hear somebody
using the term loosely when they say, "We're seeing about
10 transactions per second this week."



The more rigorous use of the term refers to a statement or
set of statements that must be guaranteed to either
complete in their entirety or fail in their entirety.
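
In SQL Server syntax the rigorous meaning looks roughly like
this (the ACCOUNTS table is hypothetical); either both updates
take effect or neither does:

-- Hypothetical sketch: move money between two accounts
begin try
    begin transaction

    update ACCOUNTS set balance = balance - 100 where account_id = 1
    update ACCOUNTS set balance = balance + 100 where account_id = 2

    commit transaction
end try
begin catch
    -- Any error undoes both updates
    rollback transaction
end catch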

This is a profoundly important concept once you get beyond
simply making tables with keys and get into real-world
heavy multi-user activity. And this leads us to the
next topic...



Understanding ACID



Modern relational databases expect multiple simultaneous
users to be writing and reading data all of the time.
The term "ACID Compliance" refers to both the philosophy
of how to handle this and the actual methods that
implement that philosophy. The term ACID refers to:



  • The Atomic nature of each transaction
  • The Consistency of the database during and
    after simultaneous overlapping transactions
  • The Isolation of each transaction
  • The Durability of the results


Understanding Indexes as optimization tool



An index is a special tool databases use to provide very
rapid access to large amounts of data. Just like keys, it
is not enough to know the commands, it is necessary to
understand the subtle power of indexes when used with some
craftsmanship. The basic uses of indexes are:



  • A simple index on a column to provide rapid search
    on that column
  • A "covering index" that includes extra columns that
    can further speed up certain common access patterns
  • Clustered indexes (MS SQL Server) and what they give
    and what they take away
  • The cost of indexes on write operations
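
A couple of hedged examples in SQL Server syntax, with
made-up table and column names:

-- A simple index to speed searches on last_name
create index ix_contacts_last_name on CONTACTS (last_name)

-- A "covering" index: the INCLUDE columns let some common
-- queries be answered entirely from the index
create index ix_orders_customer
    on ORDERS (customer_id)
    include (order_date, order_total)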



Views



A view looks like a table to the SQL SELECT command. The view
itself is a stored SQL SELECT command that encodes some
query that is either used very often or is very complex. In
all cases, views are used to present the database data to
the application in some simplified convenient or secure
form. The two major uses of views are:

  • To simplify the application programmer's job
  • To provide a read-only interface for
    some applications
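
A minimal hypothetical example; to the SELECT command the
view looks just like a table:

-- A hypothetical view that hides a join and a calculation
-- from the application
create view V_ORDER_TOTALS as
select o.order_id
     , o.customer_id
     , sum(ol.extended_price) as order_total
  from ORDERS o
  join ORDER_LINES ol on ol.order_id = o.order_id
 group by o.order_id, o.customer_id

-- Used like any other table:
-- select * from V_ORDER_TOTALS where customer_id = 42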


Upgrades and Installs



If you are a single programmer or hobbyist working with a
database, it is all well and good to just add and drop tables
as you wish. But as soon as you get into development
with quality control stages and multiple programmers, it becomes
evident that you need a strategy for handling the
schema changes that come with new versions of
the system. There are multiple essays available on
this blog, covering:







Database Security



Databases provide incredible security provisions that are
just about completely ignored by modern web developers.
Sometimes there is good reason for this, but overall anybody
who wants to become a truly accomplished Database Programmer
or Database Architect must have a thorough understanding
of database security and how it can simplify the entire
system stack.



Database security comes down to specifying who is allowed
to perform the 4 basic operations of INSERT, UPDATE,
DELETE and SELECT against which tables:



My basic introduction to security is here.

  • Understanding roles (we used to say users and groups)
  • Simple table-level security
  • Column-level security (not widely supported)
  • Row-level security (not widely supported)
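
In standard SQL terms this comes down to GRANT and REVOKE;
a hypothetical sketch:

-- Hypothetical sketch: a reporting role may only read orders,
-- while the order entry role may also write them
grant select on ORDERS to reporting_role
grant select, insert, update on ORDERS to order_entry_role
revoke delete on ORDERS from reporting_role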



Efficient access of database from application



Imagine you have the perfectly designed database, with
every nuance and subtlety excellently crafted in the
areas of keys, indexes, normalization, denormalization
and security. At this point your job branches out into
several new areas, but one of the most important is
knowing how to write application code that efficiently
accesses the database.



Bulk operations: loading or exporting large amounts of data



Some database applications involve a large number of small
transactions, where each trip to the database writes only a
single row or reads only a dozen or so rows.



But in many cases you need to bulk load large amounts of
data in one shot, thousands or even millions of rows. In
these cases the techniques that govern small transactions
are useless and counter-productive, and you need to learn
some new commands and strategies to handle the bulk loads.
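
The commands are product-specific. As hedged examples, SQL
Server has BULK INSERT and PostgreSQL has COPY (the file paths
and table are made up):

-- SQL Server: load a large flat file in one statement
bulk insert ORDERS
from 'C:\loads\orders.csv'
with (fieldterminator = ',', rowterminator = '\n', firstrow = 2)

-- PostgreSQL does the same job with COPY:
-- COPY orders FROM '/loads/orders.csv' WITH (FORMAT csv, HEADER true);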



Understanding Contexts and how they lead to different sets of Best Practices



Not all databases are created for the same purpose. If you
have a very large operation then it will likely have multiple
independent databases that fill the classical roles, while in
a smaller shop the roles may be combined in one database. I
like to refer to these roles as "contexts" because they determine
how the tables will be designed and how access to the tables
will be governed. The major contexts are:



  • OLTP or Online Transaction Processing, characterized
    by simultaneous reads and writes, generally assumes
    little or no periods of inactivity, and generally
    assumes that the individual transactions are very
    small. The apps we were all writing in the 80s and
    90s to do accounting, ERP, MRP, Job control, payroll,
    airline reservations and many others fall into this
    context.
  • Data Warehouse context, characterized by periodic
    bulk loads of new information with most activity
    being reads. The Data Warehouse context is largely
    associated with the "Star Schema" table design.
    Data in a Warehouse is historical, it never changes
    after it is loaded.
  • CMS or Content Management System, also characterized
    by very few writes compared to reads, but more likely
    to have a normalized structure. Unlike a Data
    Warehouse, the data is subject to change, just not that
    often.
  • Any other Read Only Context. I include this category
    because I spent some time working on Direct Marketing
    databases, which are like a Data Warehouse in that they
    are updated periodically and the data does not change,
    but the Star Schema is completely inappropriate for them.


If you consider a huge online shopping system, you can see that
within that application there are at least two contexts. The
product catalog is likely to see vastly fewer writes than
reads, but the shopping cart tables will be in a constant state
of reads and writes.





Preventing performance degradation through
various maintenance tasks



Once the database and its application stack is up and running,
and the reads and writes are coming through, the laws of
thermodynamics come into play and system performance can
begin to degrade even if the database stays the same size
and the load on the system is steady.



Different vendors have different tools for combatting this,
but they tend to come down to reclaiming temporary space and
occasionally rebuilding indexes. There are also log files
that have to be purged, regular backups to be made, and other
operations along those lines.



Deployment strategies: partitioning, tablespaces



When systems become sufficiently large, it is no longer
possible to just attach some disks to a box and run
a database server. The Database Architect must consider
breaking different tables out onto different sets of
spindles, which is usually done with "tablespaces", and
moving older data onto slower cheaper spindles, which is
often done with Partitioning.



Deployment strategies, failure protection, from
simple backup to hot standbys



Because a database typically experiences simultaneous
reads and writes from multiple sources, and may be expected
to be up and running 24/7 indefinitely, the concept
of making a backup and recovering from a failure becomes
more complicated than simply copying a few files to a
safe location.



In the most demanding case, you will need to provide a
second complete box that can become fully live within
seconds of a disastrous failure of the main box. This is
called various things, but Postgres calls it a "hot standby"
in version 9 and some MS SQL Server shops call it a
"failover cluster."



The ability to come up live on a second box when the first
one fails is made possible by the way databases handle
ACID compliance, and the fact that they produce something
called a Write-Ahead-Log (WAL) that can be fed into a
second box that "replays" the log so that its copy of the
database is getting the same changes as the master copy.



Server side coding: stored procedures and functions



I really could not figure out where to put this entry
in the list, so I just punted and put it near the end.
It could really go anywhere.



Stored procedures or functions are procedural routines
(not object oriented) that are on the database server and
can be invoked directly from an application or embedded
inside of SQL commands. Generally speaking they provide
various flow-control statements and rudimentary variable
support so that you can code multi-step processes on the
server itself instead of putting them in application code.
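
A minimal hypothetical example in SQL Server syntax, showing
a parameter, a variable, and a little flow control:

-- Hypothetical sketch of a stored procedure
create procedure usp_flag_big_orders
    @threshold numeric(10,2)
as
begin
    declare @how_many int

    select @how_many = count(*)
      from ORDERS
     where order_total > @threshold

    if @how_many > 0
        update ORDERS
           set big_order_flag = 1
         where order_total > @threshold
end

-- Invoked from an application or another batch:
-- exec usp_flag_big_orders @threshold = 1000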



Server side coding: Triggers



Triggers are quite possibly the most elegant and beautiful
technology that servers support, absolutely the least
understood, and definitely the most reviled by the
ignorant. You will find virtually no web content today
that will explain why and how to use triggers and
what they are good for.



Except of course for my own essay on
triggers
that discusses them in terms of
encapsulation.

Temporary tables



Temporary tables are like Stored Procedures inasmuch as
I had no idea where to put them in the list, so they
just ended up at the end.



As the name implies, a temporary table is a table that
you can create on-the-fly, and which usually disappears
when the session or procedure that created it is finished.
They are most often found in Stored Procedures. They can
impact performance for the worse in many ways, but can be
extremely useful when you are doing multi-staged analysis of
data in a Data Warehouse (that's where I use them the most).
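
A hypothetical SQL Server example; the # prefix marks a local
temporary table:

-- Stage an intermediate result, work on it, then drop it
select customer_id, sum(order_total) as total_sales
  into #sales_by_customer
  from ORDERS
 group by customer_id

create index ix_tmp_sales on #sales_by_customer (total_sales)

select top 10 *
  from #sales_by_customer
 order by total_sales desc

drop table #sales_by_customer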

Saturday, November 6, 2010

Recursive Queries with Common Table Expressions


This week The Database Programmer returns after almost 18 months
with an entry on using Common Table Expressions (CTEs) to do
recursive queries. Relational databases were plagued from their
inception with a lack of meaningful treatment for recursive
operations, and CTEs have finally plugged that hole.




Common Table Expressions appeared in SQL Server 2005, and in
PostgreSQL 8.4, and are also available in Oracle. As for
mySQL, since I don't use it, I did a quick Google search and
looked at the Docs for 5.5, and couldn't really find
anything. I generally tend to assume mySQL cannot do the
tough stuff.



But First, A Rant




There have always been plenty of people who claimed SQL was
a bloated and clumsy language to work with. Most of the time
I tend to agree, but I find the advantages of relational/SQL
systems to be so large that I'm willing to pay that price.




But with Common Table Expressions (CTEs) I just can't help
drifting into conspiracy theories involving the enemies of
SQL infiltrating the committees and deliberately suggesting
the most twisted, bloated, complicated way they could think
of to do what is really a very basic operation. In other
words, I am profoundly unimpressed with the syntax of CTEs,
but as long as they are here and they work, we'll go along.



The Basic Example




Your basic recursive table contains a foreign key to itself,
so that some rows in the table are children of some other
row in the table. This recursion can nest to any depth,
and the chart below shows a very simple example:




Primary_key | Parent_Key | Notes
------------+------------+-------------------------------
A           | null       | top level row, no parent
B           | A          | first level child of A
C           | B          | child of B, grandchild of A
D           | C          | child of C, grandchild of B,
            |            |   great-grandchild of A
X           | null       | top level row, no parent
Y           | X          | child of X
Z           | Y          | child of Y, grandchild of X


What we want is a query that can return a given row
and all of its children out to any level, including
helpful information about the structure of the recursion,
something like this:




Primary_key | Level | Top_Key | Immediate_Parent_key
------------+-------+---------+----------------------
A           | 1     | A       | null
B           | 2     | A       | A
C           | 3     | A       | B
D           | 4     | A       | C
X           | 1     | X       | null
Y           | 2     | X       | X
Z           | 3     | X       | Y



And Another Rant




At this point the mind boggles at how long this blog entry
needs to be to explain this simple operation. But let's
get going anyway.



The First Step and Last Step



A Common Table Expression begins with the "WITH" clause
and ends with a standard SQL Select:




;WITH myCTEName (primary_key,level,top_key,immediate_parent_key)
as (
....we'll get to this below....
)
select * from myCTEName


The basic idea is that we are going to define a CTE with
a name and a list of columns, and then SELECT out of it.
Without that final SELECT statement the CTE does not actually
do anything. The SELECT can also be arbitrarily complex, doing
aggregations, WHERE clauses and anything else you might need.



The first thing to notice is the leading semi-colon. This is
a trick adopted by MS SQL Server users. SQL Server does not
require statements to be terminated with a semi-colon, but a
SQL Server CTE requires the previous statement to have
been terminated with a semi-colon (nice huh?). So SQL Server
programmers adopted the strategy of starting the CTE with
a semi-colon, which keeps the syntactical requirement with
the CTE, where it belongs.



A given CTE sort of has a name. That is, you have to name
it something, but think of it as a table alias in a SQL SELECT,
such as "Select * from myTable a JOIN otherTable b...", it
exists only during the execution of the statement.



The columns listed in the parentheses can have any names
(at least in SQL Server). But these column names are what
you will refer to in the final SQL SELECT statement.



Coding The Inside of the CTE, Step 1



Now we code the inside of the CTE in two steps.
The first step is called the "anchor", and it is a
straightforward query to find the top-level rows:




;WITH myCTEName (primary_key,level,top_key,immediate_parent_key)
as (
select primary_key as primary_key
, 1 as level
, primary_key as top_key
, null as immediate_parent_key
from myRecursiveTable
where Parent_key is null
)
select * from myCTEName


This should be self-explanatory, we are querying only for
rows that have no parent (WHERE Parent_key is null) and we
are hardcoding the "level" column to 1, and we are also
hardcoding the "immediate_parent_key" column to null.



This query alone would return two of the rows from
our desired output:




Primary_key | Level | Top_Key | Immediate_Parent_key
------------+-------+---------+----------------------
A           | 1     | A       | null
X           | 1     | X       | null


Coding The Inside of the CTE, Step 2



Now we are going to add the actual recursion. When I first
learned CTEs this was the hardest part to figure out, because it
turned out my hard-won set-oriented thinking was actually slowing me
down; I had to think like a procedural programmer when defining
the second half of the query.




;WITH myCTEName (primary_key,level,top_key,immediate_parent_key)
as (
select primary_key,1,primary_key,null
from myRecursiveTable
where Parent_key is null
UNION ALL
select chd.primary_key,par.level+1,par.top_key,chd.parent_key
FROM myCTEName par
JOIN myRecursiveTable chd ON chd.parent_key = par.primary_key
)
select * from myCTEName


Thinking step-wise, here is what is going on under the hood:



  1. The server executes the "anchor" query, generating a
    result set called "myCTEName" containing just the
    top level rows.
  2. The server then executes the second query. At this
    point the result set "myCTEName" exists and can be
    referenced, so that you can link children to their
    parents. (That's why you see the JOIN)
  3. Step 2 is repeated recursively, adding grand-children,
    great-grand-children, and so on, until no more rows
    are being added, at which point it stops, and...
  4. The final result set is passed to the trailing
    SELECT, which pulls results out of "myCTEName"
    as if it were a table or view.


So when we code the 2nd part of the inside of the CTE, the
part after the UNION ALL, we act as if the first query has
already run and produced a table/view called "myCTEName"
that can be referenced. Once you understand that, the
query is pretty easy to understand:



  • The "From myCTEName par" clause tells us we are pulling
    from the previously generated set. I like to use the alias
    "par" for "parent" to remind myself that the prior result is
    the parent row.
  • We then join to the original source table and use the
    alias "chd" to remind ourselves we are pulling child rows
    from there. The "ON chd.parent_key = par.primary_key"
    defines how children are joined to parents.
  • Our first column, "chd.primary_key", is the unique
    key for the results.
  • Our second column, "par.level+1" gives us a nifty
    automatically incremented "level" column.
  • Our third column, "par.top_key" ensures that all rows
    contain a reference to their top-most parent.
  • Our final column, "chd.parent_key", makes sure each
    row contains a reference to its immediate parent.


Finding Various Statistics



Once you have the inside of the CTE coded, the fun part
moves to the final SELECT, which is operating on the
complete set of results. You do not necessarily have to pull
the complete list. For instance, you may want to find out
the maximum nesting level for each parent, or the count
of children for each parent:




;WITH myCTEName (primary_key,level,top_key,immediate_parent_key)
as (
select primary_key,1,primary_key,null
from myRecursiveTable
where Parent_key is null
UNION ALL
select chd.primary_key,par.level+1,par.top_key,chd.parent_key
FROM myCTEName par
JOIN myRecursiveTable chd ON chd.parent_key = par.primary_key
)
select top_key
, max(level) as maxNestingLevel
, count(*) as countRows
, count(*)-1 as countChildren
from myCTEName
GROUP BY top_key


Conclusion



Common Table Expressions give SQL-based databases the (very)
long-needed ability to execute recursive queries, albeit with
a rather complex syntax. Once you grasp the basics of how
to code them, there are many possible uses that go far beyond
the simple example shown here.