Reader Dean Thrasher
of Infovark has submitted
a schema for review and analysis as part of my
>User-Submitted Analysis Request series.
Today we are going to take a first look
at what he has. Mr. Thrasher and I both hope that any and all readers
will benefit from the exercise of publicly reviewing the schema.
This particular analysis request is a great start to the series,
because it has to do with email. Everybody uses email so we all
understand at a very basic level what data will be handled.
Brief Introduction to User-Submitted Schemas
Mr. Thrasher and I have exchanged a couple of emails, but we have
avoided any in-depth discussion. Instead, we want to carry out the
conversation on the public blog. So I am not aiming to provide
any "from on high" perfect analysis, instead this essay will contain
a lot of questions and suggestions, and we will then move into the
comments to go forward.
Disclosure: None. We are not paying each other anything, nor have
I received any merchandise that would normally carry a licensing fee.
Today's essay is the very first in the
>User-Submitted Anlaysis Requests series. If you would like to see an analysis
of your schema, follow that link and contact me.
This blog has a Complete Table of Contents and a list
of Database Skills.
Brief Description and Starting Point
To get us started, I am going to quote the >Infovark Product Page, and then we will see what we want to zoom in on:
Infovark automatically collects and catalogs your files and email. It consolidates your digital life into a personal wiki based on what it finds. Once you set Infovark to work, it will monitor your computer and keep your web site up-to-date
So we know even before we see anything technical that we are going to have
tables of contacts, emails, phones, addresses, appointments and many other
things pulled in from email systems, plus the value-add provided by the
product.
The Schema As-Is
We are going to start by looking at how the details of a CONTACT
are stored. The schema models contacts with a group of
cross references, aka many-to-many relationships, like so:
CONTACTS +----- CONTACTS-X-EMAILS -------- EMAILADDRESSES
|
+----- CONTACTS-X-PHONES -------- PHONES
|
+----- CONTACTS-X-ADDRESSES ----- ADDRESSES
|
+----- CONTACTS-X-WEBADDRESSES--- WEBADDRESSES
The first thing we have to note is that there is nothing wrong
with this at all. It is fully normalized and so it will be
very easy to make sure that database writes will not produce
anomalies or bad data.
But, not surprisingly, Mr. Thrasher notes this makes for complicated
SELECTS, so we want to ask if
perhaps it is over-normalized, are there complications
in there that do not need to be there?
Email as A Property of Contact
If I were to
>follow my own advice, I would first want to identify the
master tables. Master tables generally represent real things in the
world: people, places, products, services, events.
So my first question is this: is an email address a free-standing
entity in its own right that deserves a master table? Or is it
instead a property of the CONTACT? I am going to suggest that an
email address is a property of a CONTACT, and, since a CONTACT
may have more than one email address, they should be stored in
a child table of the CONTACTS, more like this:
CONTACTS +-----CONTACTS-X-EMAILS -------- EMAILADDRESSES
+----- CONTACTEMAILADDRESSES
|
+----- CONTACTS-X-PHONES -------- PHONES
|
+----- CONTACTS-X-ADDRESSES ----- ADDRESSES
|
+----- CONTACTS-X-WEBADDRESSES--- WEBADDRESSES
Whether or not we make this switch depends not on
technical arguments about keys or data types, but on
whether this accurately models reality. If in fact
email addresses are simply properties of contacts, then
this is the simplest way to do it. Going further, the
code that imports and reads the data will be easier to
code, debug and maintain for two reasons: one, because
it is simpler, but more importantly, two, because it
accurately models reality and therefore will be easier
to think about.
If this proves to be the right way to go, it may be
a one-off improvement, or it may repeat itself for
Phones, Addresses, and Web Addresses, but we will take
that up in the next post in the series.
I am going to proceed as if this change is correct, and
ask then how it will ripple through the rest of the
system.
Some Specifics on the Email Addresses Table
The EMAILADDRESSES table currently has these columns:
-- SQL Flavor is Firebird
CREATE TABLE EMAILADDRESS (
ID INTEGER NOT NULL,
USERNAME VARCHAR(64) NOT NULL COLLATE UNICODE_CI,
HOSTNAME VARCHAR(255) NOT NULL COLLATE UNICODE_CI,
DISPLAYNAME VARCHAR(255) NOT NULL
);
ALTER TABLE EMAILADDRESS
ADD CONSTRAINT PK_EMAILADDRESS
PRIMARY KEY (ID);
CREATE TRIGGER BI_EMAILADDRESS FOR EMAILADDRESS
ACTIVE BEFORE INSERT POSITION 0
AS
BEGIN
IF (NEW.ID IS NULL) THEN
NEW.ID = GEN_ID(GEN_EMAILADDRESS_ID,1);
END^
Suggestion: The first thing I notice is that the
complete email itself is not actually stored. So we need to
ask Mr. Thrasher what the thinking behind that was. My first
instinct is to store that, because it is the original natural
value of interest.
Suggestion: The columns USERNAME and HOSTNAME I could
go either way on. If they are needed for querying and statistics,
it is better to put them in. While this violates 3rd Normal Form and
so puts us at risk, the values are supplied AFAIK by a batch import,
and so there is only one codepath populating them, and we are likely safe.
However, if we DO NOT need to query these values for statistics,
and they are only there for convenience at display time, I would
likely remove them and generate them on-the-fly in application
code. There are some other good reasons to do this that will
come up a little further along.
Suggestion: Unless I missed something in the schema
sent over, we need a unique constraint on the combination of
CONTACTID and USERNAME and HOSTNAME. Or, if we remove
USERNAME and HOSTNAME in favor of the original EMAILADDRESS,
we need a unique constraint on CONTACTID + EMAILADDRESS.
Before We Get To Transactions
We are about to go into Part 2, which is about the other
tables that reference EMAILADDRESSES, but before we do
let's look at what the two tables would be if we made all
changes suggested so far:
CONTACTS EMAILADDRESSES
------------ --------------------
ID (surrogate key)
CONTACT_ID --------& CONTACT_ID
other columns... EMAILADDRESS
LABEL
USERNAME (possibly removed)
HOSTNAME (possibly removed)
DISPLAYNAME
You may notice the LABEL column showed up out of nowhere.
That column was previously in the cross-reference. When
the cross-reference went away it landed in EMAILADDRESSES.
That column LABEL holds values like "work", "home" and
so on. It is supplied from whatever system we pull
emails from, and so we have no constraints on it or
rules about it.
Changing Emails And Transactions
Now we move on from the basic storage of EMAIL addresses
to the other tables that reference those addresses.
These are things like emails themselves with their lists
people sent to/from, and meetings, and presumably other
types of transactions as well.
When we look at transactions, which will reference
contacts and email addresses, we also have to consider
the fact that a CONTACT may change their email address
over time. Consider a person working for Vendor A, who
moves over to Vendor B. For some of the transactions
they will have been at Vendor A, and then going forward
they are all at Vendor B. This leads to this very
important question:
Do Transactions store details about the CONTACTS as
they were at the time of the transaction, or as they
are now?
In other words, if a CONTACT moves from one company to
another, and you look at a meeting with that person
from last year, should it link to where they are now?
Or should it be full of information about where they
were at the time?
The answer to this question is important because it
determines how to proceed on the two final points I
would like to raise:
- Should the various transactions have a foreign
key back to EMAILADDRESSES, or should they simply
link back to CONTACTS and contain the EMAILADDRESS
itself?
- Do we need an integer surrogate key on the
EMAILADDRESSES table, especially if we do not link
back to it?
First Final Suggestion
So the first of the final two suggestions is: maybe
the transactions tables should just link back to CONTACTID
and contain a freestanding EMAILADDRESS. The first argument for
this is that it preserves the history as it was, and
if that is what we want, then this accomplishes it.
The second argument is that by putting the actual value
instead of an integer key back to some table, we
simplify coding by removing a join.
The arguments against embedding the email address might
be basically, "hey, if this is a kind of a data warehoues,
you are really supposed to be doing the snowflake thing and
you don't want to waste space on that value." To which I
respond that the engineer always has the choice of trading
space for speed. Putting the email in directly is a correct
recording of a fact, and takes more space, but eliminates
a very common JOIN from many queries, so Mr. Thrasher may
choose to make that call.
This also plays back to my question about whether we should
have USERNAME and HOSTNAME in the EMAILADDRESSES table.
If we start putting email addresses directly into tables,
we can also keep putting these other two columns in, which
trades again space for speed. We could also skip them and
code a parser in the application that generates them
on-the-fly as needed.
Second Final Suggestion
Now we go all of the way back to the child table
and ask a basic question: Why is there is an integer
surrogate key there? Integer surrogate keys are
useful in many situations, but contrary to what the
web generation learned, they are not some kind of
required approach in relational databases.
Consider: we need a unique constraint on CONTACTID+EMAILADDRESS
anyway, so we have to justify why we would add a new
column that does not add value. The reflex answer tends
to be "because they join faster" but that ignores the fact
that if you use the natural key of CONTACTID+EMAILADDRESS,
and put these columns into child tables, you do not need
to join at all! If we use the surrogate key and embed
it in child tables, then getting the CONCTACT information
forces two joins: through EMAILADDRESS to CONTACTS. But if
we use the natural key of CONTACTID + EMAILADDRESS we
already have the contact id which saves a JOIN when we
are after CONTACTS details, and, unless we want to know
something like LABEL, we do not have to JOIN back to
EMAILADDRESSES at all.
Conclusion
Well that's it. As promised, we have a few suggestions
and a lot of questions for Mr. Thrasher. Check back in the
coming days to see how the various questions work themselves
out in the comments.
No comments:
Post a Comment