A database is not much without the applications
that connect to it, and one of the most important
factors that affects the application's performance
is how it retrieves data from queries. In this essay
we are going to see the effect of round trips
on application performance.
This blog has two tables of contents, the
Complete Table of Contents and the list
of
Database Skills.
Pulling 15,000 Rows
The test will pull 15,000 rows from a
table. We do it three different ways and see
which is faster and by how much.
Getting a Lot of Rows
The script below creates a table and puts 1 million
rows into it. We want far more rows in the table than
we will actually pull so that we can pull fresh rows
on every pass through the test. It is deliberately crafted to spread
out the adjacent values of the integer primary key.
This is because, inasmuch as can control what is
going on, we want
every single row to be on a different page, so that
in all tests the cost of retrieving the row is roughly
the same and we are measuring only the effect of our
retrieval methods.
The script can be run without modification in pgAdmin3,
and with slight mods on MS SQL Server.
create table test000 (
intpk int primary key
,filler char(40)
)
-- BLOCK 1, first 5000 rows
-- pgAdmin3: run as pgScript
-- All others: modify as required
--
declare @x,@y;
set @x = 1;
set @y = string(40,40,1);
while @x <= 5000 begin
insert into test000 (intpk,filler)
values ((@x-1)*200 +1,'@y');
set @x = @x + 1;
end
-- BLOCK 2, put 5000 rows aside
--
select * into test000_temp from test000
-- BLOCK 3, Insert the 5000 rows 199 more
-- times to get 1million altogether
-- pgAdmin3: run as pgScript
-- All others: modify as required
--
declare @x;
set @x = 1;
while @x <= 199 begin
insert into test000 (intpk,filler)
select intpk+@x,filler from test000_temp;
set @x = @x + 1;
end
Test 1: The Naive Code
The simplest code is a straight loop that
pulls 15,000 consecutive rows by sending
an explicit query for each one.
# Make a database connection
$dbConn = pg_connect("dbname=roundTrips user=postgres");
# Program 1, Individual explicit fetches
$x1 = rand(0,199)*5000 + 1;
$x2 = $x1 + 14999;
echo "\nTest 1, using $x1 to $x2";
$timeBegin = microtime(true);
while ($x1++ <= $x2) {
$dbResult = pg_exec("select * from test000 where intpk=$x1");
$row = pg_fetch_array($dbResult);
}
$elapsed = microtime(true)-$timeBegin;
echo "\nTest 1, elapsed time: ".$elapsed;
echo "\n";
Test 2: Prepared Statements
The next command asks the server to prepare a
statement, but it still makes 15,000 round trips,
executing the prepared statement with a new parameter
each time. The code looks like this:
# Make a database connection
$dbConn = pg_connect("dbname=roundTrips user=postgres");
# Program 2, Individual fetches with prepared statements
$x1 = rand(0,199)*5000 + 1;
$x2 = $x1 + 14999;
echo "\nTest 2, using $x1 to $x2";
$timeBegin = microtime(true);
$dbResult = pg_prepare("test000","select * from test000 where intpk=$1");
while ($x1++ <= $x2) {
$pqResult = pg_execute("test000",array($x1));
$row = pg_fetch_all($pqResult);
}
$elapsed = microtime(true)-$timeBegin;
echo "\nTest 2, elapsed time: ".$elapsed;
echo "\n";
Test 3: A single round trip
This time we issue a single command to retrieve
15,000 rows, then we pull them all down in one
shot.
# Make a database connection
$dbConn = pg_connect("dbname=roundTrips user=postgres");
# Program 3, One fetch, pull all rows
$timeBegin = microtime(true);
$x1 = rand(0,199)*5000 + 1;
$x2 = $x1 + 14999;
echo "\nTest 3, using $x1 to $x2";
$dbResult = pg_exec(
"select * from test000 where intpk between $x1 and $x2"
);
$allRows = pg_fetch_all($dbResult);
$elapsed = microtime(true)-$timeBegin;
echo "\nTest 3, elapsed time: ".$elapsed;
echo "\n";
Results
I ran this five times in a row, and this is what I got:
Naive 15,000 | Prepared 15,000 | One Round Trip |
~1.800 seconds | ~1.150 seconds | ~0.045 seconds |
Compared to the naive example, the set-oriented
fetch of al 15,000 rows in a single shot ran
40 times faster. This is what set-oriented
code does for an application.
While the prepared statement option ran faster than
the naive option, the
set oriented example still ran 25 times faster
than the repeated prepared statements.
I also re-arranged the order of the tests, and
the results were the same.
Does Server or Language Matter?
So this test was done using PHP against PostgreSQL,
will other servers and client languages get different
results? Given the same hardware, a different client
language or server is going to have a different spread
but the shape will be the same. Fetching all rows in
a single shot beats the living frack out of round trips
inside of loops in any client language against any
server.
Putting It Into Use
The most obvious conclusion is that any query
returning more than 1 row should return all rows
as a set. The advantage is so stark with large
row counts that it is worthwhile making this the
default for our applications, unless we can find
a very good reason not to. So what would the
objections be?
One objection might go something like, "Ken, I
see the numbers, but I know my app very well and
we never pull more than 10-20 rows in a pop. I
cannot imagine how it would matter at 10-20 rows,
and I do not want to recode." This makes sense
so I ran a few more
tests with 20 and 100 rows, and found that, on
my hardware, you need about 100 rows to see a
difference. At 20 rows all three are neck-in-neck
and at 100 the set is pulling 4 times faster than
the prepared statement and 6 times faster than the
naive statement. So the conclusion is not an
absolute after all, some judgment is in order.
Another thing to consider is how many simultaneous
reads and writes might be going on at any given
time. If your system is known to have
simultaneous transactions running regularly, then the
complete fetch may be a good idea even if you do some
tests for best-guess row count and the tests are inconclusive.
The reason is that the test is a single user case,
but multiple simultaneous users put a strain on
the database, even when they are not accessing the same
tables. In this case we want the application to
play the "good citizen" and get in and out as quickly
as possible to reduce strain on the server, which will
improve the performance of the entire application, not
just the portions optimized for complete fetches.
Another objection might be, "Well, my code needs to
pull from multiple tables, so I cannot really do this.
When we do -PROCESS-X- we go row by row and need to pull
from multiple tables for each row." In this case
you *definitely* need to go set oriented and pull all
associated quantities down in a query with a JOIN or two.
Consider this, if on your particular hardware the ratio
of naive row-by-row to single fetch is 10, and you must
pull from 2 other tables for each row, that means you are
really running 30 times slower (ratio is 10 x 3 reads)
than you could be.
A Final Note About PHP, Data Structures, and Frameworks
Back when dinosaurs ruled the Earth and there was
no internet (outside of Universities, etc),
the languages we used had specialized data structures
that were tuned to database use. Compared to those
older systems the newer languages born on the
internet are more or less starving for such a
data structure.
PHP gets by fairly well because its associative
array can be used as a passive (non object-oriented)
data structure that comes pretty close to what we had
before.
I bring this up because the choice of a language and
its support for a "fetch all" operation obviously
impacts how well the conclusions of this post can
be implemented. If your mapping tool has an iterator
that absolves you of all knowledge of what is going
on under the hood, it may be worthwhile to see if it
is doing a complete fetch or a row-by-row.
No comments:
Post a Comment