Database performance tuning is a pretty strong interest of mine, and I’ve been lucky enough to get to work on some really cool database-intensive projects in my time at Atlantic BT. I’ll be the first to admit that I’m still not an expert on the topic and probably won’t be for some time. But having said that, I’ve had to pick up on at least a few things to get by, and I’d like to share some of those over the course of my next few blog posts.
Two Common, But Easily Correctable Problems
Two of the most common problems I’ve seen in databases are a lack of primary and foreign key constraints and poorly typed data. These problems can and will lead to performance issues, and even worse, bad data. Luckily, these are also two of the easiest problems to correct.
Keys
If you want to increase the performance of your database, then here’s a great starting point. Make sure that your tables have keys. EVERY table should have at least a primary key. But you shouldn’t stop there; you should also have foreign key constraints on many of your tables (keep in mind we call them “relational databases” for a reason – so create some relationships between the tables!). This has two benefits.
The first is the preservation of data integrity. Consider a simple example for a moment – two tables, Members and Address. The Members table has 2 columns, Id and MemberName. The Address table has a MemberId column and all the expected fields for addresses. Let’s suppose there is an undetected bug in an application that updates or inserts to these tables. Now think about the application that would consume the data. What happens when two members have the same ID? What about when the MemberId in the Address table doesn’t correspond to an actual Id in the Members table?
In software development, bugs are an unfortunate reality. Having strict key constraints in your database will add an additional degree of safety to your applications. In our example above, the Members table should have the Id column marked as a primary key and the Address table should have it’s MemberId column marked as a foreign key to the Member table’s Id field.
The second benefit is increased performance. Foreign keys will help the optimizer establish relationships between tables and primary keys will auto-generate a clustered index on the table. I’ll touch on this in more detail later in this post as well as in my next post but these both offer gains in performance.
Data Types
Another common problem is poorly typed data. An extreme example of this is every column in a table being labeled as varchar(max) and allowing nulls, but lesser offenses can still cause problems. Pay attention to the type, max length, and nullability of your columns and make wise decisions in setting these properties. In general, your decisions should demonstrate simple logic – use varchar(2) instead of varchar(max) if you’re storing State abbreviations, use int instead of varchar if you’re storing a number, etc.
An Example
Both of these concepts are easy to demonstrate. If you have access to a testing environment, run the following script to create a couple of test tables.
CREATE TABLE [dbo].[WidgetDescription] ( [Id] [varchar](255) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL, [ItemName] [varchar](255) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL, [Description] varchar(255) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL ) GO CREATE TABLE [dbo].[WidgetCount] ( [Id] [varchar](255) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL, [WidgetId] [varchar](255) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL, [NumberProduced] varchar(255) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL ) DECLARE @i int SET @i = 0 WHILE @i < 100000 BEGIN INSERT INTO [WidgetDescription] (Id, ItemName, Description) VALUES (cast(@i as varchar), 'Widget' + cast(@i as varchar), 'Description of the Widget') INSERT INTO [WidgetCount] (Id, WidgetId, NumberProduced) VALUES (cast(@i as varchar), cast(@i as varchar), cast(@i as varchar)) SET @i = @i + 1 END
We’ve created 2 (poorly designed) tables that contain information about widgets. Now let’s examine some simple queries. Note that I’m using STATISTICS TIME to monitor the run-time of the queries. To see this information, click the Messages tab in the output when the query is complete.
SET STATISTICS TIME ON SELECT NumberProduced FROM WidgetCount WHERE WidgetId = 1 SET STATISTICS TIME OFF SET STATISTICS TIME ON SELECT Description FROM WidgetDescription WHERE Id = 1 SET STATISTICS TIME OFF
My average run-time for the queries is in the 24 ms and 26 ms range, respectively.
Now let’s take some simple measures to improve the tables and see the difference it makes. First we’ll just make some simple data-type changes to the WidgetCount table. Our columns are meant to represent numbers, so they should all be of type int. For the WidgetDescription table, we’ll make similar data type changes and add a primary key. Make the Id column an int, set it to Identity with auto-increment, and make it a primary key. Shorten the ItemName column to only allow 25 characters and shorten the Description column to allow 100. (In a production environment, we would also set up a foreign key constraint between the WidgetId column of the WidgetCount table and the Id column of the WidgetDescription table, but this isn’t necessary for this example.)
Now run the queries again, and you should see run-times averaging around 8 ms and 1 ms respectively. As you can see, changing the data types boosted performance, and adding a primary key additionally boosted performance. If a difference of a few milliseconds doesn’t seem like a big deal, try thinking of it this way. In the second query, making a few simple changes to our table resulted in a performance boost of 2500%. That is indeed a big deal. It might not be so noticeable when dealing with milliseconds, but when you bring a query that takes 10 seconds to run down to under 1, it will be.
So how do we explain this boost? Two things. As for the data types, ints use less disc space for storage than varchars, and varchar(25) and varchar(100) use less than varchar(255). So there is simply less physical data to actually inspect and return. But the bigger difference is the primary key. Without the key, the optimizer can’t know that the values in the Id column are unique. So even though it finds a match after examining one row of data, it has to look at the other 99,999 to make sure there isn’t another match. However, once we add the key constraint, the optimizer knows it can stop looking after it finds a single match. The optimizer is doing a table scan (slow) in one instance and an index seek (fast) in the other. Again, I’ll cover this in more detail in my next post.
To summarize, I’ve outlined two very easy rules to follow to ensure that your databases are well designed. You should also start to see some performance boosts as a result of enforcing these rules. In my next post, I’ll show you how to examine the optimizer’s execution plan to create indexes on your tables in order to further improve your database’s performance.