AS Dev Database | Developer Drain Brain

December 2, 2009

How to fragment your database

Filed under: Development — Tags: AS Dev Database — rcomian @ 12:19 pm

One thing that can affect an index’s performance is fragmentation.

Fragmented indexes are just inefficient indexes where the storage for them has been scattered about the storage area, meaning that the engine can’t use any read-ahead optimisations to yank in the data quickly, and resorts to flitting the read heads around the disk like a humming bird on speed.

Clustered indexes are the indexes that determine the order that data is stored on disk. If your clustered index is fragmented, not only is the read head flitting around to load in the index, it’s also flitting around to load in the data.

In this way, fragmented indexes can kill performance on any media that has a seek time. If the data and the index can be kept in ram, then it’s not too much of a problem, but before the data’s cached, or once your workloads become larger and hits a rotating disk, you’re out of luck. Fragmented indexes are also larger than non-fragmented indexes as they contain a lot of wasted space, which increases the chance that you’ll need to hit the disk.

There’s a very efficient way to fragment your clustered index. When you create a clustered index, what you’re saying is “This is the order that data is going to be stored in”. If you insert data in that order, then it just gets continuously appended to the data store in a very fast and efficient operation. If you don’t insert data in that order, then you’re going to fragment your index.

Mis-ordered data will force the engine to go back into the nice clean store and split one of the nice storage pages into 2 to make room for the new data. The new page goes at the end of the storage, meaning that the index now has 1 page out of order. If you’re inserting a lot of mis-ordered data, then rinse and repeat. A single page could get filled and split hundreds or thousands of times, littering pages all over the store.

You can give yourself some headroom for these events by using a fill factor with your indexes. This tells the engine to only fill pages so full (say 80%), before starting a new page, in this way giving you space to insert mis-ordered data at a later date without creating a new page.

This is fine if data is mainly ordered and sometimes comes in disordered, but it won’t particularly help if you’re inserting too much mis-ordered data, as it will just spill over the page regardless of how much headroom you leave.

One example is Guids. Using a random guid as an identifier is ok, but as a clustered index column it’s a nightmare. Sequential Guids generated by the db are ok, but if they’re assigned by the client, or with the NEWID() function, you’re basically inserting random data into your ordered index. This can’t end well. You may do this without realising it, since Management Studio to to make your primary key into a clustered index by default, so if you’ve got a Guid as a primary key, chances are it’s also a clustered index.

This is a SQL script to demonstrate fragmentation on clustered guid indexes:

CREATE TABLE dbo.GuidTable ( GuidColumn uniqueidentifier NOT NULL, SomeData int NULL )
GO
ALTER TABLE dbo.GuidTable ADD CONSTRAINT PK_GuidTable PRIMARY KEY CLUSTERED ( GuidColumn )
GO

SELECT * FROM sys.dm_db_index_physical_stats (db_id(), object_id('GuidTable'), NULL, NULL, 'DETAILED')

SET NOCOUNT ON

DECLARE @Counter INT
SET @Counter = 0
WHILE @Counter < 1000000
BEGIN

 SET @Counter = @Counter + 1

 INSERT INTO GuidTable (GuidColumn, SomeData) VALUES (NEWID(), @Counter)
END

SELECT * FROM sys.dm_db_index_physical_stats (db_id(), object_id('GuidTable'), NULL, NULL, 'DETAILED')
GO

ALTER INDEX PK_GuidTable ON GuidTable REBUILD WITH (ONLINE = ON)

SELECT * FROM sys.dm_db_index_physical_stats (db_id(), object_id('GuidTable'), NULL, NULL, 'DETAILED')
GO

DROP TABLE GuidTable
GO

This script should give 3 resultsets as output. These are reports into the fragmentation at 3 points – after table creation, after data loading and after defragmentation.

You should see that after loading the fragmentation is up over 98%. After defragmentation, it’s down to around 1%.

In general, it’s good housekeeping to keep your indexes at low defragmentation. Obviously, if the fragmentation doesn’t affect the performance of the database your working with, then it’s a waste of time worrying about it. But do be aware of it, because it’s possible that if your workload increases, you could reach a tipping point where the suddenly fragmentation does matter, and matter bigtime.

We can always defrag indexes, the script above shows how to do this in Sql2005+. But prevention is usually better than cure, so think about what you’re saying when you create a clustered index. Think about the order that data actually gets inserted in and whether that could have an effect the performance as a whole, not just the time to insert the data.

There are many other scenarios where fragmentation can occur. For example, if you’re loading in sales data from remote shops on a nightly job, you may think it sensible to order that data by the purchase time. But if you load in one shop’s data, then the next, then the next, even if you order each shop’s transactions, the times may look like this as they get loaded in: 9:01, 10:30, 14:35, 8:45, 10:52, 18:16, 9:11, 9:16, 14:30 – data that is very effectively mis-ordered. This may not be as big a problem as Guids, as you might be able to determine a fill factor that actually gives you enough space to load in the shops data without fragmenting too much, but again, it’s something to be aware of when designing your databases.