Why GUIDs Are a Bad Choice for SQL Server Key and Identity Columns
I've spent many years debating with my fellow developers across all manor of subjects (I do love to talk), but one subject that comes up time and time again is the usage of the UNIQUEIDENTIFIER data type in SQL Server; especially when they're used as identity and key columns. In fact, I see this more often than you would expect, and misconfigured UNIQUEIDENTIFIER columns can create "hidden" problems that can be difficult to discover and / or rectify, depending on the SQL experience throughout your team.
If you care to ask around your colleagues, it's almost guaranteed that you'll get multiple conflicting opinions on why you should or should not use GUIDs in SQL Server, but few developers actually realise the impact of such a design choice. In the interests of science, I'll try and keep this as factual as possible, and I'll focus on the topic at hand, rather than GUIDs / UUIDs in general.
Here's a list of common reasons for using a GUID in your database design:
So here's a list of reasons - with code examples and supporting facts - as to why GUIDs are a bad design choice when it comes to SQL Server identity and key columns. I'm using primitive benchmarking techniques, but it still should give you a good idea of what you're up against - I will be inserting 1 million records into each table I test, and each table will have a single column with the data type and / or indices that are being tested.
First off, create a database - I've called mine "test", cos that's how I roll.
And before I begin my assault, let's write a bit of SQL to compare the three data types:
So, here we're inserting 1 million records with their default values into 4 separate tables. I've written two tests for the UNIQUEIDENTIFIER data type - one to test Identity creation using NEWID() and NEWSEQUENTIALID() respectively. The results are predictable, but interesting. I've added a little bit extra formatting for clarity:
So, on to the facts. INT columns have a numerical range of -2^31 to 2^31, and store only 4 bytes. BIGINT columns have a numerical range of -2^63 to 2^63, taking up 8 bytes. UNIQUEIDENTIFIER columns can have 2^122 possible combinations, but store a whopping 16 bytes; this results in noticeably larger table sizes when using a UNIQUEIDENTIFIER over the recommended key data types.
You'll have probably noticed that MyGuid is larger than MyGuidSeq; this is because SQL Server tables with a clustered index are stored in the indices physical order, something which I was unaware of until I posted this stack overflow question! I actually thought that clustered indices were a separate entity, stored along side the table.
So then, to make it űberfair, let's take the MyGuidSeq table and work out how much more storage is uses than the MyBigInt table: 24,986 - 16,904 = 8,064Kb extra storage per 1 million rows at its best, 21,824Kb at it's worst. Not much on its own, but consider that a table is likely to be much larger than that, and using the column as a foreign key in other tables means your storage requirements will rapidly expand.
Index fragmentation occurs when a new page is inserted into an existing index, and the index does not have space for the page. Pages can be inserted or moved around for lots of different reasons, but the most common are inserting new records, and updating indexed columns in existing records.
The following SQL will give us stats on all four tables:
And here are the results:
So why does NEWID() fragment the index so much? Well, NEWID() produces a pseudo-random GUID which is pretty much unpredictable, and 100% non-sequential. By their very nature, non-sequential rows are inserted into the middle of an index, whereas sequential rows are generally appended to the end of an index. When an insertion takes place, SQL Server will move pages around so that the new rows will fit in the space allocated to that particular page. If there's no space available in the page - or a new page won't fit in the space available - SQL Server will create the page at the end of the index. This causes index fragmentation.
Let's just take a very quick look at how quickly the MyGuid table gets to 99% fragmentation:
Output:
99% fragmentation after 23495 records
So, without anything else going on with the table, we can insert 23,495 records before the table is fully fragmented. As it stands, this doesn't mean too much - as you'd want to avoid getting to 99% fragmentation anyway (by managing any indices that become more than 5% fragmented - but that's a whole other article). So, I also ran the above test with 5% and 30% fragmentation values, and both got to 324 records before becoming fragmented. Regardless of the reasons for this, I think you'll agree that having to perform index maintenance after 324 new records is pretty unreasonable.
We haven't even touched upon the differences between Internal and External fragmentation, CLUSTERED and NON CLUSTERED indices, and how to set up index specific properties such as fill factors. There is a lot more to database design than initially presents itself; I remember learning all about normal form at college, which - whilst important - doesn't prepare you for all the funky under-the-hood mechanics that goes on in the real world.
I won't go into any more detail here, as there are hundreds of articles out there explaining fragmentation and at what point you should reorganize or rebuild an index. A while back I also wrote a script to assist with automated index management which I still find quite handy, though it should be noted that the more complex your database, the more specific you should be about individual indices.
The only thing that's left for me to say on index fragmentation is this: if you absolutely must use a UNIQUEIDENTIFIER as your primary key (you don't, btw), then ALWAYS use NEWSEQUENTIALID() over NEWID(). However, if your reason for using a GUID is that you want a unique identifier that's difficult to guess, then you're shit out of luck. Unfortunately NEWSEQUENTIALID() generated identifiers are predictable.
NEWSEQUENTIALID() also inherently helps to keep indices contiguous - filling data and index pages fully before creating a new page. However, it stands to reason that - due to the fact NEWSEQUENTIALID() is reset when Windows is restarted (and can subsequently start at a lower GUID seed value than the last seed) - that the use of NEWSEQUENTIALID() can still insert pages into the middle of an index, fragmenting the index quicker than you otherwise would with an INT or BIGINT.
Also worth mentioning - if you ever update a UNIQUEIDENTIFIER column that is indexed, even if you use NEWSEQUENTIALID(), it will increase fragmentation quicker over time. Saying that, the same would apply to an INT or BIGINT column, as the new value is likely to create a new page in the index.
Back with me? Good... so, what we're doing here is generating some primary key values in three tables (the ones suffixed with Join) as we did before. After that, we're randomly inserting some of those foreign key values into each source table, so we can perform lookups and joins against them. The inefficient bit is getting the random foreign key value from the Join tables - so for the int and bigint tables, I've just generated a random value between 1 and 1,000,000 (as we can guess the PKs for those tables). However, because I'm unable to re-generate the GUID values out of thin air (like I did with the int and bigint keys), I have to retrieve a random row from the MySeqGuidJoin table directly. I've used the most efficient method I could muster (generating a checksum for each row and only processing the top 1% of records), which is nearly as efficient as the random number generation, but a lot more cumbersome. I initially used the ORDER BY NEWID() method to get random values, but this is way too slow and inefficient.
Ok, so let's just check the fragmentation on the newly created tables as before to make sure we're getting similar results. I won't paste the SQL again (you can re-use the sys.dm_db_index_physical_stats statements above and apply the new table names):
So, things are looking good to start with. Low fragmentation on all the indices, and we have 1 million records in each of the 6 tables for us to test with.
First the join speed tests. We're going to return all 1,000,000 records in each source table (MyInt, MyBigInt and MyGuidSeq) and time the results.
And the results:
So, for a table that contains 1 million records, int and bigint perform similarly. UNIQUEIDENTIFIER increases the lookup by around 0.8 seconds (I ran this a few times and the results were consistent). This isn't the end of the world, but this still shows that GUID lookups are slower than both INT and BIGINT; and remember, these tables have no other columns, constraints, indices or keys - just a single primary key (and its clustered index), and a foreign key constraint.
I ran the same tests above with 10,000,000 rows to see what impact it would have on larger data sets. Here are the results - suffice to say that my dev PC was out of order for the afternoon whilst it generated over 1.5GB of data:
So, about 11 seconds difference - which makes it a bit more obvious that GUID keys are slower than bigint, and int. These numbers include the time it takes to retrieve the data from the server - which probably makes this specific test a bit unfair.
These times may not seem like much, but yet again, it will be further affected by larger table sizes, larger indices, more complex queries.
OK. This is a bit of a daft point to make, but it's still valid in my sick little mind. Let's start with the facts: there can be 2^122 GUID possibilities (that's 5,316,911,983,139,663,491,615,228,241,121,400,000 - a number I would struggle to say out loud, let alone ever expect to reach in terms of the number of rows in a table). This makes it extremely unlikely you'll ever collide with another GUID, but it is still possible, and as a defensive programmer, I try my best to cater for any and all possibilities. The more GUIDs you generate, the more likely it is you'll collide with an already generated GUID.
Using NEWSEQUENTIALID() gets around the uniqueness issue, although theoretically, it could make things even worse. The more often you restart the SQL box, the more likely it is that you'll get a conflict - and if you do, it won't be possible to insert another record until you restart the server. If you get a conflict in this situation, it's because the conflicted GUID, plus the range of sequential GUIDs generated after the conflicted GUID, have already been generated.
The possibility of a GUID conflict is insignificant. For this reason, most developers and database designers will largely ignore uniqueness as a concern when considering GUIDs as primary keys. It's a fair point, and not one I argue against very often - but I can tell you that I'm weirdly looking forward to the day I see a GUID conflict, so I can make notes, screenshot it and use it as physical proof that I was right all along! :P
So, I'm looping 100,000,000 times for each test and timing the results:
100,000,000 records is a fair amount to test with, but it still shows that UNIQUEIDENTIFIER comparisons are slower than INT.
Both SELECT statements above return the same results in this instance, but they wouldn't 100% of the time. It is extremely handy having both @@IDENTITY and SCOPE_IDENTITY() immediately available for INT and BIGINT columns, even if you don't use them both that often (you should use SCOPE_IDENTITY() most of the time).
We already know that a UNIQUEIDENTIFIER column cannot have an IDENTITY specification, and that @@IDENTITY and SCOPE_IDENTITY() both return INT or BIGINT, but what happens when you want the most recently inserted DEFAULT value, such as the NEWSEQUENTIALID() default defined below?
In the above instance, a simple SELECT will do the trick, but when you have multiple rows and multiple columns, it becomes impossible. When using NEWSEQUENTIALGUID() you can't rely on an ORDER BY Id DESC clause to get you the latest value, because if the server restarts, it's entirely possible for the new DEFAULT seed value to start below the last seed value.
There is a solution, but it is (in my opinion) cumbersome, more intensive (if only marginally), and only caters for the current scope:
So, we CAN get the newly inserted GUID value, but it's not pretty.
Say, for example, you store a GUID as an SSIS package variable - first of all, you have to store it as a string. While this is all well and good, it's when you come to compare the variable in an SSIS expression (such as inside an Execute SQL Task or OLE DB Source) that it starts to become difficult. Take the following expression:
[CustomerId] == @[User::CustomerIdVariable]
CustomerId is a UNIQUEIDENTIFIER column (DT_GUID in SSIS), and CustomerIdVariable is (surprise, surprise) a string variable storing a GUID. One would expect an implicit cast to take place, but noooooo... instead, we get error after error until we stumble upon the correct cast, conversion and comparison. So, let's try casting CustomerIdVariable to a DT_GUID:
[CustomerId] == (DT_GUID)@[User::CustomerIdVariable]
Nope, doesn't like that... after a bit of Googling, you'll eventually discover that you should wrap any GUID data in curly brackets in order to work with it:
[CustomerId] == (DT_GUID)("{" + @[User::CustomerIdVariable] + "}")
But that doesn't work either:
(DT_WSTR,36)[CustomerId] == @[User::CustomerIdVariable]
But guess what? This STILL doesn't work... the actual solution is to to convert the UNIQUEIDENTIFIER column to a string, and wrap the variable in curly brackets, as such:
(DT_WSTR,36)[CustomerId] == ("{" + @[User::CustomerIdVariable] + "}")
And I'm not going to even get into non-unicode conversions.
This is an issue with SSIS rather than the use of a UNIQUEIDENTIFIER in your table, but it's still worth bearing in mind when designing your DB. If all of your primary and foreign keys are GUIDs, your BI guys aren't gonna thank you for it.
If you genuinely need a unique and unpredictable GUID for any reason, then simply create a separate UNIQUEIDENTIFIER column in your table, but don't index it or use it as a foreign key:
If you want to assure that UniqueKey is actually Unique, then add a Unique constraint to the column. However, be aware that by creating a UNIQUE constraint, you are also creating an index - which will need to be set up and maintained appropriately. The main thing we're trying to avoid here is having a fragmented clustered index.
INT and BIGINT outperform UNIQUEIDENTIFIER in every test I've performed here (and I challenge you to come up with a test where they don't) - some results are marginal, others are more obvious, but this speaks volumes. Your column keys should always be as small and efficient as possible, therefore using a 16 byte UNIQUEIDENTIFIER is counter-productive when a perfectly good 4 byte INT is sat there waiting for you.
Remember that the performance tests I've written here are only rough. Ideally each of these tests should be executed on a server a number of times and the results aggregated, as it's very likely that other applications on my dev machine are affecting SQL Server's I/O, CPU and memory usage. However, I think they serve to demonstrate my concerns well.
Finally, if you're considering using a GUID as a table key, ask yourself the following questions:
http://www.sqlskills.com/blogs/kimberly/disk-space-is-cheap/
http://www.sqlskills.com/blogs/kimberly/guids-as-primary-keys-andor-the-clustering-key/
http://www.codinghorror.com/blog/2007/03/primary-keys-ids-versus-guids.html
If you care to ask around your colleagues, it's almost guaranteed that you'll get multiple conflicting opinions on why you should or should not use GUIDs in SQL Server, but few developers actually realise the impact of such a design choice. In the interests of science, I'll try and keep this as factual as possible, and I'll focus on the topic at hand, rather than GUIDs / UUIDs in general.
Here's a list of common reasons for using a GUID in your database design:
- You require an unpredictable seed for your records as it is exposed directly via some kind of public interface
- You need a globally unique identifier for the record.
- You have more than 2^31 (2,147,483,647) records in your table
- You have more than 2^63 (9,223,372,036,854,775,807) records in your table
- You use GUIDs all the time / you prefer GUIDs as they're a more secure solution
So here's a list of reasons - with code examples and supporting facts - as to why GUIDs are a bad design choice when it comes to SQL Server identity and key columns. I'm using primitive benchmarking techniques, but it still should give you a good idea of what you're up against - I will be inserting 1 million records into each table I test, and each table will have a single column with the data type and / or indices that are being tested.
First off, create a database - I've called mine "test", cos that's how I roll.
CREATE DATABASE test;
And before I begin my assault, let's write a bit of SQL to compare the three data types:
USE test; GO SET NOCOUNT ON GO IF OBJECT_ID('MyInt') IS NOT NULL DROP TABLE MyInt GO IF OBJECT_ID('MyBigInt') IS NOT NULL DROP TABLE MyBigInt GO IF OBJECT_ID('MyGuid') IS NOT NULL DROP TABLE MyGuid GO IF OBJECT_ID('MyGuidSeq') IS NOT NULL DROP TABLE MyGuidSeq GO CREATE TABLE MyInt ( ID INT PRIMARY KEY IDENTITY ) GO CREATE TABLE MyBigInt ( ID BIGINT PRIMARY KEY IDENTITY ) GO CREATE TABLE MyGuid ( ID UNIQUEIDENTIFIER PRIMARY KEY DEFAULT NEWID() ) GO CREATE TABLE MyGuidSeq ( ID UNIQUEIDENTIFIER PRIMARY KEY DEFAULT NEWSEQUENTIALID() ) GO DECLARE @StartTime DATETIME2, @Endtime DATETIME2, @i INT, @noRecords INT SET @noRecords = 1000000 -- MyInt SET @i = 1 SET @StartTime = GETDATE() WHILE (@i <= @noRecords) BEGIN INSERT INTO MyInt DEFAULT VALUES SET @i = @i + 1 END SET @EndTime = GETDATE() PRINT 'MyInt Time Taken: ' + CONVERT(VARCHAR, DATEDIFF(ms, @StartTime, @EndTime)) + 'ms' -- MyBigInt SET @i = 1 SET @StartTime = GETDATE() WHILE (@i <= @noRecords) BEGIN INSERT INTO MyBigInt DEFAULT VALUES SET @i = @i + 1 END SET @EndTime = GETDATE() PRINT 'MyBigInt Time Taken: ' + CONVERT(VARCHAR, DATEDIFF(ms, @StartTime, @EndTime)) + 'ms' -- MyGuid SET @i = 1 SET @StartTime = GETDATE() WHILE (@i <= @noRecords) BEGIN INSERT INTO MyGuid DEFAULT VALUES SET @i = @i + 1 END SET @EndTime = GETDATE() PRINT 'MyGuid Time Taken: ' + CONVERT(VARCHAR, DATEDIFF(ms, @StartTime, @EndTime)) + 'ms' -- MyGuidSeq SET @i = 1 SET @StartTime = GETDATE() WHILE (@i <= @noRecords) BEGIN INSERT INTO MyGuidSeq DEFAULT VALUES SET @i = @i + 1 END SET @EndTime = GETDATE() PRINT 'MyGuidSeq Time Taken: ' + CONVERT(VARCHAR, DATEDIFF(ms, @StartTime, @EndTime)) + 'ms' exec sp_spaceused 'MyInt' exec sp_spaceused 'MyBigInt' exec sp_spaceused 'MyGuid' exec sp_spaceused 'MyGuidSeq'
So, here we're inserting 1 million records with their default values into 4 separate tables. I've written two tests for the UNIQUEIDENTIFIER data type - one to test Identity creation using NEWID() and NEWSEQUENTIALID() respectively. The results are predictable, but interesting. I've added a little bit extra formatting for clarity:
MyInt Time Taken: 105,080ms MyBigInt Time Taken: 104,380ms MyGuid Time Taken: 159,670ms MyGuidSeq Time Taken: 102,950ms
Table name No. Rows Reserved Space Actual space Index Size Unused Space MyInt 1,000,000 12,936 KB 12,864 KB 56 KB 16 KB MyBigInt 1,000,000 16,904 KB 16,808 KB 88 KB 8 KB MyGuid 1,000,000 34,760 KB 34,552 KB 160 KB 48 KB MyGuidSeq 1,000,000 24,968 KB 24,768 KB 176 KB 24 KBAnd now the critique...
1. Table Storage Size
"Disk Space is Cheap!" I hear you cry. Indeed it is, but why use more than you have to? This is probably the most contentious (but probably least important) issue surrounding the use of GUIDs in SQL Server, mainly because it's easy to have an opinion on it. I think it's safe to say that most developers understand how basic disk allocation works - but often enterprise storage systems are a lot more complex than your developer box.So, on to the facts. INT columns have a numerical range of -2^31 to 2^31, and store only 4 bytes. BIGINT columns have a numerical range of -2^63 to 2^63, taking up 8 bytes. UNIQUEIDENTIFIER columns can have 2^122 possible combinations, but store a whopping 16 bytes; this results in noticeably larger table sizes when using a UNIQUEIDENTIFIER over the recommended key data types.
You'll have probably noticed that MyGuid is larger than MyGuidSeq; this is because SQL Server tables with a clustered index are stored in the indices physical order, something which I was unaware of until I posted this stack overflow question! I actually thought that clustered indices were a separate entity, stored along side the table.
So then, to make it űberfair, let's take the MyGuidSeq table and work out how much more storage is uses than the MyBigInt table: 24,986 - 16,904 = 8,064Kb extra storage per 1 million rows at its best, 21,824Kb at it's worst. Not much on its own, but consider that a table is likely to be much larger than that, and using the column as a foreign key in other tables means your storage requirements will rapidly expand.
2. Index Size
The above table also helps when comparing the size of UNIQUEIDENTIFIER, INT and BIGINT based indices. It's obvious for reasons already mentioned that a UNIQUEIDENTIFIER will use more space than an INT or BIGINT, and whilst the numbers in the Index Size column above are tiny for now, their sizes will increase considerably when more columns are included or referenced in the index. You can also see that both UNIQUEIDENTIFIER columns use roughly double or more space than even a BIGINT; the size of an index has repercussions, which I will address later in this article.3. Index Fragmentation
The fragmentation of an SQL Server index often goes unchecked - in fact, most developers that work with databases don't even know that indices can get fragmented, yet it can often be the first place to look when you run into sudden performance issues in your software.Index fragmentation occurs when a new page is inserted into an existing index, and the index does not have space for the page. Pages can be inserted or moved around for lots of different reasons, but the most common are inserting new records, and updating indexed columns in existing records.
The following SQL will give us stats on all four tables:
SELECT I.name, ROUND(S.avg_fragmentation_in_percent, 2) AS FragmentationPercent, i.fill_factor, s.page_count FROM sys.dm_db_index_physical_stats (DB_ID('test'), OBJECT_ID('MyInt'), NULL, NULL, NULL) S INNER JOIN sys.indices I (nolock) ON ( S.object_id = I.object_id AND S.index_id = I.index_id ) UNION SELECT I.name, ROUND(S.avg_fragmentation_in_percent, 2) AS FragmentationPercent, i.fill_factor, s.page_count FROM sys.dm_db_index_physical_stats (DB_ID('test'), OBJECT_ID('MyBigInt'), NULL, NULL, NULL) S INNER JOIN sys.indices I (nolock) ON ( S.object_id = I.object_id AND S.index_id = I.index_id ) UNION SELECT I.name, ROUND(S.avg_fragmentation_in_percent, 2) AS FragmentationPercent, i.fill_factor, s.page_count FROM sys.dm_db_index_physical_stats (DB_ID('test'), OBJECT_ID('MyGuid'), NULL, NULL, NULL) S INNER JOIN sys.indices I (nolock) ON ( S.object_id = I.object_id AND S.index_id = I.index_id ) UNION SELECT I.name, ROUND(S.avg_fragmentation_in_percent, 2) AS FragmentationPercent, i.fill_factor, s.page_count FROM sys.dm_db_index_physical_stats (DB_ID('test'), OBJECT_ID('MyGuidSeq'), NULL, NULL, NULL) S INNER JOIN sys.indices I (nolock) ON ( S.object_id = I.object_id AND S.index_id = I.index_id )
And here are the results:
name FragmentationPercent fill_factor page_count PK__MyGuidSe__3214EC276477ECF3 0.71 0 3096 PK__MyGuid__3214EC275FB337D6 99.08 0 4343 PK__MyInt__3214EC275812160E 0.44 0 1608 PK__MyBigInt__3214EC275BE2A6F2 0.48 0 2101There's a tiny amount of fragmentation on the INT and BIGINT tables, and a little bit more for UNIQUEIDENTIFIER when NEWSEQUENTIALID() is used, but as you can see, the MyGuid table (which uses NEWID() as the column default value) is almost 100% fragmented. This is bad, in case you haven't been following. Fragmentation slows down the retrieval of data from the table, meaning the SQL query engine has more work to do when it needs to scan the index.
So why does NEWID() fragment the index so much? Well, NEWID() produces a pseudo-random GUID which is pretty much unpredictable, and 100% non-sequential. By their very nature, non-sequential rows are inserted into the middle of an index, whereas sequential rows are generally appended to the end of an index. When an insertion takes place, SQL Server will move pages around so that the new rows will fit in the space allocated to that particular page. If there's no space available in the page - or a new page won't fit in the space available - SQL Server will create the page at the end of the index. This causes index fragmentation.
Let's just take a very quick look at how quickly the MyGuid table gets to 99% fragmentation:
USE test; GO SET NOCOUNT ON GO IF OBJECT_ID('MyGuid') IS NOT NULL DROP TABLE MyGuid GO CREATE TABLE MyGuid ( ID UNIQUEIDENTIFIER PRIMARY KEY DEFAULT NEWID() ) DECLARE @StartTime DATETIME2, @Endtime DATETIME2, @i INT, @noRecords INT DECLARE @Fragmentation DECIMAL(4,2) SET @noRecords = 1000000 -- MyGuid SET @i = 1 SET @StartTime = GETDATE() WHILE (@i <= @noRecords) BEGIN INSERT INTO MyGuid DEFAULT VALUES SELECT @Fragmentation = ROUND(S.avg_fragmentation_in_percent, 2) FROM sys.dm_db_index_physical_stats (DB_ID('test'), OBJECT_ID('MyGuid'), NULL, NULL, NULL) S IF(@Fragmentation > 99) BEGIN PRINT '99% fragmentation after ' + CONVERT(VARCHAR, @i) + ' records' BREAK END SET @i = @i + 1 END SET @EndTime = GETDATE() PRINT 'MyGuid Time Taken: ' + CONVERT(VARCHAR, DATEDIFF(ms, @StartTime, @EndTime)) + 'ms' exec sp_spaceused 'MyGuid', true
Output:
99% fragmentation after 23495 records
So, without anything else going on with the table, we can insert 23,495 records before the table is fully fragmented. As it stands, this doesn't mean too much - as you'd want to avoid getting to 99% fragmentation anyway (by managing any indices that become more than 5% fragmented - but that's a whole other article). So, I also ran the above test with 5% and 30% fragmentation values, and both got to 324 records before becoming fragmented. Regardless of the reasons for this, I think you'll agree that having to perform index maintenance after 324 new records is pretty unreasonable.
We haven't even touched upon the differences between Internal and External fragmentation, CLUSTERED and NON CLUSTERED indices, and how to set up index specific properties such as fill factors. There is a lot more to database design than initially presents itself; I remember learning all about normal form at college, which - whilst important - doesn't prepare you for all the funky under-the-hood mechanics that goes on in the real world.
I won't go into any more detail here, as there are hundreds of articles out there explaining fragmentation and at what point you should reorganize or rebuild an index. A while back I also wrote a script to assist with automated index management which I still find quite handy, though it should be noted that the more complex your database, the more specific you should be about individual indices.
The only thing that's left for me to say on index fragmentation is this: if you absolutely must use a UNIQUEIDENTIFIER as your primary key (you don't, btw), then ALWAYS use NEWSEQUENTIALID() over NEWID(). However, if your reason for using a GUID is that you want a unique identifier that's difficult to guess, then you're shit out of luck. Unfortunately NEWSEQUENTIALID() generated identifiers are predictable.
NEWSEQUENTIALID() also inherently helps to keep indices contiguous - filling data and index pages fully before creating a new page. However, it stands to reason that - due to the fact NEWSEQUENTIALID() is reset when Windows is restarted (and can subsequently start at a lower GUID seed value than the last seed) - that the use of NEWSEQUENTIALID() can still insert pages into the middle of an index, fragmenting the index quicker than you otherwise would with an INT or BIGINT.
Also worth mentioning - if you ever update a UNIQUEIDENTIFIER column that is indexed, even if you use NEWSEQUENTIALID(), it will increase fragmentation quicker over time. Saying that, the same would apply to an INT or BIGINT column, as the new value is likely to create a new page in the index.
4. Join and Index Lookup Speed
Ignoring the NEWID() tables I created earlier (I think it's safe to say that I have already provided enough evidence that NEWID() is not suitable for use in a default constraint on a key column), we can now take a look at how quickly joins and lookups perform across the data types. If you're coding along with me, the below SQL will take a while (we're generating and inserting 6 million rows, and doing 3 million individual lookups - I would NOT recommend doing anything even remotely like this on a production server). Set it running and go and make a brew, hug your wife, feed your cat, or snuggle your guinea pigs (you have guinea pigs, right?). This takes around 11 minutes to run on my development PC:USE test; GO SET NOCOUNT ON GO IF OBJECT_ID('MyInt') IS NOT NULL DROP TABLE MyInt GO IF OBJECT_ID('MyIntJoin') IS NOT NULL DROP TABLE MyIntJoin GO IF OBJECT_ID('MyBigInt') IS NOT NULL DROP TABLE MyBigInt GO IF OBJECT_ID('MyBigIntJoin') IS NOT NULL DROP TABLE MyBigIntJoin GO IF OBJECT_ID('MyGuidSeq') IS NOT NULL DROP TABLE MyGuidSeq GO IF OBJECT_ID('MyGuidSeqJoin') IS NOT NULL DROP TABLE MyGuidSeqJoin GO CREATE TABLE MyIntJoin ( ID INT PRIMARY KEY IDENTITY ) GO CREATE TABLE MyBigIntJoin ( ID BIGINT PRIMARY KEY IDENTITY ) GO CREATE TABLE MyGuidSeqJoin ( ID UNIQUEIDENTIFIER PRIMARY KEY DEFAULT NEWSEQUENTIALID() ) GO CREATE TABLE MyInt ( ID INT PRIMARY KEY IDENTITY, fk INT FOREIGN KEY REFERENCES MyIntJoin(ID) ) GO CREATE TABLE MyBigInt ( ID BIGINT PRIMARY KEY IDENTITY, fk BIGINT FOREIGN KEY REFERENCES MyBigIntJoin(ID) ) GO CREATE TABLE MyGuidSeq ( ID UNIQUEIDENTIFIER PRIMARY KEY DEFAULT NEWSEQUENTIALID(), fk UNIQUEIDENTIFIER FOREIGN KEY REFERENCES MyGuidSeqJoin(ID) ) GO DECLARE @StartTime DATETIME2, @Endtime DATETIME2, @i INT, @noRecords INT DECLARE @fkInt INT, @fkBigInt BIGINT, @fkGuid UNIQUEIDENTIFIER SET @noRecords = 1000000 -- Join tables -- MyInt SET @i = 1 WHILE (@i <= @noRecords) BEGIN INSERT INTO MyIntJoin DEFAULT VALUES SET @i = @i + 1 END -- MyBigInt SET @i = 1 WHILE (@i <= @noRecords) BEGIN INSERT INTO MyBigIntJoin DEFAULT VALUES SET @i = @i + 1 END -- MyGuidSeq SET @i = 1 WHILE (@i <= @noRecords) BEGIN INSERT INTO MyGuidSeqJoin DEFAULT VALUES SET @i = @i + 1 END -- Source Tables -- MyInt SET @i = 1 WHILE (@i <= @noRecords) BEGIN -- Just get a random ID between 1 and @noRecords SELECT @fkInt = ABS(CAST(CAST(NEWID() AS VARBINARY) AS int)) % @noRecords INSERT INTO MyInt(fk) VALUES(@fkInt) SET @i = @i + 1 END -- MyBigInt SET @i = 1 WHILE (@i <= @noRecords) BEGIN -- Just get a random ID between 1 and @noRecords SELECT @fkBigInt = ABS(CAST(CAST(NEWID() AS VARBINARY) AS int)) % @noRecords INSERT INTO MyBigInt(fk) VALUES(@fkBigInt) SET @i = @i + 1 END -- MyGuidSeq SET @i = 1 WHILE (@i <= @noRecords) BEGIN -- Getting a random row efficiently is a bit more complex SELECT TOP 1 @fkGuid = ID FROM MyGuidSeqJoin WHERE (ABS(CAST((BINARY_CHECKSUM(id, NEWID())) as int)) % 100) < 1 -- Only process 1% INSERT INTO MyGuidSeq(fk) VALUES(@fkGuid) SET @i = @i + 1 END
Back with me? Good... so, what we're doing here is generating some primary key values in three tables (the ones suffixed with Join) as we did before. After that, we're randomly inserting some of those foreign key values into each source table, so we can perform lookups and joins against them. The inefficient bit is getting the random foreign key value from the Join tables - so for the int and bigint tables, I've just generated a random value between 1 and 1,000,000 (as we can guess the PKs for those tables). However, because I'm unable to re-generate the GUID values out of thin air (like I did with the int and bigint keys), I have to retrieve a random row from the MySeqGuidJoin table directly. I've used the most efficient method I could muster (generating a checksum for each row and only processing the top 1% of records), which is nearly as efficient as the random number generation, but a lot more cumbersome. I initially used the ORDER BY NEWID() method to get random values, but this is way too slow and inefficient.
Ok, so let's just check the fragmentation on the newly created tables as before to make sure we're getting similar results. I won't paste the SQL again (you can re-use the sys.dm_db_index_physical_stats statements above and apply the new table names):
name FragmentationPercent fill_factor page_count PK__MyInt__3214EC272F9A1060 0.38 0 2101 PK__MyIntJoi__3214EC272334397B 0.37 0 1608 PK__MyGuidSe__3214EC2739237A9A 0.67 0 5077 PK__MyBigInt__3214EC272704CA5F 0.57 0 2101 PK__MyGuidSe__3214EC272AD55B43 0.71 0 3096 PK__MyBigInt__3214EC27345EC57D 0.45 0 3096
So, things are looking good to start with. Low fragmentation on all the indices, and we have 1 million records in each of the 6 tables for us to test with.
First the join speed tests. We're going to return all 1,000,000 records in each source table (MyInt, MyBigInt and MyGuidSeq) and time the results.
USE test; GO SET NOCOUNT ON GO IF OBJECT_ID('MyInt') IS NOT NULL DROP TABLE MyInt GO IF OBJECT_ID('MyIntJoin') IS NOT NULL DROP TABLE MyIntJoin GO IF OBJECT_ID('MyBigInt') IS NOT NULL DROP TABLE MyBigInt GO IF OBJECT_ID('MyBigIntJoin') IS NOT NULL DROP TABLE MyBigIntJoin GO IF OBJECT_ID('MyGuidSeq') IS NOT NULL DROP TABLE MyGuidSeq GO IF OBJECT_ID('MyGuidSeqJoin') IS NOT NULL DROP TABLE MyGuidSeqJoin GO CREATE TABLE MyIntJoin ( ID INT PRIMARY KEY IDENTITY ) GO CREATE TABLE MyBigIntJoin ( ID BIGINT PRIMARY KEY IDENTITY ) GO CREATE TABLE MyGuidSeqJoin ( ID UNIQUEIDENTIFIER PRIMARY KEY DEFAULT NEWSEQUENTIALID() ) GO CREATE TABLE MyInt ( ID INT PRIMARY KEY IDENTITY, fk INT FOREIGN KEY REFERENCES MyIntJoin(ID) ) GO CREATE TABLE MyBigInt ( ID BIGINT PRIMARY KEY IDENTITY, fk BIGINT FOREIGN KEY REFERENCES MyBigIntJoin(ID) ) GO CREATE TABLE MyGuidSeq ( ID UNIQUEIDENTIFIER PRIMARY KEY DEFAULT NEWSEQUENTIALID(), fk UNIQUEIDENTIFIER FOREIGN KEY REFERENCES MyGuidSeqJoin(ID) ) GO DECLARE @i INT, @noRecords INT DECLARE @fkInt INT, @fkBigInt BIGINT, @fkGuid UNIQUEIDENTIFIER SET @noRecords = 10000000 -- Join tables -- MyInt SET @i = 1 WHILE (@i <= @noRecords) BEGIN INSERT INTO MyIntJoin DEFAULT VALUES SET @i = @i + 1 END -- MyBigInt SET @i = 1 WHILE (@i <= @noRecords) BEGIN INSERT INTO MyBigIntJoin DEFAULT VALUES SET @i = @i + 1 END -- MyGuidSeq SET @i = 1 WHILE (@i <= @noRecords) BEGIN INSERT INTO MyGuidSeqJoin DEFAULT VALUES SET @i = @i + 1 END -- Source Tables -- MyInt SET @i = 1 WHILE (@i <= @noRecords) BEGIN -- Just get a random ID between 1 and @noRecords SELECT @fkInt = (ABS(CAST(CAST(NEWID() AS VARBINARY) AS int)) % @noRecords) + 1 INSERT INTO MyInt(fk) VALUES(@fkInt) SET @i = @i + 1 END -- MyBigInt SET @i = 1 WHILE (@i <= @noRecords) BEGIN -- Just get a random ID between 1 and @noRecords SELECT @fkBigInt = (ABS(CAST(CAST(NEWID() AS VARBINARY) AS int)) % @noRecords) + 1 INSERT INTO MyBigInt(fk) VALUES(@fkBigInt) SET @i = @i + 1 END -- MyGuidSeq SET @i = 1 WHILE (@i <= @noRecords) BEGIN -- Getting a random row efficiently is a bit more complex SELECT TOP 1 @fkGuid = ID FROM MyGuidSeqJoin WHERE (ABS(CAST((BINARY_CHECKSUM(id, NEWID())) as int)) % 100) < 1 -- Only process 1% INSERT INTO MyGuidSeq(fk) VALUES(@fkGuid) SET @i = @i + 1 END
And the results:
MyInt Time Taken: 4,727ms MyBigInt Time Taken: 4,812ms MyGuidSeq Time Taken: 5,687ms
So, for a table that contains 1 million records, int and bigint perform similarly. UNIQUEIDENTIFIER increases the lookup by around 0.8 seconds (I ran this a few times and the results were consistent). This isn't the end of the world, but this still shows that GUID lookups are slower than both INT and BIGINT; and remember, these tables have no other columns, constraints, indices or keys - just a single primary key (and its clustered index), and a foreign key constraint.
I ran the same tests above with 10,000,000 rows to see what impact it would have on larger data sets. Here are the results - suffice to say that my dev PC was out of order for the afternoon whilst it generated over 1.5GB of data:
MyInt Time Taken: 47,357ms MyBigInt Time Taken: 48,970ms MyGuidSeq Time Taken: 58,940ms
So, about 11 seconds difference - which makes it a bit more obvious that GUID keys are slower than bigint, and int. These numbers include the time it takes to retrieve the data from the server - which probably makes this specific test a bit unfair.
These times may not seem like much, but yet again, it will be further affected by larger table sizes, larger indices, more complex queries.
4. Uniqueness
7895D7B4-6402-4552-A7CE-D3B48C5256CC
Apparrently, no one has ever seen the above string before now...
OK. This is a bit of a daft point to make, but it's still valid in my sick little mind. Let's start with the facts: there can be 2^122 GUID possibilities (that's 5,316,911,983,139,663,491,615,228,241,121,400,000 - a number I would struggle to say out loud, let alone ever expect to reach in terms of the number of rows in a table). This makes it extremely unlikely you'll ever collide with another GUID, but it is still possible, and as a defensive programmer, I try my best to cater for any and all possibilities. The more GUIDs you generate, the more likely it is you'll collide with an already generated GUID.
Using NEWSEQUENTIALID() gets around the uniqueness issue, although theoretically, it could make things even worse. The more often you restart the SQL box, the more likely it is that you'll get a conflict - and if you do, it won't be possible to insert another record until you restart the server. If you get a conflict in this situation, it's because the conflicted GUID, plus the range of sequential GUIDs generated after the conflicted GUID, have already been generated.
The possibility of a GUID conflict is insignificant. For this reason, most developers and database designers will largely ignore uniqueness as a concern when considering GUIDs as primary keys. It's a fair point, and not one I argue against very often - but I can tell you that I'm weirdly looking forward to the day I see a GUID conflict, so I can make notes, screenshot it and use it as physical proof that I was right all along! :P
6. Type Comparison
I decided to perform a little test on how SQL deals with comparisons between data types. I wanted to compare two INT variables, an INT and a VARCHAR, two UNIQUEIDENTIFIERs, then a UNIQUEIDENTIFIER and a VARCHAR. The only way I could think to reasonably test this was the following:USE test; GO DECLARE @noTests INT, @i INT DECLARE @StartTime DATETIME2, @EndTime DATETIME2 DECLARE @int1 INT, @int2 INT, @int3 VARCHAR(10) DECLARE @guid1 UNIQUEIDENTIFIER, @guid2 UNIQUEIDENTIFIER, @guid3 VARCHAR(36) SET @noTests = 100000000 SET @int1 = 76253421 SET @int2 = 93948585 SET @int3 = '348975764' SET @guid1 = NEWID() SET @guid2 = NEWID() SET @guid3 = NEWID() SET @i = 1 SET @StartTime = GETDATE() WHILE (@i <= @noTests) BEGIN IF(@int1 = @int2) BEGIN PRINT '' -- Will never get called END SET @i = @i + 1 END SET @EndTime = GETDATE() PRINT 'Int1 = Int2 Time Taken: ' + CONVERT(VARCHAR, DATEDIFF(ms, @StartTime, @EndTime)) + 'ms' SET @i = 1 SET @StartTime = GETDATE() WHILE (@i <= @noTests) BEGIN IF(@int1 = @int3) BEGIN PRINT '' -- Will never get called END SET @i = @i + 1 END SET @EndTime = GETDATE() PRINT 'Int1 = Int3 Time Taken: ' + CONVERT(VARCHAR, DATEDIFF(ms, @StartTime, @EndTime)) + 'ms' SET @i = 1 SET @StartTime = GETDATE() WHILE (@i <= @noTests) BEGIN IF(@guid1 = @guid2) BEGIN PRINT '' -- Will never get called END SET @i = @i + 1 END SET @EndTime = GETDATE() PRINT 'Guid1 = Guid2 Time Taken: ' + CONVERT(VARCHAR, DATEDIFF(ms, @StartTime, @EndTime)) + 'ms' SET @i = 1 SET @StartTime = GETDATE() WHILE (@i <= @noTests) BEGIN IF(@guid1 = @guid3) BEGIN PRINT '' -- Will never get called END SET @i = @i + 1 END SET @EndTime = GETDATE() PRINT 'Guid1 = Guid3 Time Taken: ' + CONVERT(VARCHAR, DATEDIFF(ms, @StartTime, @EndTime)) + 'ms'
So, I'm looping 100,000,000 times for each test and timing the results:
Int1 = Int2 Time Taken: 50,043ms Int1 = Int3 Time Taken: 62,137ms Guid1 = Guid2 Time Taken: 52,333ms Guid1 = Guid3 Time Taken: 72,560msThe results clearly show that SQL server deals with strongly typed comparisons similarly - UNIQUEIDENTIFIER to UNIQUEIDENTIFIER comparisons are marginally slower than INT to INT (by 2 seconds), INT to VARCHAR takes considerably longer (12 seconds), and UNIQUEIDENTIFIER to VARCHAR is even slower (22 seconds).
100,000,000 records is a fair amount to test with, but it still shows that UNIQUEIDENTIFIER comparisons are slower than INT.
7. Retrieving recently inserted identity values
So, you've just inserted a record into a table, and you want to get the last generated IDENTITY value for use in your application. If you have any experience with T-SQL, I would imagine you're thinking about @@IDENTIFIER or SCOPE_IDENTITY() right now:INSERT INTO MyInt(fk) VALUES(2) SELECT @@IDENTITY SELECT SCOPE_IDENTITY()
Both SELECT statements above return the same results in this instance, but they wouldn't 100% of the time. It is extremely handy having both @@IDENTITY and SCOPE_IDENTITY() immediately available for INT and BIGINT columns, even if you don't use them both that often (you should use SCOPE_IDENTITY() most of the time).
We already know that a UNIQUEIDENTIFIER column cannot have an IDENTITY specification, and that @@IDENTITY and SCOPE_IDENTITY() both return INT or BIGINT, but what happens when you want the most recently inserted DEFAULT value, such as the NEWSEQUENTIALID() default defined below?
CREATE TABLE IdentityTest ( Id UNIQUEIDENTIFIER DEFAULT NEWSEQUENTIALID() ) SELECT Id FROM IdentityTest
In the above instance, a simple SELECT will do the trick, but when you have multiple rows and multiple columns, it becomes impossible. When using NEWSEQUENTIALGUID() you can't rely on an ORDER BY Id DESC clause to get you the latest value, because if the server restarts, it's entirely possible for the new DEFAULT seed value to start below the last seed value.
There is a solution, but it is (in my opinion) cumbersome, more intensive (if only marginally), and only caters for the current scope:
CREATE TABLE IdentityTest ( Id UNIQUEIDENTIFIER DEFAULT NEWSEQUENTIALID() ) DECLARE @IdentityTest TABLE ( Id UNIQUEIDENTIFIER ) INSERT INTO IdentityTest OUTPUT INSERTED.Id INTO @IdentityTest(ID) DEFAULT VALUES SELECT Id FROM @IdentityTest
So, we CAN get the newly inserted GUID value, but it's not pretty.
8. SSIS / BIDS / SSDT Handling
If you're writing an enterprise level database in SQL Server, it's very likely that at some point, you - or someone else - will run an SSIS package against it. If your primary keys are GUIDs, this creates more work for any BI tasks, as SSIS (both BIDS and SSDT) doesn't deal with GUID data very well.Say, for example, you store a GUID as an SSIS package variable - first of all, you have to store it as a string. While this is all well and good, it's when you come to compare the variable in an SSIS expression (such as inside an Execute SQL Task or OLE DB Source) that it starts to become difficult. Take the following expression:
[CustomerId] == @[User::CustomerIdVariable]
CustomerId is a UNIQUEIDENTIFIER column (DT_GUID in SSIS), and CustomerIdVariable is (surprise, surprise) a string variable storing a GUID. One would expect an implicit cast to take place, but noooooo... instead, we get error after error until we stumble upon the correct cast, conversion and comparison. So, let's try casting CustomerIdVariable to a DT_GUID:
[CustomerId] == (DT_GUID)@[User::CustomerIdVariable]
Nope, doesn't like that... after a bit of Googling, you'll eventually discover that you should wrap any GUID data in curly brackets in order to work with it:
[CustomerId] == (DT_GUID)("{" + @[User::CustomerIdVariable] + "}")
But that doesn't work either:
Error at Process results into database [Create additional columns [158]]: The data type "DT_GUID" cannot be used with binary operator "==". The type of one or both of the operands is not supported for the operation. To perform this operation, one or both operands need to be explicitly cast with a cast operator.
Error at Process results into database [Create additional columns [158]]: Attempt to set the result type of binary operation "CustomerId == (DT_GUID)("{" + @[User::CustomerIdVariable] + "}")" failed with error code 0xC0047081.
Error at Process results into database [Create additional columns [158]]: Computing the expression "[CustomerId] == (DT_GUID)("{" + @[User::CustomerIdVariable] + "}")" failed with error code 0xC0047084. The expression may have errors, such as divide by zero, that cannot be detected at parse time, or there may be an out-of-memory error.
Error at Process results into database [Create additional columns [158]]: The expression "[CustomerId] == (DT_GUID)("{" + @[User::CustomerIdVariable] + "}")" on "output column "MyColumn" (222)" is not valid.
Error at Process results into database [Create additional columns [158]]: Failed to set property "Expression" on "output column "Mycolumn" (222)".Which broadly translates as "Sorry, me no understand what GUID is - do not use a UNIQUEIDENTIFIER column in an expression, ever", without any good explaination as to why, which is nice. Naturally, the next step would be to try and convert the UNIQUEIDENTIFIER column to a string:
(DT_WSTR,36)[CustomerId] == @[User::CustomerIdVariable]
But guess what? This STILL doesn't work... the actual solution is to to convert the UNIQUEIDENTIFIER column to a string, and wrap the variable in curly brackets, as such:
(DT_WSTR,36)[CustomerId] == ("{" + @[User::CustomerIdVariable] + "}")
And I'm not going to even get into non-unicode conversions.
This is an issue with SSIS rather than the use of a UNIQUEIDENTIFIER in your table, but it's still worth bearing in mind when designing your DB. If all of your primary and foreign keys are GUIDs, your BI guys aren't gonna thank you for it.
Summary
I'm not saying GUIDs are useless, or should be totally avoided - I'm just yet to come across a situation where using a GUID as a key column is a better option than an INT or a BIGINT.If you genuinely need a unique and unpredictable GUID for any reason, then simply create a separate UNIQUEIDENTIFIER column in your table, but don't index it or use it as a foreign key:
CREATE TABLE MyTest ( Id INT NOT NULL PRIMARY KEY IDENTITY, UniqueKey UNIQUEIDENTIFIER DEFAULT NEWID() )
If you want to assure that UniqueKey is actually Unique, then add a Unique constraint to the column. However, be aware that by creating a UNIQUE constraint, you are also creating an index - which will need to be set up and maintained appropriately. The main thing we're trying to avoid here is having a fragmented clustered index.
INT and BIGINT outperform UNIQUEIDENTIFIER in every test I've performed here (and I challenge you to come up with a test where they don't) - some results are marginal, others are more obvious, but this speaks volumes. Your column keys should always be as small and efficient as possible, therefore using a 16 byte UNIQUEIDENTIFIER is counter-productive when a perfectly good 4 byte INT is sat there waiting for you.
Remember that the performance tests I've written here are only rough. Ideally each of these tests should be executed on a server a number of times and the results aggregated, as it's very likely that other applications on my dev machine are affecting SQL Server's I/O, CPU and memory usage. However, I think they serve to demonstrate my concerns well.
- Do you really need a globally unique identifier?
- Could the table end up holding millions of rows?
- Could any of the table's foreign references hold millions of rows?
- Would it be better to store UNIQUEIDENTIFIER data in another column, or even another table?
- Does every row need a GUID, or can you generate a GUID ad-hoc?
Further reading
I used some of these links as a reference, but I'm also providing some that offer the "other" side of the argument. One or two are also written by people a lot more knowledgeable than I am when it comes to SQL Server:http://www.sqlskills.com/blogs/kimberly/disk-space-is-cheap/
http://www.sqlskills.com/blogs/kimberly/guids-as-primary-keys-andor-the-clustering-key/
http://www.codinghorror.com/blog/2007/03/primary-keys-ids-versus-guids.html