How to enable Selective XML indexes in SQL Server 2012 SP1

February 11th, 2013

SQL Server 2012 SP1 has shipped a great enhancement to XML: Selective XML Indexes. When properly used these indexes can speed up the searching of XML columns tremendously, at little disk/size cost:

The selective XML index feature lets you promote only certain paths from the XML documents to index. At index creation time, these paths are evaluated, and the nodes that they point to are shredded and stored inside a relational table in SQL Server. This feature uses an efficient mapping algorithm developed by Microsoft Research in collaboration with the SQL Server product team. This algorithm maps the XML nodes to a single relational table, and achieves exceptional performance while requiring only modest storage space.

However, at the time of this post, the documentation on MSDN omits a crucial requirement: the database has to be enabled to support these new indexes. Because is an on-disk format change shipped in a service pack release the engine cannot use the new selective XML indexes unless explicitly allowed. And the database version must change to a version that the RTM SQL Server 2012 does not recognize so that the database is not accidentally attached/restored on an RTM instance of SQL Sever 2012 that would not comprehend the new Selective XML Index objects and would panic (undefined behavior). To enable Selective XML Indexes in the database you must run sp_db_selective_xml_index:

Enables and disables Selective XML Index functionality on a SQL Server database. If called without any parameters, the stored procedure returns 1 if the Selective XML Index is enabled on a particular database.

EXECUTE sys.sp_db_selective_xml_index
@db_name = N'AdventureWorks2012'
, @action = N'true';
GO

Be aware that once upgraded to support Selective XML indexes this database can no longer be attached or restored on an RTM instance of SQL Server 2012. This applies to log shipping, Database Mirroring and AlwaysOn relationships, which will break when this upgrade is performed. If all the partners in the log shipping, DBM session or AG are also upgraded to SQL Server 2012 SP1 you can re-enable the relationship after you enabled Selective XML Indexes.

How to enable and disable a queue using SMO

February 6th, 2013

The SMO object model for SQL Server ServiceQueue does allow one to enable or disable a queue, but the property that modifies the queue status is not intuitive, it is IsEnqueueEnabled:

Gets or sets the Boolean property that specifies whether the queue is enabled.

This property matches the catalog view column is_enqueue_enabled in sys.service_queues but bears little resemblance to the T-SQL statement used to enable or disable a queue: ALTER QUEUE ... WITH STATUS = {ON|OFF}

For example the following SMO code snippet:

            Server server = new Server("...");
            Database db = server.Databases["msdb"];
            ServiceQueue sq = new ServiceQueue(db.ServiceBroker, "foo");
            sq.Create();
            sq.IsEnqueueEnabled = false;
            sq.Alter();
            sq.IsEnqueueEnabled = true;
            sq.Alter();

generates the following T-SQL:

CREATE QUEUE [dbo].[foo];
ALTER QUEUE [dbo].[foo]  WITH STATUS = OFF ...;
ALTER QUEUE [dbo].[foo] WITH STATUS = ON ...;

Case Sensitive collation sort order

November 23rd, 2012

A recent inquiry from one of our front line CSS engineers had me look into how case sensitive collations decide the sort order. Consider a simple question like How should the values 'a 1', 'a 2', 'A 1' and 'A 2' sort?


create table [test] (
	[col] varchar(10)
		collate Latin1_General_CS_AS);
go

insert into [test] ([col]) values
	('a 1'),
	('a 2'),
	('A 1'),
	('A 2');
go

select [col]
from [test]
order by [col];

Here are two possible outputs:

Which one is correct? A programmer will chose the first order: 'a 1', 'a 2', 'A 1', 'A 2'. Because if one would implement a string comparison routine it would compare character by character until a difference is encountered, and 'a' sorts ahead of 'A'. But this answer is wrong. The correct sort order is 'a 1', 'A 1', 'a 2', 'A 2'! And if you ran the query in SQL Server you certainly got the second output. But look again at the sort order and focus on just the first character:

By default, the algorithm makes use of three fully-customizable levels. For the Latin script, these levels correspond roughly to: alphabetic ordering, diacritic ordering, case ordering

So, in a case sensitive collation, is 'a' ahead or after 'A' in the sort order? The images shows them actually interleaved, is 'a', 'A', 'a', 'A'. What’s going on? The answer is that collation sort order is a little more nuanced that just comparing characters until a difference is encountered. This is described in the Unicode Technical Standard #10: UNICODE COLLATION ALGORITHM. And yes, the same algorithm is applied for non-Unicode types (VARCHAR) too. The algorithm actually gives different weight to character differences and case differences, a difference in alphabetic order is more important than one in case order. To compare the sort order of two strings the algorithm is more like the following:

  • Compare every character in case insensitive, accent insensitive manner. If a difference is found, this decides the sort order. If no difference is found, continue.
  • Compare every character in case insensitive, accent sensitive manner. If a difference is found, this decides the sort order. If no difference is found, continue.
  • Compare every character in case sensitive manner (we already know from the step above there is no accent difference). If a difference is found, this decides the sort order. If no difference is found the strings are equal.

Needless to say the real algorithm does not need to traverse the strings 3 times, but the logic is equivalent to above. And remember that when the strings have different lengths then the comparison expands the shorter string with spaces and compares up to the length of the longest string. Combined with the case sensitivity rules this gives to a somewhat surprising result when using an inequality in a WHERE clause:


select [col]
from [test]
where [col] > 'A'
order by [col];

That’s right, we got back all 4 rows, including those that start with 'a'. This surprises some, but is the correct result. 'a 1' should be in the result, even though 'a' is < 'A'. If you follow the algorithm above: first we expand the shorter string with spaces, so the comparison is between 'a 1' and 'A  '. Then we do the first pass comparison, which is only alphabetic order, case insensitive and accent insensitive, character by character: 'a' and 'A' are equal, ' ' and ' ' are equal, but '1' is > ' '. The comparison stops, we found a alphabetic order difference so 'a 1' > 'A  ', the row qualifies and is included in the result. Ditto for 'a 2'.

Handling exceptions that occur during the RECEIVE statement in activated procedures

October 15th, 2012

The typical SQL Server activation procedure is contains a WHILE (1=1) loop and exit conditions based on checking @@ROWCOUNT. Error handling is done via a BEGIN TRY ... BEGIN CATCH block. This pattern is present in many Service Broker articles on the web, including this web site, in books and in Microsoft samples:

create procedure [<procedure name>]
as
declare @dialog_handle uniqueidentifier
 , @message_type_name sysname
 , @message_body varbinary(max);
set nocount on;

while(1=1)
begin
 begin transaction;
 begin try;
  receive top(1)
   @dialog_handle = conversation_handle
   , @message_type_name = message_type_name
   , @message_body = message_body
  from [<queue name>];
  if @@rowcount = 0
  begin
   rollback;
   break;
  end
  if @message_type_name = N'<my message type>'
  begin
   -- process the message here
                        ...
  end
  else if @message_type_name = N'http://schemas.microsoft.com/SQL/ServiceBroker/Error'
     or @message_type_name = N'http://schemas.microsoft.com/SQL/ServiceBroker/EndDialog'
  begin
   end conversation @dialog_handle;
  end
  commit transaction;
 end try
 begin catch
  declare @error_number int = ERROR_NUMBER()
   , @error_message nvarchar(4000) = ERROR_MESSAGE()
   , @xact_state int = XACT_STATE();
  if @xact_state = -1 or @xact_state = 1
  begin
   rollback;
  end
  -- log the error here
               ....
 end catch
end
go

This patter though contains a problem: it will handle very poorly a disabled queue, and hence it will handle very poorly poison messages.

Error 9617 The service queue “…” is currently disabled

Read the rest of this entry »

Inside the SQL Server 2012 Columnstore Indexes

May 29th, 2012

Columnar storage has established itself as the de-facto option for Business Intelligence (BI) storage. The traditional row-oriented storage of RDBMS was designed for fast single-row oriented OLTP workloads and it has problems handling the large volume range oriented analytical processing that characterizes BI workloads. But what is columnar storage and, more specifically, how does SQL Server 2012 implement columnar storage with the new COLUMNSTORE indexes?

Read the rest of this entry »

What is an LSN: Log Sequence Number

January 17th, 2012

LSNs, or Log Sequence Numbers, are explained on MSDN at Introduction to Log Sequence Numbers:

Every record in the SQL Server transaction log is uniquely identified by a log sequence number (LSN). LSNs are ordered such that if LSN2 is greater than LSN1, the change described by the log record referred to by LSN2 occurred after the change described by the log record LSN.

There are several places where LSNs are exposed. For example sys.database_recovery_status has the columns last_log_backup_lsn and fork_point_lsn, sys.database_mirroring has the mirroring_failover_lsn column and the msdb table backupset contains first_lsn, last_lsn, checkpoint_lsn, database_backup_lsn, fork_point_lsn and differential_base_lsn. Not surprisingly all these places where LSNs are exposed are related to backup and recovery (mirroring is a form of recovery). The LSN is exposed a numeric(25,0) value that can be compared: a bigger LSN number value means a later log sequence number and therefore it can indicate if more log needs to be backed up or recovered.

Yet we can dig deeper. Look at the LSN as exposed by the fn_dblog function:

select * from fn_dblog(null, null);

Current LSN Operation Context Transaction ID ...
00000014:00000061:0001 LOP_BEGIN_XACT LCX_NULL 0000:000001e4
00000014:00000061:0002 LOP_INSERT_ROWS LCX_NULL 0000:000001e4
...
00000014:00000061:02ea LOP_INSERT_ROWS LCX_NULL 0000:000001e4
00000014:000000d9:0001 LOP_INSERT_ROWS LCX_NULL 0000:000001e4
00000014:000000d9:0002 LOP_INSERT_ROWS LCX_NULL 0000:000001e4
...
00000014:00000153:021e LOP_INSERT_ROWS LCX_NULL 0000:000001e4
00000014:00000153:021f LOP_COMMIT_XACT LCX_NULL 0000:000001e4
00000014:000001ab:0001 LOP_BEGIN_XACT LCX_NULL 0000:000001ea
...
00000014:000001ab:01ac LOP_INSERT_ROWS LCX_NULL 0000:000001ea
00000015:00000010:0001 LOP_INSERT_ROWS LCX_NULL 0000:000001ea

The LSN is shown as a three part structure. The first part seems to stay the same (the 14), the middle part apparently has some erratic increases and the last part seems to increase monotonically but it resets back to 0 when the middle part changes. It is much easier to understand what’s happening when you know what the three parts are:

  • the first part is the VLF sequence number.
  • the middle part is the offset to the log block
  • the last part is the slot number inside the log block

Virtual Log files

SQL Server manages the log by splitting into regions called VLFs, Virtual Log Files, as explained in the Transaction Log Physical Architecture. We can inspect the VLFs of a database using DBCC LOGINFO:

FileId FileSize    StartOffset  FSeqNo  Status      Parity CreateLSN
------ ----------- -------------------------------- ------ -----------------
2      253952      8192         20      2           64     0
2      253952      262144       21      2           64     0
2      270336      516096       22      2           64     20000000021700747
2      262144      786432       23      2           64     21000000013600747
2      262144      1048576      24      2           64     22000000024900748
...
2      393216      16973824     75      2           64     74000000013600748
2      393216      17367040     0       0           0      74000000013600748
2      393216      17760256     0       0           0      74000000013600748
2      524288      18153472     0       0           0      74000000013600748

(59 row(s) affected)

So back to our fn_dblog result: the Current LSN value of 00000014:00000061:0001 means the following log record:

  • the VLF sequence number 0×14 (20 decimal)
  • in the log block starting at offset 0×61 in the VLF (measured in units of 512 bytes)
  • the slot number 1

From the DBCC LOGINFO output we know that the VLF with sequence number 0×14 is starting at offset 8192 in the LDF file and has a size of 253952. It is the first VLF in the log file. the next LSNs increase the slot number until the LSN 00000014:00000061:02ea and then the LSN changes to 00000014:000000d9:0001. This means that the block that starts at offset 0×61 has filled up and the next block, from offset 0xd9, started to be used. If we do the math, 0xd9-0×61 is 0×78 (or 120 decimal), which is exactly 60Kb. You may remember that the maximum SQL Server log block size is 60Kb.

Our transaction continued to insert records until the LSN 00000014:00000153:021e then at the next LSN we have a commit operation (LOP_COMMIT_XACT). The next LSN, which is for the next transaction start, is at LSN 00000014:000001ab:0001. What does this tell us? The transaction terminated with the INSERT operation at LSN 00000014:00000153:021e and then it issued a COMMIT. The commit operation generated one more log record, the one at LSN 00000014:00000153:021f and then asked for the commit LSN to be hardened. This causes the log block containing this LSN (the log block starting at offset 0×153 in the VFL 0×14) to be closed and written to disk (along with any other not-yet-written log-block that is ahead of this block). The next block can start immediately after this block, so the next operation will have the LSN 00000014:000001ab:0001. If we do the math we can see that 0x1ab-0×153 is 0×58 (decimal 88) so this last log block had 44Kb.

I hope by now is clear what happened next in the fn_dblogoutput I shown: the transaction continues to insert records generating more LSNs. The entry at 00000014:000001ab:01ac is not only the last one in the current log block, but is also the last one in the current VLF. The next LSN is 00000015:00000010:0001, the LSN at slot 1 in the first log block (offset 0×10) of the VLF with sequence number 0×15.

Decimal LSNs

What about those LSN numbers like the 20000000021700747 CreateLSN value shown in DBCC LOGINFO output? They are still three part LSNs, but the values are represented in decimal. 20000000021700747 would be the LSN 00000014:000000d9:02eb.

Log flush waits

Look at the following fn_dblog output:

...
00000016:0000003c:0001  LOP_BEGIN_XACT      LCX_NULL    0000:0000057f
00000016:0000003c:0002  LOP_INSERT_ROWS     LCX_HEAP    0000:0000057f
00000016:0000003c:0003  LOP_COMMIT_XACT     LCX_NULL    0000:0000057f
00000016:0000003d:0001  LOP_BEGIN_XACT      LCX_NULL    0000:00000580
00000016:0000003d:0002  LOP_INSERT_ROWS     LCX_HEAP    0000:00000580
00000016:0000003d:0003  LOP_COMMIT_XACT     LCX_NULL    0000:00000580
00000016:0000003e:0001  LOP_BEGIN_XACT      LCX_NULL    0000:00000581
00000016:0000003e:0002  LOP_INSERT_ROWS     LCX_HEAP    0000:00000581
00000016:0000003e:0003  LOP_COMMIT_XACT     LCX_NULL    0000:00000581
00000016:0000003f:0001  LOP_BEGIN_XACT      LCX_NULL    0000:00000582
00000016:0000003f:0002  LOP_INSERT_ROWS     LCX_HEAP    0000:00000582
00000016:0000003f:0003  LOP_COMMIT_XACT     LCX_NULL    0000:00000582
...

Can you spot the problem? Notice how the LSN slot number stays at at 1,2,3 and the log block number grows very frequently: 3c, 3d, 3e, 3f… This is the log signature of a sequence of single row inserts that each committed individually. After each INSERT the application had to wait for the log block to be flushed to disk. The logs blocks were all of minimal size, 512 bytes.

Here is how I generated this log:

set nocount on;
declare @i int = 0;
while @i < 1000
begin
	insert into t (filler) values ('A');
	set @i += 1;
end
go

Had the application used a batch commit it would had solved several issues:

  • only wait for one or few commits to flush the log.
  • write less log, since every transaction generates an LOP_BEGIN_XACT and an LOP_COMMIT_XACT (they add up!).
  • write the operations in few large IO requests as opposed to a lot of small IO requests.

So lets add batch commits to our test script:

set nocount on;
declare @i int = 0;
begin transaction
while @i < 1000
begin
	insert into t (filler) values ('A');
	set @i += 1;
	if @i % 100 = 0
	begin
		commit;
		begin transaction;
	end
end
commit
go

and then lets look at the log:

00000016:0000011d:0001  LOP_BEGIN_XACT     LCX_NULL   0000:000005d9
00000016:0000011d:0002  LOP_INSERT_ROWS    LCX_HEAP   0000:000005d9
00000016:0000011d:0003  LOP_INSERT_ROWS    LCX_HEAP   0000:000005d9
...
00000016:0000011d:006f  LOP_INSERT_ROWS    LCX_HEAP   0000:000005d9
00000016:0000011d:0070  LOP_COMMIT_XACT    LCX_NULL   0000:000005d9
00000016:00000136:0001  LOP_BEGIN_XACT     LCX_NULL   0000:000005db
00000016:00000136:0002  LOP_INSERT_ROWS    LCX_HEAP   0000:000005db
...
00000016:00000136:0064  LOP_INSERT_ROWS    LCX_HEAP   0000:000005db
00000016:00000136:0065  LOP_INSERT_ROWS    LCX_HEAP   0000:000005db
00000016:00000136:0066  LOP_COMMIT_XACT    LCX_NULL   0000:000005db
00000016:0000014d:0001  LOP_BEGIN_XACT     LCX_NULL   0000:000005dc
00000016:0000014d:0002  LOP_INSERT_ROWS    LCX_HEAP   0000:000005dc
00000016:0000014d:0003  LOP_INSERT_ROWS    LCX_HEAP   0000:000005dc
...
00000016:0000014d:0066  LOP_INSERT_ROWS    LCX_HEAP   0000:000005dc
00000016:0000014d:0067  LOP_COMMIT_XACT    LCX_NULL   0000:000005dc

We can see how the batch commit has created fewer, larger, log blocks. The application had to wait fewer times for the log to harden, and on each wait it issued a larger IO request. The log blocks are also more densely filled with LOP_INSERT_ROWS operations and do not have the overhead of having to log the LOP_BEGIN_XACT/LOP_COMMIT_XACT for every row inserted.

Conclusion

This little example shows not only how to understand the LSN structure, but it also shows how to read into the fn_dblog output. I also wanted to show the typical log signature of a commit operation: the LOP_COMMIT_XACT operation is recorded in the log and the log block is closed and flushed to disk. The next LSN will usually have slot 1 in the next block. The last example even shows how reading the log can spot potential application performance problems.

SQL Server table columns under the hood

October 20th, 2011

You probably can easily answer a question like ‘What columns does this table have?’. Whether you use the SSMS object explorer, or sp_help, or you query sys.column, the answer is fairly easy to find. But what is I ask ‘What are the physical columns of this table?’. Huh? Is there any difference? Lets see.

At the logical layer tables have exactly the structure you declare it in your CREATE TABLE statement, and perhaps modifications from ALTER TABLE statements. This is the layer at which you can look into sys.columns and see the table structure, or look at the table in SSMS object explorer and so on and so forth. But there is also a lower layer, the physical layer of the storage engine where the table might have surprisingly different structure from what you expect.

Inspecting the physical table structure

To view the physical table structure you must use the undocumented system internals views: sys.system_internals_partitions and sys.system_internals_partition_columns:

select p.index_id, p.partition_number,
	pc.leaf_null_bit,
	coalesce(cx.name, c.name) as column_name,
	pc.partition_column_id,
	pc.max_inrow_length,
	pc.max_length,
	pc.key_ordinal,
	pc.leaf_offset,
	pc.is_nullable,
	pc.is_dropped,
	pc.is_uniqueifier,
	pc.is_sparse,
	pc.is_anti_matter
from sys.system_internals_partitions p
join sys.system_internals_partition_columns pc
	on p.partition_id = pc.partition_id
left join sys.index_columns ic
	on p.object_id = ic.object_id
	and ic.index_id = p.index_id
	and ic.index_column_id = pc.partition_column_id
left join sys.columns c
	on p.object_id = c.object_id
	and ic.column_id = c.column_id
left join sys.columns cx
	on p.object_id = cx.object_id
	and p.index_id in (0,1)
	and pc.partition_column_id = cx.column_id
where p.object_id = object_id('...')
order by index_id, partition_number;

Lets inspect some simple table structures:

create table users (
	user_id int not null identity(1,1),
	first_name varchar(100) null,
	last_name varchar(100) null,
	birth_date datetime null);

Running our query after, of course, we specify object_id('users'):

We can see that the physical structure is very much as we expected: the physical rowset has 4 physical columns, of the expected types and sizes. One thing to notice is that, although the column order is the one we specified, the columns are layout on disk in a different order: the user_id is stored in row at offset 4, followed by the birth_date at offset 8 and then by the two variable length columns (first_name and last_name) that have negative offsets, an indication that they reside in the variable size portion of the row. This should be of no surprise as we know that the row format places all fixed length columns first in the row, ahead of the variable length columns.

Adding a clustered index

Our table right now is a heap, we should make user_id a primary key, and lets check the table structure afterward:

alter table users add constraint pk_users_user_id primary key (user_id);


As we can see, the table has changed from a heap into a clustered index (index_id changed from 0 to 1) and the user_id column has become part of the key.

Non-Unique clustered indexes

Now lets say that we want a different clustered index, perhaps by birth_date (maybe our application requires frequent range scans on this field). We would change the primary key into a non-clustered primary key (remember, the primary key and the clustered index do not have to be the same key) and add a clustered index by the birth_date column:

alter table users drop constraint pk_users_user_id;
create clustered index cdx_users on users (birth_date);
alter table users add constraint
	pk_users_user_id primary key nonclustered  (user_id);


Several changes right there:

  • A new index has appeared (index_id 2), which was expected, this is the non-clustered index that now enforces the primary key constraint on column user_id.
  • The table has a new physical column, with partition_column_id 0. This new column has no name, because it is not visible in the logical table representation (you cannot select it, nor update it). This is an uniqueifier column (is_uniqueifier is 1) because we did not specify that our clustered index in unique:

    If the clustered index is not created with the UNIQUE property, the Database Engine automatically adds a 4-byte uniqueifier column to the table. When it is required, the Database Engine automatically adds a uniqueifier value to a row to make each key unique. This column and its values are used internally and cannot be seen or accessed by users.



    Klaus Aschenbrenner has a good series of articles that explain uniqueifier columns in more detail.

  • The non-clustered index that enforces the primary key constraint has 3 physical columns: user_id and two unnamed ones. This is because each row in the non-clustered index contains the corresponding clustered index row key values, in our case the birth_date column and the uniqueifier column.

Online operations and anti-matter columns

Is the uniqueifier column the only hidden column in a table? No. Lets rebuild our non-clustered index, making sure we specify it to be an online operation:

alter index pk_users_user_id on users rebuild with (online=on);


Notice how the non-clustered index now has one more column, a new column that has the is_antimatter value 1. As you probably guess, this is an anti-matter column. For an in-depth explanation of what is the purpose of the anti-matter column I recommend reading Online Indexing Operations in SQL Server 2005:

During the build phase, rows in the new “in-build” index may be in an intermediate state called antimatter. This mechanism allows concurrent DELETE statements to leave a trace for the index builder transaction to avoid inserting deleted rows. At the end of the index build operation all antimatter rows should be cleared.

Note that even after the online operation finishes and the antimatter rows are removed, the antimatter column will be part of the physical rowset structure.

There could exist more hidden columns in the table, for example if we enable change tracking:

alter table users ENABLE CHANGE_TRACKING;


Enabling change tracking on our table has added one more hidden column.

Table structure changes

Next lets look at some table structure modifying operations: adding, dropping and modifying columns. Were going to start anew with a fresh table for our examples:

create table users  (
	user_id int not null identity(1,1) primary key,
	first_name char(100) null,
	last_name char(100) null,
	birth_date datetime null);
go


Next lets modify some columns: we want to make the birth_date not nullable and reduce the length of the first_name to 75:

alter table users alter column birth_date datetime not null;
alter table users alter column first_name char(75);
go


Notice how the two alter operations ended up in adding a new column each, and dropping the old column. But also note how the null bit and the leaf_offset values have stayed the same. This means that the column was added ‘in place’, replacing the dropped column. This is a metadata only operation that did not modify any record in the table, it simply changed how the data in the existing records is interpreted.

But now we figure out the 75 characters length is wrong and we want to change it back to 100. Also, the birth_date column probably doesn’t need hours, so we can change it to a date type:

alter table users alter column birth_date date not null;
alter table users alter column first_name char(100);
go


The birth_date column has changed type and now it requires only 3 bytes, but the change occurred in-place, just as the nullability change we did before: it remains at the same offset in the record and it has the same null bit. However, the first_name column was moved from offset 8 to offset 211, and the null bit was changed from 4 to 5. Because the first_name column has increased in size it cannot occupy the same space as before in the record and the record has effectively increased to accommodate the new first_name column. This happened despite the fact that the first_name column was originally of size 100 so in theory it could reclaim the old space it used in the record, but the is simply a too corner case for the storage engine to consider.

By now we figure that the fixed length 100 for the first_name and last_name columns was a poor choice, so we would like to change them to more appropriate variable length columns:

alter table users alter column first_name varchar(100);
alter table users alter column last_name varchar(100);
go


The type change from fixed length column to variable length column cannot be done in-place, so the first_name and last_name columns get new null bit values.They also have negative leaf_offset values, which is typical for variable length columns as they don’t occupy fixed positions in the record.

Next lets change the length of the variable columns:

alter table users alter column first_name varchar(250);
alter table users alter column last_name varchar(250);
go

For this change the column length was modified without dropping the column and adding a new one. An increase of size for a variable length columns is one of the operations that is really altering the physical column, not dropping the column and adding a new one to replace it. However, a decrease in size, or a nullability change, does again drop and add the column, as we can quickly check now:

alter table users alter column first_name varchar(100);
alter table users alter column last_name varchar(250) not null;
go


Finally, lets say we tried to add two more fixed length columns, but we we undecided on the name and length so we added a couple of columns, then deleted them and added again a new one:

alter table users add mid_name char(50);
alter table users add surname char(25);
alter table users drop column mid_name;
alter table users drop column surname;

alter table users add middle_name char(100);
go


This case is interesting because it shows how adding a new fixed column can reuse the space left in the row by dropped fixed columns, but only if the dropped columns are the last fixed columns. In our table the columns mid_name and surname where originally added at offset 211 and 261 respectively and their length added up to 75 bytes. After we dropped them, the middle_name column we added is placed at offset 211, thus reusing the space formerly occupied by the dropped columns. This happens even though the length of the newly added column is 100 bytes, bigger than the 75 bytes occupied by the dropped columns before.

By now, our 5 column table has actually 15 columns in the physical storage format, 10 dropped columns and 5 usable columns. The table uses 412 bytes of fixed space for the 3 fixed length columns that have a total length of only 112 bytes. The variable length columns that now store the first_name and last_name are stored in the record after these 412 reserved bytes for fixed columns that are now dropped. Since the records always consume all the reserved fixed size, this is quite wasteful. How do we reclaim it? Rebuild the table:

alter table users rebuild;
go


As you can see the rebuild operation got rid off the dropped columns and now the physical storage layout is compact, aligned with the logical layout. So whenever doing changes to a table structure remember that at the storage layer most changes are cumulative, they are most times implemented by dropping a column and adding a new column back, with the new type/length/nullability. Whenever possible the a modified column reuses the space in the record and newly added columns may reuse space previously used by dropped columns.

Partitioned Tables

When partitioning is taken into account another dimension of the physical table structure is revealed. To start, lets consider a typical partitioned table:

create partition function pf (date) as range for values ('20111001');
go

create partition scheme ps as partition pf all to ([PRIMARY]);
go

create table sales (
	sale_date date not null,
	item_id int not null,
	store_id int not null,
	price money not null)
	on ps(sale_date);
go


As we know, the partitioned tables are, from a physical point of view, a collection of rowsets that belong to the same logical object (the 'table'). Looking under the hood shows that our table has two rowsets, and they have identical column structure. Typically new partitions are added by the ETL process that uploads data into a staging table that gets switched in using a fast partition switch operation:

create table sales_staging (
	sale_date date not null,
	item_id int not null,
	store_id int not null,
	price numeric(10,2) not null,
	constraint check_partition_range
		check (sale_date = '20111002'))
	on [PRIMARY];

alter partition scheme ps next used [PRIMARY];
alter partition function pf() split range ('20111002');
go

alter table sales_staging switch to sales partition $PARTITION.PF('20111002');
go
Msg 4944, Level 16, State 1, Line 2

ALTER TABLE SWITCH statement failed because column 'price' has data type numeric(10,2) in source table 'test.dbo.sales_staging' which is different from its type money in target table 'test.dbo.sales'.

OK, so we made a mistake in the staging table, so lets correct it, then try to switch it in again:

alter table sales_staging alter column price money not null;
alter table sales_staging switch to sales partition $PARTITION.PF('20111002');
go


We see that the partition switch operation has switched in the rowset of the staging table into the second partition of the sales table. But since the staging table had an operation that modified a column type, which in effect has dropped the numeric price columns and added a new price column of the appropriate money type, the second rowset of the partitioned table now has 5 columns, including a dropped one. The partition switch operation brings into the partitioned table the staging table as is, dropped columns and all. What that means is that partitions of the partitioned table can have, under the hood, a completely different physical structure from one another. Normally this should be of little concern and just a courisity to woe the newbies at DBA cocktail parties, but this problem has a darker side, known as KB2504090:

This issue occurs because the accessor that SQL Server uses to insert data into different partitions recognizes the metadata changes incorrectly. When data is inserted into the new partition that is created after a column is dropped, the number of the maximum nullable columns in the new partition may be one less than the number of the maximum nullable columns in the old partition.

If you ever find your partitioned tables not to have an uniform internal structure across all partitions, I recommend you rebuild the partition that is different. This simple query can be used to look at the number of physical columns in each partition of a table:

select count(*) as count_columns,
	index_id,
	partition_number
from sys.system_internals_partitions p
join sys.system_internals_partition_columns pc
	on p.partition_id = pc.partition_id
where p.object_id = object_id('sales')
group by index_id, partition_number;
go


This query shows that partitions 1 and 3 have 4 physical columns, while partition 2 has 5. We can rebuild partition 2:

alter table sales rebuild partition = 2;
go

With this operation we have rebuild the partition number 2 from scratch and removed all dropped columns in the process.

Note that my query above would only detect differences in the number of physical columns, but two partitions can still have a very different physical layout even if they have the same number of physical columns, eg. one can have physical column number 9 dropped and physical column number 10 used, while the other can have the opposite. Use your judgement when looking at the internal physical column layout to understand if they are truly the same.

Understanding Hash, Sort and Exchange Spill events

October 19th, 2011

Certain SQL Server query execution operations are calibrated to perform best by using a (somewhat) large amount of memory as intermediate storage. The Query Optimizer will choose a plan and estimate the cost based on these operators using this memory scratch-pad. But this is, of course, only an estimate. At execution the estimates may prove wrong and the plan must continue despite not having enough memory. In such an event, these operators spill to disk. When a spill occurs, the memory scratch-pad is flushed into tempdb and more data is accommodated in the (now) free memory. When the data flushed to tempdb is needed again, is read from disk. Needless to say, spilling into tempdb is order of magnitude slower than using just the memory scratch-pad. Monitoring for spilling is specially important in ETL jobs, since these occurrences may cause the ETL job to stretch for many more minutes, sometimes even hours. For an exhaustive discussion of ETL, including some references to spills, see The Data Loading Performance Guide.

Hash Warning Event

Hash recursion occurs when the build input does not fit into available memory, resulting in the split of input into multiple partitions that are processed separately. If any of these partitions still do not fit into available memory, it is split into subpartitions, which are also processed separately. This splitting process continues until each partition fits into available memory or until the maximum recursion level is reached (displayed in the IntegerData data column).

This is one of the most common spills. Hash Join is a darling of the Query Optimizer, as it a very fast operator that can join two unsorted sources efficiently, requiring a single pass over each source. Not only that, but it can also be used for duplicate removal (eg. DISTINCT clause) and grouping aggregates. See Understanding Hash Joins for more details.

Unfortunately it also requires a lot of memory. Hash spill occurrences are usually an indication of bad cardinality estimates that are in turn caused, most times, by missing or outdated statistics. The first action to do, when faced with Hash Join spill warnings, is to update (or create) stats on the columns involved. If the problem persists then you must resort to a different join type. Since the optimizer would had already picked a better join if it could, you need to look at the problem from a different angle: how could you help the optimizer to pick a different join, w/o forcing it?

The optimizer would pick a LOOP if it could seek the inner side for a reasonable number of times, so perhaps you need an index on the inner side to allow a seek and/or an index on the outer side that would filter the produced output early so it can reduce the number of probes in the inner side. See Understanding Nested Loops Joins for more details.

The other alternative is a MERGE join, but merge requires the input on both sides to be sorted. Adding an index to each of the sources on both sides of the join that provides the order guarantee would likely woe the optimizer into using the MERGE join. See Understanding Merge Joins for more details.

Sort Warning Event

The Sort Warnings event class indicates that sort operations do not fit into memory. This does not include sort operations involving the creation of indexes, only sort operations within a query (such as an ORDER BY clause used in a SELECT statement).

If a query requires an order guarantee (eg. it has an ORDER BY clause, or it projects a function like ROW_NUMBER) and there is no index to guarantee the order then there is little choice to have: the execution must sort the input before proceeding. If the input is small then the sort occurs in memory and is very cheap, but in this case no spill warning would occur. The Query Optimizer is usually smart to postpone the sorting until all filtering is complete, so if it still spills it usually means there is no more filtering that can be applied. The solution when faced with Sort spills is usually to add a covering index that provides the desired order.

Another case of Sort spill warning is when the Query Optimizer goes creative and adds a sort into a plan that technically does not require an order guarantee. Such events are though very rare and esoteric. The action to take really depends from case to case.

Exchange Spill Event

The Exchange Spill event class indicates that communication buffers in a parallel query plan have been temporarily written to the tempdb database. This occurs rarely and only when a query plan has multiple range scans.

You’ll probably never going to hit this problem. If you do, I recommend you go over the recommendations in the linked article: Exchange Spill Event

Conclusion

Technically there is one more spill class: the spool spill. Since spools are *meant* to spill, the presence of a spool spill is usually less of a concern.

The purpose of this article is to show that there is ample documentation available on MSDN regarding these spill events. The tempdb spills are easily detectable and reasonably explained in the product documentation. Presence of spills may indicate potential performance problems as a spill involves disk reads and writes and is many times slower than the corresponding in-memory-only operation. They also add overload to tempdb and may cause contention.

T-SQL functions do no imply a certain order of execution

August 10th, 2011

Looking at this question on StackOverflow: Conversion failed when converting from a character string to uniqueidentifier error in SQL Server one can see a reproducible example where a string split UDF works fine in the SELECT statement, but it gives a conversion error in the DELETE statement. Certainly, a bug, right? Actually, not.

The issue at hand is in fact remarkably similar to another common misconception around T-SQL I had to debunk some time ago, see On SQL Server boolean operator short-circuit: that C like operator short-circuit is guaranteed in T-SQL (hint: it isn’t, read the linked article to see a clear counter example).

In the StackOverlow post the misconception is that order of declaration implies order of execution, that the function is evaluated somehow separately from the rest of the query and some sort of temporary result is created that is then used in the overall query execution. This understanding comes naturally to the imperative procedural language mindset of developers trained in C, C++, C# and other similar languages. But in the SQL Server declarative language that is T-SQL, your intuition is actually wrong. To illustrate I will give a simple counter-example, reusing the code from my earlier boolean short-circuit article:


create table eav (
    eav_id int identity(1,1) primary key,
    attribute varchar(50) not null,
    is_numeric bit not null,
    [value] sql_variant null);
create index eav_attribute on eav(attribute) include ([value]);
go

-- Fill the table with random values
set nocount on
declare @i int;
select @i = 0;
begin tran
while @i < 100000
begin
    declare @attribute varchar(50),
        @is_numeric bit,
        @value sql_variant;
    select @attribute = 'A' + cast(cast(rand()*1000 as  int) as varchar(3));
    select @is_numeric = case when rand() > 0.5 then 1 else 0 end;
    if 1=@is_numeric
        select @value = cast(rand() * 100 as int);
    else
        select @value = 'Lorem ipsum';
    insert into eav (attribute, is_numeric, [value])
    values (@attribute, @is_numeric, @value);
    select @i = @i+1;
    if (@i % 1000 = 0)
    begin
		commit;
		raiserror(N'Inserted %d', 0,0,@i);
		begin tran;
    end
end
commit
go

-- insert a 'trap'
insert into eav (attribute, is_numeric, [value])
values ('B1', 0, 'Gotch ya');
go

Now we’re going to define our inline table UDF:


create function udf_get_only_numerics()
returns table
with schemabinding
as return
	select [value], [attribute]
	from dbo.eav
	where is_numeric = 1;
go

And now lets innocently query our UDF:


select [value]
	from dbo.udf_get_only_numerics ()
    where attribute = 'B1'
    and cast([value] as int) > 50
go

Since we’re selecting from the UDF, we’re guaranteed that we’re only going to see values that have the is_numeric column value 1, so the cast([value] as int) must always succeed, right? Wrong, we’ll get a conversion exception:


Msg 245, Level 16, State 1, Line 3
Conversion failed when converting the varchar value 'Gotch ya' to data type int.

So what’s going on? Lets insert a known value and inspect the execution plan:


insert into eav (attribute, is_numeric, [value])
values ('B2', 1, 65);
go

select [value]
	from dbo.udf_get_only_numerics ()
    where attribute = 'B2'
    and cast([value] as int) > 50
go

Looking at the plan we see that, just as in my boolean short-circuit example, the presence of an index on attribute that includes the value column but not the is_numeric was too good opportunity to pass for the optimizer. And this plan will evaluate the cast([value]as int) expression before it evaluate the WHERE clause inside the UDF. That’s right, the WHERE clause of the UDF has moved in the execution plan above the WHERE clause of the query using the UDF. The naive assumption that the function definition posses some sort of barrier for execution was proven wrong.

So how does this explains the failure seen in the StackOverflow question? The UDF, which was originally posted on Tally OH! An Improved SQL 8K “CSV Splitter” Function article on sqlservercentral.com, contains WHERE clauses to that in effect restricts the output to only complete tokens. w/o the WHERE clause it would return non-complete token. We can easily test this, lets remove the WHERE clause and use the same input as the test in the StackOverflow question:


CREATE FUNCTION dbo.DelimitedSplit8K_noWhere
--===== Define I/O parameters
        (@pString VARCHAR(8000), @pDelimiter CHAR(1))
RETURNS TABLE WITH SCHEMABINDING AS
 RETURN
--===== "Inline" CTE Driven "Tally Table" produces values from 0 up to 10,000...
     -- enough to cover VARCHAR(8000)
 WITH E1(N) AS (
   SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
   SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
   SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
                ),                          --10E+1 or 10 rows
       E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
       E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
 cteTally(N) AS (
                 SELECT 0 UNION ALL
                 SELECT TOP (DATALENGTH(ISNULL(@pString,1)))
                   ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
                ),
cteStart(N1) AS (
                 SELECT t.N+1
                   FROM cteTally t
                )
 SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY s.N1),
        Item  = SUBSTRING(@pString,s.N1,
                ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,s.N1),0)-s.N1,8000))
   FROM cteStart s;

When we run this the output contains not only the two token in the input, but also every substring of these tokens:


-- Test string that will be split into table in the DelimitedSplit8k function
declare @temp varchar(max) =
     '918E809E-EA7A-44B5-B230-776C42594D91,6F8DBB54-5159-4C22-9B0A-7842464360A5'
select Item from dbo.DelimitedSplit8K_nowhere(@temp, ',');
go
Item
-----------------------------------------
918E809E-EA7A-44B5-B230-776C42594D91
18E809E-EA7A-44B5-B230-776C42594D91
8E809E-EA7A-44B5-B230-776C42594D91
E809E-EA7A-44B5-B230-776C42594D91
809E-EA7A-44B5-B230-776C42594D91
09E-EA7A-44B5-B230-776C42594D91
9E-EA7A-44B5-B230-776C42594D91
E-EA7A-44B5-B230-776C42594D91
-EA7A-44B5-B230-776C42594D91
...

By now the explanation to the issue seen in the StackExchange question should be obvious: just as in my example, the WHERE clause was pulled out from the function and placed in the generated plan above the cast(Item as uniqueidentifier) expression. During execution SQL Server is asked to convert the string ’18E809E-EA7A-44B5-B230-776C42594D91′ to a uniqueidentifier and, understandably, it fails.

Do no make assumptions about order of executions when using inline table UDFs. Q.E.D.

Online Index Operations for indexes containing LOB columns

August 5th, 2011

SQL Server supports online index and table rebuild operations which allow for maintenance operations to occur w/o significant downtime. While a table is being rebuild, or a new index is being built on it, the table is fully utilizable. It can be queried and any updates done to the table while the online rebuild operation is occurring will be contained in the final rebuilt table. A detailed explanation on how these online rebuild operations work can be found in the Online Indexing Operations in SQL Server 2005 white paper. But Online Index Build operations in SQL Server 2005, 2008 and 2008 R2 do not support tables that contain LOB columns, attempting to do so would trigger an error:

Msg 2725, Level 16, State 2, Line …
An online operation cannot be performed for index ‘…’ because the index contains column ‘…’ of data type text, ntext, image, varchar(max), nvarchar(max), varbinary(max), xml, or large CLR type. For a non-clustered index, the column could be an include column of the index. For a clustered index, the column could be any column of the table. If DROP_EXISTING is used, the column could be part of a new or old index. The operation must be performed offline.

To be accurate the restriction applies not to tables, but to any index or heap that contains an LOB column. That, of course, includes any clustered index or the base heap of a table if the table contains any LOB columns, but it would include any non-clustered index that includes a LOB column. In other words I can rebuild online non-clustered indexes of any table as long as they don’t use the INCLUDE clause to add a LOB column from the base table, but for sure I cannot rebuild online the table itself (meaning the clustered index or the heap). Nor can I add a clustered index to a base heap online, if the table contains LOB columns.

The question whether one should just use VARCHAR(MAX) and stop worrying about the chosen field size has came up on StackOverflow several times (varchar(max) everywhere?) and I always pointed out that there are at least some limitations (impossibility to do do online maintenance rebuild operations, impossibility to index such fields) and also all MAX types have a slight performance overhead, see Performance comparison of varchar(max) vs. varchar(N)).

Online Index Build, now with LOBs

Starting with SQL Server 11 it is actually permitted to build (and rebuild) online indexes and heaps containing LOB columns. The old legacy types (text, ntext and image) are not supported, not surprising considering that these types are on the deprecation path.

To understand why the original online rebuild operations from previous versions did not support LOB columns we need to consider the SQL Server Table and Index Organization. All indexes and tables consist of three allocation units: one for row data, one for overflow row data and one for LOB data. We can see this if we inspect the sys.system_internals_allocation_units system catalog view:


create table test (id int not null identity(1,1),
	somevar1 varchar(6000),
	somevar2 varchar(6000),
	someblob varchar(max))
go

insert into test (somevar1, somevar2, someblob) values ('A', 'B', 'C');
insert into test (somevar1, somevar2, someblob) values
         (replicate('A', 6000), replicate('B', 6000), replicate('C', 8000))
go

select au.*
from sys.system_internals_allocation_units au
join sys.system_internals_partitions p on au.container_id = p.partition_id
where p.object_id = object_id('test');
go

Our test table shows three allocation units. Now lets rebuild our table and look again at our allocation units:


alter table test rebuild;
go

select au.*
from sys.system_internals_allocation_units au
join sys.system_internals_partitions p on au.container_id = p.partition_id
where p.object_id = object_id('test');
go

We can see that our DATa and SLOB (aka. row overflow) allocation units have changed because they were rebuilt (they have different IDs and start at different pages). But the important thing is that the BLOB allocation unit has not changed. After the offline table rebuild, it has the same ID and starts at the same pages. This is because table and index rebuild operations do not rebuild the LOB data. They rebuild the row data and the row-overflow data, but the newly built rows will simply point back to the same old LOB data. The idea is that tables with LOB columns have large LOB values and rebuilding the LOB data would be prohibitive, with little or no benefit.

Offline operations can avoid rebuilding the LOB data without problems, but for online index and table rebuilds this poses an issue: for the duration of the online rebuild operation both the old rowset (the old index/table) and the new rowset would point to the same LOB data while updates are being made to rows.

In SQL Server 11 this problem was solved and now online operations can rebuild indexes and tables with LOB columns while keeping the data in the LOB allocation unit in a consistent state. SQL Server will internally track how LOB data is referenced by both the old index and the new index being built and will take appropriate actions to manage the sharing of the LOB data.

Limitations

The following restrictions and limitations apply only for the duration of the Online Index Rebuild operation:

Partial LOB .WRITE updates are transformed into full updates.
LOB data supports a highly efficient update mode, the .WRITE syntax. This is critical in creating streaming semantics, see Download and Upload images from SQL Server via ASP.Net MVC. When the .WRITE syntax is used on a LOB column belonging to an index that is being rebuilt online the generated plan will silently change it into a full value update, which generates significantly more log. If you rely heavily on this functionality be aware and schedule your online rebuilds accordingly.
DBCC CHECK operations will skip the consistency check of LOB allocation units belonging to indexes that are in the process of being rebuilt online.
During the online operation the LOB allocation unit is shared between the old index and the new index and is consistent if you consider both owners, however it may look inconsistent if considered from either one of the owner point of view.
File SHRINK operation will skip pages belonging to LOB allocation units belonging to indexes that are in the process of being rebuilt online.
If LOB data is shrunk the pointers in the ROW data referencing the LOB data that had moved have to be updated. While an online index rebuild occurs there could be two sets of pointers referencing the same LOB data, one in the old rowset and one in the new rowset.