Online Index Operations for indexes containing LOB columns

August 5th, 2011

SQL Server supports online index and table rebuild operations which allow for maintenance operations to occur w/o significant downtime. While a table is being rebuild, or a new index is being built on it, the table is fully utilizable. It can be queried and any updates done to the table while the online rebuild operation is occurring will be contained in the final rebuilt table. A detailed explanation on how these online rebuild operations work can be found in the Online Indexing Operations in SQL Server 2005 white paper. But Online Index Build operations in SQL Server 2005, 2008 and 2008 R2 do not support tables that contain LOB columns, attempting to do so would trigger an error:

Msg 2725, Level 16, State 2, Line …
An online operation cannot be performed for index ‘…’ because the index contains column ‘…’ of data type text, ntext, image, varchar(max), nvarchar(max), varbinary(max), xml, or large CLR type. For a non-clustered index, the column could be an include column of the index. For a clustered index, the column could be any column of the table. If DROP_EXISTING is used, the column could be part of a new or old index. The operation must be performed offline.

To be accurate the restriction applies not to tables, but to any index or heap that contains an LOB column. That, of course, includes any clustered index or the base heap of a table if the table contains any LOB columns, but it would include any non-clustered index that includes a LOB column. In other words I can rebuild online non-clustered indexes of any table as long as they don’t use the INCLUDE clause to add a LOB column from the base table, but for sure I cannot rebuild online the table itself (meaning the clustered index or the heap). Nor can I add a clustered index to a base heap online, if the table contains LOB columns.

The question whether one should just use VARCHAR(MAX) and stop worrying about the chosen field size has came up on StackOverflow several times (varchar(max) everywhere?) and I always pointed out that there are at least some limitations (impossibility to do do online maintenance rebuild operations, impossibility to index such fields) and also all MAX types have a slight performance overhead, see Performance comparison of varchar(max) vs. varchar(N)).

Online Index Build, now with LOBs

Starting with SQL Server 11 it is actually permitted to build (and rebuild) online indexes and heaps containing LOB columns. The old legacy types (text, ntext and image) are not supported, not surprising considering that these types are on the deprecation path.

To understand why the original online rebuild operations from previous versions did not support LOB columns we need to consider the SQL Server Table and Index Organization. All indexes and tables consist of three allocation units: one for row data, one for overflow row data and one for LOB data. We can see this if we inspect the sys.system_internals_allocation_units system catalog view:


create table test (id int not null identity(1,1),
	somevar1 varchar(6000),
	somevar2 varchar(6000),
	someblob varchar(max))
go

insert into test (somevar1, somevar2, someblob) values ('A', 'B', 'C');
insert into test (somevar1, somevar2, someblob) values
         (replicate('A', 6000), replicate('B', 6000), replicate('C', 8000))
go

select au.*
from sys.system_internals_allocation_units au
join sys.system_internals_partitions p on au.container_id = p.partition_id
where p.object_id = object_id('test');
go

Our test table shows three allocation units. Now lets rebuild our table and look again at our allocation units:


alter table test rebuild;
go

select au.*
from sys.system_internals_allocation_units au
join sys.system_internals_partitions p on au.container_id = p.partition_id
where p.object_id = object_id('test');
go

We can see that our DATa and SLOB (aka. row overflow) allocation units have changed because they were rebuilt (they have different IDs and start at different pages). But the important thing is that the BLOB allocation unit has not changed. After the offline table rebuild, it has the same ID and starts at the same pages. This is because table and index rebuild operations do not rebuild the LOB data. They rebuild the row data and the row-overflow data, but the newly built rows will simply point back to the same old LOB data. The idea is that tables with LOB columns have large LOB values and rebuilding the LOB data would be prohibitive, with little or no benefit.

Offline operations can avoid rebuilding the LOB data without problems, but for online index and table rebuilds this poses an issue: for the duration of the online rebuild operation both the old rowset (the old index/table) and the new rowset would point to the same LOB data while updates are being made to rows.

In SQL Server 11 this problem was solved and now online operations can rebuild indexes and tables with LOB columns while keeping the data in the LOB allocation unit in a consistent state. SQL Server will internally track how LOB data is referenced by both the old index and the new index being built and will take appropriate actions to manage the sharing of the LOB data.

Limitations

The following restrictions and limitations apply only for the duration of the Online Index Rebuild operation:

Partial LOB .WRITE updates are transformed into full updates.
LOB data supports a highly efficient update mode, the .WRITE syntax. This is critical in creating streaming semantics, see Download and Upload images from SQL Server via ASP.Net MVC. When the .WRITE syntax is used on a LOB column belonging to an index that is being rebuilt online the generated plan will silently change it into a full value update, which generates significantly more log. If you rely heavily on this functionality be aware and schedule your online rebuilds accordingly.
DBCC CHECK operations will skip the consistency check of LOB allocation units belonging to indexes that are in the process of being rebuilt online.
During the online operation the LOB allocation unit is shared between the old index and the new index and is consistent if you consider both owners, however it may look inconsistent if considered from either one of the owner point of view.
File SHRINK operation will skip pages belonging to LOB allocation units belonging to indexes that are in the process of being rebuilt online.
If LOB data is shrunk the pointers in the ROW data referencing the LOB data that had moved have to be updated. While an online index rebuild occurs there could be two sets of pointers referencing the same LOB data, one in the old rowset and one in the new rowset.

Online non-NULL with values column add in SQL Server 11

July 13th, 2011

Prior to SQL Server 11 when you add a new non-NULLable column with default values to an existing table a size-of data operation occurs: every row in the table is updated to add the default value of the new column. For small tables this is insignificant, but for large tables this can be so problematic as to completely prohibit the operation. But starting with SQL Server 11 the operation is, in most cases, instantaneous: only the table metadata is changed, no rows are being updated.

Lets look at a simple example, we’ll create a table with some rows and then add a non-NULL column with default values. First create and populate the table:

create table test (
	id int not null identity(1,1) primary key,
	someValue int not null);
go

set nocount on;
insert into test (someValue) values (rand()*1000);
go 1000

We can inspect the physical structure of the table’s records using DBCC PAGE. First lets find the page that contains the first record of the table:

select %%physloc%%, * from test where id = 1;

In my case this returned 0xD900000001000000, which means slot 0 on page 0xD9 (aka. 217) of file 1, and my test database has the DB_ID 6. Hence the parameters to DBCC PAGE

dbcc traceon (3604,-1)
dbcc page(6,1,217,3)

Page @0x0000000170D5E000

m_pageId = (1:217)                  m_headerVersion = 1                 m_type = 1
m_typeFlagBits = 0x0                m_level = 0                         m_flagBits = 0x200
m_objId (AllocUnitId.idObj) = 84    m_indexId (AllocUnitId.idInd) = 256
Metadata: AllocUnitId = 72057594043432960
Metadata: PartitionId = 72057594039042048                                Metadata: IndexId = 1
Metadata: ObjectId = 245575913      m_prevPage = (0:0)                  m_nextPage = (1:220)
pminlen = 12                        m_slotCnt = 476                     m_freeCnt = 4
m_freeData = 7236                   m_reservedCnt = 0                   m_lsn = (30:71:25)
m_xactReserved = 0                  m_xdesId = (0:0)                    m_ghostRecCnt = 0
m_tornBits = 2135435720             DB Frag ID = 1                      

Allocation Status

GAM (1:2) = ALLOCATED               SGAM (1:3) = ALLOCATED
PFS (1:1) = 0x60 MIXED_EXT ALLOCATED   0_PCT_FULL                        DIFF (1:6) = CHANGED
ML (1:7) = NOT MIN_LOGGED           

Slot 0 Offset 0x60 Length 15

Record Type = PRIMARY_RECORD        Record Attributes =  NULL_BITMAP    Record Size = 15

Memory Dump @0x000000000AEBA060

0000000000000000:   10000c00 01000000 34020000 020000†††††††††††††........4......

Slot 0 Column 1 Offset 0x4 Length 4 Length (physical) 4
id = 1
Slot 0 Column 2 Offset 0x8 Length 4 Length (physical) 4
someValue = 564

Note the last LSN that updated the page (30:71:25) and the size of the record in slot 0 (15 bytes). Now lets add a non-NULL column with default values:

alter table test add otherValue int not null default 42 with values;

We can select from the table and see that the table was changed and the rows have value 42 for the newly added column:

select top(2) * from test;

id          someValue   otherValue
----------- ----------- -----------
1           564         42
2           387         42

Yet if we inspect again the page, we can see that is unchanged:

dbcc traceon (3604,-1)
dbcc page(6,1,217,3)

Page @0x0000000170D5E000

m_pageId = (1:217)                  m_headerVersion = 1                 m_type = 1
m_typeFlagBits = 0x0                m_level = 0                         m_flagBits = 0x200
m_objId (AllocUnitId.idObj) = 84    m_indexId (AllocUnitId.idInd) = 256
Metadata: AllocUnitId = 72057594043432960
Metadata: PartitionId = 72057594039042048                                Metadata: IndexId = 1
Metadata: ObjectId = 245575913      m_prevPage = (0:0)                  m_nextPage = (1:220)
pminlen = 12                        m_slotCnt = 476                     m_freeCnt = 4
m_freeData = 7236                   m_reservedCnt = 0                   m_lsn = (30:71:25)
m_xactReserved = 0                  m_xdesId = (0:0)                    m_ghostRecCnt = 0
m_tornBits = 2135435720             DB Frag ID = 1                      

Allocation Status

GAM (1:2) = ALLOCATED               SGAM (1:3) = ALLOCATED
PFS (1:1) = 0x60 MIXED_EXT ALLOCATED   0_PCT_FULL                        DIFF (1:6) = CHANGED
ML (1:7) = NOT MIN_LOGGED           

Slot 0 Offset 0x60 Length 15
Record Type = PRIMARY_RECORD        Record Attributes =  NULL_BITMAP    Record Size = 15

Memory Dump @0x000000000E83A060
0000000000000000:   10000c00 01000000 34020000 020000†††††††††††††........4......

Slot 0 Column 1 Offset 0x4 Length 4 Length (physical) 4
id = 1                              

Slot 0 Column 2 Offset 0x8 Length 4 Length (physical) 4
someValue = 564                     

Slot 0 Column 3 Offset 0x0 Length 4 Length (physical) 0
otherValue = 42 

The page header is unchanged, the last LSN is still (30:71:25), proof that the page was not modified, and the physical record is unchanged and has the same size as before. Yet DBCC shows a Column 3 and its value 42! If you pay attention you’ll notice that the Column 3 though has an Offset 0×0 and a physical length of 0. Column 3 is somehow materialized out of thin air, as it does not physically exists in the record on this page. The ‘magic’ is that the table metadata has changed and it now contains a column with a ‘default’ value:

select pc.* from sys.system_internals_partitions p
	join sys.system_internals_partition_columns pc on p.partition_id = pc.partition_id
	where p.object_id = object_id('test');

Notice that sys.system_internals_partition_columns now has two new columns that are SQL Server 11 specific: has_default and default_value. The column we added to the test table (the third row in the image above) has a default with value 42. This is how SQL Server 11 knows how to show a value for Column 3 for this record, even though is physically missing on the page. With this ‘magic’ in place the ALTER TABLE will no longer have to update every row in the table and the operation is fast, metadata-only, no matter the number of rows in the table. This new behavior occurs automatically, no special syntax or setting is required, the engine will simply do the right thing. There is no penalty from having a missing value in a row. The ‘missing’ value can be queried, updated, indexed, exactly as if the update during ALTER TABLE really occurred. There is no measurable performance penalty from having a default value.

What happens when we update a row? The ‘default’ value is pushed into the row, even if the column was not modified. Consider this update:

update test set someValue = 565 where id = 1;

Although we did not touch the otherValue column, the row now was modified and it contains the materialized value:

dbcc page(6,1,217,3)

...
m_freeData = 7240                   m_reservedCnt = 0                   m_lsn = (31:271:2)
...
Slot 0 Offset 0x1c35 Length 19

Record Type = PRIMARY_RECORD        Record Attributes =  NULL_BITMAP    Record Size = 19

Memory Dump @0x000000000AB8BC35

0000000000000000:   10001000 01000000 35020000 2a000000 030000††††........5...*......

Slot 0 Column 1 Offset 0x4 Length 4 Length (physical) 4
id = 1                              

Slot 0 Column 2 Offset 0x8 Length 4 Length (physical) 4
someValue = 565                     

Slot 0 Column 3 Offset 0xc Length 4 Length (physical) 4
otherValue = 42         

KeyHashValue = (8194443284a0)
Slot 1 Offset 0x60 Length 15
Record Type = PRIMARY_RECORD        Record Attributes =  NULL_BITMAP    Record Size = 15

Memory Dump @0x000000000AB8A060
0000000000000000:   10000c00 02000000 83010000 020000†††††††††††††........ƒ......

Slot 1 Column 1 Offset 0x4 Length 4 Length (physical) 4
id = 2                              

Slot 1 Column 2 Offset 0x8 Length 4 Length (physical) 4
someValue = 387                     

Slot 1 Column 3 Offset 0x0 Length 4 Length (physical) 0
otherValue = 42

Notice how the physical record has increased in size (19 bytes vs. 15), the record has the value 42 in it (the hex 2a000000) and the Column 3 now has a real offset and physical size. So the update has trully materialized the default value in the row image. I intentionally copied the output of DBCC PAGE for the next slot in the page, to show that the record with id=2 was unaffected, it continues to have a smaller size of 15 bytes and Column 3 has no physical length.

Default value vs. Default constraint

Is worth saying that the new SQL Server 11 default column value is not the same as the default value constraint. The default value is captured when the ALTER TABLE statement is run and can never change. Only rows existing in the table at the time of running ALTER TABLE statement will have missing ‘default’ values. By contrast the default constraint can be dropped or modified and new rows inserted after the ALTER TABLE will always have a value present in row for the new column. Any REBUILD operation on the table (or on the clustered index) will materialize all the missing values as the rows are being copied from the old hobt to the new hobt. The new hobt columns (sys.system_internals_partition_columns) will loose the has_default and default_value attributes, in effect loosing any trace that this column was added online. A default constraint by contrast will be preserved as a table is rebuilt.

Restrictions

Not all data types and default values can be added online. BLOB values like varchar(max), nvarchar(max), varbinary(max) and XML cannot be added online (and frankly I see no valid data model that has a non-NULL BLOB with a default…). Types that cannot be converted to sql_variant cannot be added online, like hierarchy_id, geometry and geography or user CLR based UDTs. Default expressions that require a different value for each row, like NEWID or NEWSEQUENTIALID cannot be added online (the default expression has to be a runtime constant, not to be confused with a deterministic expression, see Conor vs. Runtime Constant Functions for more details). In the case when the newly added column increases the maximum possible row size over the 8060 bytes limit the column cannot be added online. And is an Enterprise Edition only feature. For all the cases above the behavior will revert to adding the column ‘offline’, by updating every row in the table during the ALTER TABLE statement, creating a size-of-data update. When such a situation occurs a new XEvent is fired, which contains the reason why a size-of-data update occurred: alter_table_update_data.

SIGMOD 2011 Webcasts

June 23rd, 2011

This year SIGMOG keynotes are available online at https://services.choruscall.eu/links/sigmod1106.html

  • June 13th: A Quest for Beauty and Wealth (or, Business Processes for Database Researchers)
  • June 14th: Internet-Scale Storage (I highly recommend this session)
  • June 15th: SIGMOD Awards (this includes a segment on the SQL Server team)
  • June 16th: Managing Scientific Data

Scale out SQL Server by using Reliable Messaging

June 1st, 2011

How do large-scale sites and applications remain SQL-based? is a recent article from Michael Rys (Blog|Twitter) that highlights the relational SQL Server based web-scale deployment at MySpace. I have talked before about how MySpace uses Service Broker as a reliable messaging backbone to power the communications between +1000 databases, allowing them to scale-out and partition the user information into individual shards. Here are some more details about this architecture:

This new article uses the MySpace deployment as a case study to counter balance the claim that large web-scale deployments require the use of NoSQL storage because relational database cannot scale. BTW I know the SQL vs. NoSQL discussion is more subtle, but I won’t enter into details here. I think a good read on that topic is NoSQL vs. RDBMS: Let the flames begin!.

Why is reliable messaging a key tenet of implementing a data-partitioned scale-out application? Consider a typical example of an modern web-scale application: users connect, get authenticated and view their own profile, but they are also interested in the status updates or wall messages from their network of friends or followers. It is easy to see how one partitions the user profile data, but how do you partition the user ‘walls’? User A is in a partition hosted on node 1, while user B is in a partition hosted by node 2, when User A update his status, how does this show up on User B’s wall?

One option is to have the application update the wall of User B when User A changes his status. But this prevents scalability, because now writes have to occur on many nodes and the application has to orchestrate all these writes. Think Lady GaGa updating her status, the application has to update the wall of every follower.

Another option is to have the application read the status from User A when displaying the wall of User B. This also doesn’t scale, because reads now occur on many nodes. All those Lady GaGa followers refreshing their wall page have to read from the one node hosting her status, and the node is soon overwhelmed.

A solution is to replicate the write on User A’s status onto User B’s wall. The application only updates the User A status, and the infrastructure propagates the this update to User B’s wall.

But traditional replication was historically designed for replicating entire data sets of fixed schema over static topologies, and it falls short of replicating web-scale deployments of hundreds and thousands of nodes:

  • it depends too tightly on physical location and it cannot adapt to rapid changes of topology (nodes being added and removed)
  • its based on schema defined filtering which is difficult to map to the complex application specific data routing conditions (this update goes to node 2 because User B follows User A, but that update goes to node 3 because User D follows User C)
  • its very sensitive to schema changes making application upgrade roll outs a big challenge

Messaging is designed with application-to-application communication in mind and has different semantics that are more friendly on large scale-out deployments:

  • Logical routing to isolate application from topology changes
  • Protocol versioning information allows side-by-side deployments making application upgrade roll outs possible
  • Data schema can change more freely as peers are shielded from changes by keeping the communication protocol unchanged

With messaging in place the application updates the status of User A and the drops a message into a local outbound queue to notify all friends of A. The reliable messaging infrastructure dispatches this message to all nodes interested and processing of this message results in an update of User B wall. MySpace uses Service Broker as the reliable messaging infrastructure, and they make up for the lack of publish/subscribe by adding a router-dispatcher unit: all messages are sent to this dispatcher and the dispatcher in turn figures out how many subscribers have to be notified, and sends individual messages to them. See the slides linked at the beginning of my post for more details. As a side note, I see that StackExchange is also embarking on the route of creating a message bus for their pub/sub infrastructure, but they use Redis as a message store, see async Redis await BookSleeve.

Robert Scoble had a post MySpace’s death spiral: insiders say it’s due to bets on Los Angeles and Microsoft on which it claims that MySpace insiders blame their loss to Facebook to, amongst other things, the complexity of this infrastructure:

Workers inside MySpace tell me that this infrastructure, which they say has “hundreds of hacks to make it scale that no one wants to touch” is hamstringing their ability to really compete.

I do no know if the sources quoted by the article are right or wrong, and I was never personally involved with the MySpace deployment, but since I was so closely involved in building SQL Service Broker and as SSB is one of the critical components used by MySpace infrastructure, I am understandably interested in this discussion. Service Broker has a notoriously steep learning curve, there are no tools for administer, monitor and troubleshoot Service Broker deployments, and there is absolutely no support in the client programming stack (the managed .Net Framework). This is why all solutions that deploy Service Broker that I know of are large enterprise shops that are staffed with architects that understand the set of unique advantages this technology brings, and are willing to venture into uncharted waters despite the prospect of being impossible to hire development and ops talent with Service Broker expertise. But also the feedback I hear from these deployments is almost always positive: once deployed, Service Broker just works. Someone told me that the Service Broker solution they had was the only piece of technology that “did not break in the last 3 years” and that covers an upgrade from SQL Server 2005 to SQL Server 2008. See Is Service Broker in SQL Server 2008 compatible with the one in SQL Server 2005? to see how Service Broker helps address infrastructure upgrade problems. Personally I am not surprised that Service Broker turns out to be a cornerstone of the response SQL Server has to give to the NoSQL challenge, but this vitally unheard of technology (89 questions tagged service-broker on StackOverflow at time of writing this) will be quite a surprise answer for many.

How to determine the database version of an MDF file

April 4th, 2011

If you have an MDF file laying around and you don’t know what version is, how can you determine the version, preferably without altering the file? You could try to attach it to a SQL Server instance, but if you’re not careful the operation may end up upgrading the MDF file to that instance’s current version. Even if it doesn’t upgrade it because it happens to be the same version as the one supported by that instance, it can still run recovery on the database and thus alter the file. Is there a way to determine the version in a non destructive fashion?

Lets look into a database and dump a page. Not any page, but page 9 of the primary filegroup. This page happens to be the database boot page:

dbcc traceon(3604)
dbcc page(1,1,9,3);

the important bits of information are these:

...
Memory Dump @0x00000000124FA060
0000000000000000:   0000a405 95029502 00000000 00000000 †..¤.•.•.........
...
DBINFO @0x00000000124FA060
...
dbi_dbid = 1                         dbi_status = 65544                   dbi_nextid = 1723153184
dbi_dbname = master                  dbi_maxDbTimestamp = 4000            dbi_version = 661
dbi_createVersion = 661              dbi_ESVersion = 0
...

So at offset 0×60 on the page (96 decimal, which we know is the size of the page header) there is a DBINFO structure. This structure contains the dbi_version 661, which is 0×295 in hex. In the DBCC PAGE output this would show as 9502, and we can see that there are two such values at offset 0×64 and 0×66. So the database version is stored in the two bytes at offset 0×64 on the 9th page of the primary filegroup. The 9 page will be at offset 0×12000 in the file (just take the page size, 8k, and multiply it by 9). So the version of the MDF will be the DWORD value at offset 0×12064 in the file.

There are many ways you can read these bytes using your favourite hex file viewer/editor. Even Powershell can do it:

PS > get-content -Encoding Byte "...\foo.mdf" | select-object -skip 0x12064 -first 2
149
2
PS > 2*256+149
661

The first Powershell line dumped the two bytes at offset 0×12064 in the file: 149 and 2 (Powersell displays the value in decimal encoding…). Convert the two bytes to a DWORD gives us the MDF file version: 661, which is the SQL Server 2008 R2 version.

Erland’s Sommarskog adds another gem to his treasure chest

February 21st, 2011

There are probably no articles that I reference more often than Erland’s. There are questions on forums, discussion groups and stackoverflow that come back over and over, again and again. For some of these topics Elands has prepared articles that are, ultimately, the article to point anybody for more details on that topic. His coverage is deep, the articles are well structured, and they pretty much exhaust the topic. If somebody still have questions about that topic after reading the article, then I say that person knows too much about SQL Servero … or not enough. I’m writing this blog entry not so much as to announce his latest addition Slow in the Application, Fast in SSMS? Understanding Performance Mysteries (is hard for me to believe anyone that follows my blog does not know about them…) but more to create for myself a memo pad from where to pick up the link to the articles whenever I need them:

The Curse and Blessings of Dynamic SQL
Dynamic-SQL, the technique of generating and execute SQL code on-the-fly is sometimes irreplaceable, sometimes abused, and almost always done in a SQL-injection prone, dangerous, way. This article covers the pros and cons of this technique, when to use and when not, how to use it and what pitfalls to avoid.
Arrays and Lists in SQL Server
In lack of a true array type in the Transact-SQL language, developers have resorted to all sort of workarounds to pass sets of data as parameters to Transact-SQL procedures and batches: strings containing comma delimited items, XML variables, temp tables, table value parameters. Erland has a series of articles covering this topic in SQL Server 2008 (where table valued parameters are available), SQL Server 2005 (where XML data type became available) and SQL Server 2000 (the Dark Age).
How to Share Data Between Stored Procedures
All you need to know about how to share data between procedures and how to choose among the various solutions: table valued parameters, table valued functions, temp tables, XML data type, INSERT-EXEC, OPENQUERY, cursor variables, CLR.
Error Handling
Another series that covers error handling in Transact-SQL in SQL Server 2005 and later using TRY-CATCH blocks as well as SQL Server 2000 before TRY-CATCH blocks were available.
Giving Permissions through Stored Procedures
Code signing is, by a large margin, the perfect way to effectively grant granular control to non-privileged users for specific operations in SQL Server 2005 and later. This powerful mechanism is though seldom used, because of the arcane ‘black magic’ that goes into getting the sequence of operations correct to effectively sign a procedure, and because of the complication that arrise from the EXECUTE AS impersonation context required for signed procedures. This article goes to great length in shading some light on this esoteric topic.
Slow in the Application, Fast in SSMS?
This article describes parameter-sniffing, the process of ‘tuning’ a query plan to a specific value of the input parameters, and how this process can sometime ‘pollute’ the query plan cache with a plan that is optimized for a specific set of parameters, but performs poorly on other parameter values. The article describes what is parameter sniffing, how to identify when it happens, and how to fix it and prevent it.

This server supports version 662 and earlier…

November 23rd, 2010

A new error started showing up in SQL Server 2008 SP2 installations:

The database cannot be opened because it is version 661. This server supports version 662 and earlier. A downgrade path is not supported.

661 sure is earlier than 662, so what seems to be the problem? This error message is a bit misleading. SQL Server 2008 supports database version 655 and earlier. But with support for 15000 partitions in SQL Server 2008 SP2, databases enabled for 15000 partitions are upgraded to version 662. This upgrade is necessary to prevent an SQL Server 2008 R2 instance from attaching a database that has more than 1000 partitions in it, since the code in R2 RTM does not understand 15000 partitions and the effects would be unpredictable. So SQL Server 2008 SP2 does indeed support version 662, but it does not support version 661. This behavior is explained in the Support for 15000 Partitions.docx document, although the database versions involved are not explicitly called out.

So the error message above should be really read as:

The database cannot be opened because it is version 661. This server supports versions 662, 655 and earlier than 655. A downgrade path is not supported

With this information the correct resolution can be achieved: the user is trying to attach a SQL Server 2008 R2 database (v. 661) to an SQL Server 2008 SP2 instance. This is not supported. User has to either upgrade the SQL Server 2008 SP2 instance to SQL Server 2008 R2, or it has to attach the database back to a R2 instance and copy out the data from the database into SQL Server 2008 instance database, eg. using the Import and Export Wizard.

TRY CATCH THROW: Error handling changes in T-SQL

November 22nd, 2010

When SQL Server 2005 introduced BEGIN TRY and BEGIN CATCH syntax, it was a huge improvement over the previous error handling based on @@ERROR check after each statement. Finally, T-SQL joined the rank of programming languages, no more just a data access language. Experience has shown that exception handling leads to better code compared to error checks. Yes, SEH is slower, but is basically impossible to maintain the code discipline to check @@ERROR after every operation, so exception handling is just so much easier to get right. And besides, @@ERROR never had such a masterpiece article to guide you trough like A Crash Course on the Depths of Win32™ Structured Exception Handling.

But when trying to use the new TRY/CATCH exception handling in T-SQL code, one problem quickly became apparent: the CATCH block was masking the original error metadata: error number/severity/state, error text, origin line and so on. Within a CATCH block the code was only allowed to raise a *new* error. Sure, the original error information could be passed on in the raised error message, but only as a message. The all important error code was changed. This may seem like a minor issue, but turns out to have a quite serious cascading effect: the caller now has to understand the new error codes raised by your code, instead of the original system error codes. If the application code was prepared to handle deadlocks (error code 1205) in a certain way (eg. retry the transaction), with a T-SQL TRY/CATCH block the deadlock error code would all of the sudden translate into something above 50000.

Read the rest of this entry »

AlwaysOn: High-Availability and reads Scale-Out

November 11th, 2010

Along with SQLPASS Summit 2010 announcements on SQL Server “Denali” features the MSDN site has published preliminary content on some of these features. Going over the “HADR” Overview (SQL Server) content we can get an early idea about this feature. This post summarizes the AlwaysOn technology, and compares it with its predecessor and close cousin, Database Mirroring. For brevity, I am intentionally omitting a lot of details.

The AlwaysOn technology in SQL Server “Denali”, also known by the project name “HADR” and often called Hadron, it is a huge improvement over its predecessor, Database Mirroring. Like Mirroring, AlwaysOn is also based on physical replication of database by shipping over the transaction log. In fact, it is not only similar to Database Mirroring but actually using the DBM technologies to replicate the database. The steps to set up AlwaysOn contain the steps to set up a Mirroring sessions, and the mirroring endpoints, catalog views and DMVs are still used to set up and monitor AlwaysOn. But AlwaysOn brings three more Aces to the table to make an unbeatable play:

Availability Groups
Databases with dependencies on one another fail over together, as a group.
Multiple Secondaries
AlwaysOn will allow for multiple standby replicas for each availability group.
Readable Secondaries
The standby replicas are accessible for read-only operations.

The following table shows the main differences between AlwaysOn and Database Mirroring:

Feature Mirrroring AlwaysOn
Unit of failover Single Database Availability Group
Secondary Access Database Snapshots Read-Only access
Quorum and failover Witness Windows Server Clustering
Number of secondaries Exactly one Number of nodes in the cluster

Availability Groups

DBM considers each mirrored database as an individual entity:

But often application are deployed on several databases contained in an instance and there are tight dependencies between them (eg. cross database queries and procedures). With DBM one had to set up complicated logic to force failover of all related databases if one individual database was occurring a failover event. AlwaysOn introduces Availability Groups that allows explicit declaration of such dependencies. A failover occurs always as an entire group, all databases in the group fail over together become available on the new primary host.

The individual databases inside an Availability Group are still replicated using the Database Mirroring technology:

Multiple Secondaries

Database Mirroring is a private affair between two hosts. A mirroring session can only have on Principal and one Mirror, and they can switch roles. From a High Availability point of view the risk involved in losing one of the partner was fairly serious. Until the operations were set in place to either add a new partner or bring back the old partner online, the system would run at the risk of loosing availability on any further incident. With AlwaysOn there can be multiple stand-by secondary availability groups, on multiple hosts. This allows for a much safer operations, where multiple failures could be survived without loss of availability. This reduces the cost of achieving ‘five nines’ availability numbers, and eases planning, deployment and operating of mission critical systems.

The figure bellow show a possible AlwaysOn deployment on a 4 node cluster:

  • Availability Group 1 contains 2 databases and 3 nodes of the cluster have joined this availability group. The SQL Server instance on node A is the the primary replica and the ones nodes B and C each host an availability replica.
  • Availability Group 2 contains one database, it has the primary availability group running on the SQL Server instance on node B of the cluster and an availability secondary replica on the instance on node A of the cluster.
  • Availability Group 3 contains five databases, it has the primary availability group running on the SQL Server instance on node D of the cluster and an availability secondary replica on the instance on node C of the cluster.
  • Availability Group 4 contains two databases, it has the primary availability group running on the SQL Server instance on node B of the cluster and an availability secondary replica on the instance on node D of the cluster.

Readable Secondaries

Secondary replicas are readable in real-time. All read-only operations are allowed on databases in the secondary availability groups. Coupled with the capability to have multiple secondaries, this gives a very nice solution for Scale Out reads. Unlike Database Mirroring database snapshot solution, that was giving a point-in-time snapshot view of the database, readable secondaries allow for real query access to the content of the secondary database, in real-time for up-to-date changes. Access is read-only, no updates are permitted, and all queries run automatically under snapshot isolation model (lock hints and explicitly set isolation levels are ignored). The queries always get a transactionally consistent view of the data, as fresh as allowed by the underlying mirroring session that continuously updates the database. With synchronous mirroring this means that the scale-out achieves the all elusive golden standard of reads that are always perfectly consistent with the last committed writes *on any replica*, with no lag! So far this was impossible to achieve with any other scale-out technology [Correction 11/22: As Gopal points out bellow, this is still not achievable because synchronous mirroring only requires the shipped log to be hardened to disk, not redone. You can get pretty close, but but the application still needs to be prepared to handle stale reads].

The following image shows a possible AlwaysOn reads Scale-Out deployed on a 3 node cluster to serve a web farm. Read requests can be server by any of the availability replicas, including the Primary Replica. Write requests have to be all routed to the Primary Replica:

What gives

Database Mirroring has always been a viable choice for small businesses. Available in Standard Edition, running on commodity hardware and with minimal licensing cost requirements, it was basically dirt cheap to implement. AlwaysOn is on a different league. Its availability is no longer based on quorum decided by a witness, as in Database Mirroring, but instead is based on Windows Server Failover Clustering. WSFC is a premium feature available only on Enterprise and DataCenter editions of the Windows Server family, and is supported only on hardware that has passed the Validate a Configuration Wizard. Small businesses can still deploy basic Database Mirroring, but they will have to support a significantly higher entry cost to enjoy the advantages of the AlwaysOn technology.

Plagiarizing Forum answers

September 28th, 2010

Today I learned about anew form of plagiarism: forum answers without proper attribution. There is an user on MSDN that cowers under the moniker of Learning_SQL. What the weasel does it trolls the MSDN forums for questions, takes the question verbatim and posts it on StackOverflow, then takes the answers from StackOverflow and posts them back as his own answer on MSDN.

See http://stackoverflow.com/questions/3816683/index-to-speed-up-delete and http://social.msdn.microsoft.com/Forums/en/transactsql/thread/00667b94-1fb5-4238-b169-c596a2ae25ed.

Or http://stackoverflow.com/questions/3816023/sys-dm-fts-parser-permission and http://social.msdn.microsoft.com/Forums/en/transactsql/thread/a3ec4602-e0be-48f5-92ad-2eb25a6befc5.

http://stackoverflow.com/questions/3787986/need-an-idea-for-doing-bulk-compare-and-insert and http://social.msdn.microsoft.com/Forums/en-US/transactsql/thread/ed85fb0b-55db-474d-a45c-d715d659b0a8/#6157a812-82f8-473a-a689-d4217e1fbc13

http://stackoverflow.com/questions/3815502/stored-procedure-when-to-use-output-parameter-vs-return-variable/3815545#3815545 and http://social.msdn.microsoft.com/Forums/en-US/transactsql/thread/6013bdf2-1452-4495-865a-1bfcc8b361d0/#5f17fd95-16d0-40eb-b587-db35e44bdb69

This is not some accidental one copy of a answer, is a consitent behavior done with the sole purpose of acumulating MSDN reputation. Moral consideration aside, Learning_SQL is actually violating the StackOverflow attribution required content license:

  1. Visually indicate that the content is from Stack Overflow, Meta Stack Overflow, Server Fault, or Super User in some way. It doesn’t have to be obnoxious; a discreet text blurb is fine.
  2. Hyperlink directly to the original question on the source site (e.g., http://stackoverflow.com/questions/12345)
  3. Show the author names for every question and answer
  4. Hyperlink each author name directly back to their user profile page on the source site (e.g., http://stackoverflow.com/users/12345/username)

Incidents of plagiarization occur frequently on blogs, but with such a small comunity as the SQL Server one, where pretty much everybody knows everybody, the perpetrator doesn’t get too far. Brent Ozar (blog|Twitter) covered this subject on several articles:

But I guess imagination has no bounds, and now some new players (or same old ones?) are trying new grounds…