%%lockres%% collision probability magic marker: 16,777,215

May 29th, 2009

@jrowlandjones blogged about a dubious deadlock case. I recommend this article as is correct and presents a somewhat esoteric case of deadlock: the resource hash collision. The lock manager in SQL Server doesn’t know what it locks, it just locks ‘resources’ (basically strings). It is the job of higher level components like the the access methods of the storage engine to present the ‘resource’ to the lock manager and ask for the desired lock. When locking rows in a heap or a b-tree the storage engine will synthesize a ‘resource’ from the record identifier. Since these resources have a limited length, the storage engine has to reduce the effective length of a key to the maximum length is allowed to present to the lock manager, and that means that the record’s key will be reduced to 6 bytes. This is achieved by hashing the key into a 6 byte hash value. Nothing spectacular here.

But if you have a table with a key of length 50 bytes and its reduced to 6 bytes, you may hit a collision. So how likely is this to happen?

On 6 bytes there are 281,474,976,710,656 distinct possible values. Its a pretty big number? Not that big actually. If we meet at a party and I say ‘I bet somebody in the room shares your birthday’ would you bet against me? You probably should :) What if I change my question to ‘I bet there are two people in this room that share the birthday!’? Now I will probably take your money. This is called a ‘meet-in-the-middle‘ attack in cryptography and basically it says that you get a 50% collision probability at half the hash length. So the SQL %%lockres%% hash will produce two records with same hash, with a 50% probability, out of the table, any table, of only 16,777,215 record. That suddenly doesn’t look like a cosmic constant, does it? And keep in mind that this is the absolutely maximum value, where the key has a perfectly random distribution. In reality keys are not that random after all. Take Jame’s example: a datetime (8 bytes), a country code (1 byte), a group code (2 bytes) and a random id (4 bytes). From these 15 bytes quite a few are actually constant: eg. every date between 2000 and 2010 has the first 4 bytes identical (0×00) and the 5th byte only has two possible values (0×08 or 0×09). If from the other codes (country, group) we use only 50% of the possible values, then in effect we use, generously, just 10 bytes of the 15 bytes of the key. This means the table has a 50% collision probability at only about 11 million records. considering that he was doing a ‘paltry’ 900 million records upload, no wonder he got collisions…

Have you met  ? Say hello to my BOM

May 21st, 2009

I recently had to look at a problem where a programmer writing a tool to parse some XML produced by a service written by me was complaining that my service serves invalid XML. All my documents seemed mangled and started with . I couldn’t help but smile. So what is this mysterious sequence ? Well, lets check the ISO-8859-1 character encoding, commonly known as ‘Latin alphabet’ and often confused with the Windows code page 1252: ï is the character corresponding to 0xEF (decimal 239), » is 0xBB (decimal 187) and ¿ is 0xBF (decimal 191). So the mysterious sequence is 0xEFBBBF. Does it look familiar now? It should, this is the UTF-8 Byte Order Mark. Moral: if you consume and parse XML, make sure you consume it as XML, not as text. All XML libraries I know of correctly understand and parse the BOM. The only problems I’ve seen are from hand written ‘parsers’ that treat XML as a string (and most often fail to accommodate namespaces too…).

stackoverflow.com: how to execute well on a good idea

May 18th, 2009

Some time ago I started noticing on my google searches a newcomer: stackoverflow.com. At first I dismissed this as yet another SEO hack to divert traffic to some re-syndicated content of the old user groups and forums. But I was wrong. Turns out stackoverflow.com is an enterprise backed by well known industry names like Joel Spolsky of Joel on Software fame. Apparently I’ve been living in a cave, since this new site is quite popular and even doing a Stack Overflow DevDays tour!

Now being that I’m a forums and newsgroup addict with a long history of MSDN abuse, I had to join this one and start showing off my amazing knowledge and wit. OK, you can all stop laughing now. I am a noob, so what? I still love to answer questions ;) and not actually knowing the answer has never a stop for me. Imagination is more important than knowledge!

I spend now about 4 days on stackoverflow.com and I must say that I’m impressed. First of all, they do offer innovation in how the content is gathered and presented. The hierarchical forums model was obviously obsolete and it was showing its age: new users have a hard time figuring out which forum is the right one, monitoring the questions is difficult as the volume increases, there are often questions that obviously span multiple topics and picking the one forum for it severely restricts the exposure of the said question, and implicitly the quality of responses. Instead stackoverflow.com goes for a tags based model. When you ask a question you choose tags relevant for it and you can mix and match tags as diverse as linq, objective-c and php in one single question.

Now tags based contents isn’t exactly new, but the way is executed on stackoverflow.com takes it to a new level. Of course they offer tags based browsing of topics. But they also keep track of the tags you must often interact with (ask or answer). You can browse tags within current tags to get the questions that cover multiple tags of your interest. And the tags system is completely open, anyone can create new tags and they even have awards for successful tags.

The next innovation idea I like is how they try to blend the line between wiki and forums. Topics that prove to be popular and have a good answer can be promoted to wiki entries. This makes the entry serve the same reference role the ‘sticky’ posts serve in forums, but with better functionality. Also rooted in wikis (and craigslist too) is the idea of member provided social policing for the content: answers get voted up or down by community members. Not only that, but questions also get to be voted up or down, which is something I have not seen elsewhere. And ultimately questions can be closed, responses deleted. How is this different from the forum administrators? These people are not administrators, are just ordinary community members. You gain reputation, you earn privileges.

The reputation system is not new, by now almost any community forum has a reputation points system in place. But with stackoverflow.com they added also a system of badges that I feel comes straight from the video games world of achievements and vanity awards : you get bronze, silver or gold badges for achieving tasks in the stackoverflow.com ecosystem. You get your Teacher badge for answering a question and receiving an Up vote, you get the Student badge for answering a question that receives an up vote, or even a gold badge of Great Answer if it gets voted up 100 times. Now these are, of course, vanity awards. We all know though how efficient they are in keeping users hooked in! Me, I’m eager to get my Critic badge…

The only serious thing missing is the RSS syndication of views, but I hear is in the plans.

But most impressive is the quality of execution on these good ideas. The site is fast and responsive. It provides suggestion to similar questions as you type yours. It provides fast navigation to your questions and answers. Visual notifications for changes since your previous check. Suggestions for related topics (ie. common tags). Is true that I don’t know how many users it carries. Judging from the ~30K Teacher badges, I’d guess some 50k users registered and active, as a conservative estimate.

Is also nice to see such an effort started from a grassroots movement, and not from the political sponsorship of an industry player. Today’s developer has to deal in the course of a single day with an NSConnection question, related to an issue of Appache .htaccess mod_rewrite and PHP cookie handling and resulting in a SQL Server access problem. A site like the Social on MSDN would not happily sponsor and encourage such questions, nor would it nurture and grow the community leaders that can answer such end-to-end and cross platform questions.

Read/Write deadlock

May 16th, 2009

How does a simple SELECT deadlock with an UPDATE? Surprisingly, they can deadlock even on well tuned systems that does not do spurious table scans. The answer is very simple: when the read and the write use two distinct access paths to reach the same key and they use them in reverse order. Lets consider a simple example: we have a table with a clustered index and an non-clustered index. The reader (T1) seeks a key in the non-clustered index and then needs to look up the clustered index to retrieve an additional column required by the SELECT projection list. The writer (T2) is updating the clustered index and then needs to perform an index maintenance operation on the non-clustered index. So T1 holds an S lock for the key K on the non-clustered index and wants an S lock on the same key K on the clustered index. T2 has an X lock on the key K on the clustered index and wants an X lock on same key K on the non-clustered index. Deadlock, T1 will be chosen as a victim. So you see, there are no complex queries involved, no suboptimal scan operations, no lock escalation nor page locks involved. Simple, correctly written queries may deadlock if doing read/write operations on the same key on a table with two indexes. Lets show this in an example:

Read the rest of this entry »

Version Control and your Database

May 15th, 2009

I am still amazed when I walk into a development shop and I ask for their application database script and they offer to extract one for me. Really, your only definition of the database is the database itself? Now you wouldn’t keep your libraries as object code only and reverse engineer them every time you want to make a change, would you?

Now, all sarcasm aside, why is so hard to keep a database definition as source and keep it under version control? The reason is not that people are dumb, these are bright developers and they would do the right thing if it would fit into their natural work flow. The problem is that the tool set at their disposal as developers (usually the Visual Studio suite) is far far behind the capabilities of the database administration tool set (the SSMS). But the later is focused for the needs of administrators and the natural flow of actions is to visually modify some schema properties (add tables, define indexes etc) in a dialog and then click the ‘Do it!’ button. This hides actual scripts going on behind the scenes and does not lend itself naturally to the normal code/build/run/test/commit cycle of the developer desk.

Read the rest of this entry »