Fix slow application startup due to code sign validation

July 24th, 2009

Sometimes you are faced with applications that seem to take ages to start up. Usually they freeze for about 30-40 seconds and then all of the sudden they come to live. This happens for both native and managed application and it sometimes manifest as an IIS/ASP/ASP.Net AppPool starting up slow on the first request. The very first thing I always suspect is code signing verification. When a signed module is checked the certificate verification engine may consider that the Certificate Revocation List (CRL) it posses is obsolete and attempt to download a new one. For this it connects to the internet. The problem occurs when the connectivity is either slow, or blocked for some reason. By default the verification engine will time out after 15 seconds and resume with the old, obsolete, CRL it has. The timeout can occur several times, adding up to start up times of even minutes. This occurs completely outside of the control of the application being started, its modules are not even properly wired up in memory so there is no question of application code yet running.

The information on this subject is scarce to say the least. Luckily there is an TechNet article that describes not only the process occuring, but also the controlling parameters: Certificate Revocation and Status Checking. To fix the problem on computers with poor internet conectivity, registry settings have to be modified in the HKLM\SOFTWARE\Microsoft\Cryptography\OID\EncodingType 0\CertDllCreateCertificateChainEngine\Config key:

ChainUrlRetrievalTimeoutMilliseconds
This is each individual CRL check call timeout. If is 0 or not present the default value of 15 seconds is used. Change this timeout to a reasonable value like 200 milliseconds.
ChainRevAccumulativeUrlRetrievalTimeoutMilliseconds
This is the aggregate CRL retrieval timeout. If set to 0 or not present the default value of 20 seconds is used. Change this timeout to a value like 500 milliseconds.

With these two changes the code signing verification engine will timeout the CRL refresh operation in 500 milliseconds. If the connectivity to the certificate authority site is bad, this will dramatically increase the application start up times for code signed applications.

%%lockres%% collision probability magic marker: 16,777,215

May 29th, 2009

@jrowlandjones blogged about a dubious deadlock case. I recommend this article as is correct and presents a somewhat esoteric case of deadlock: the resource hash collision. The lock manager in SQL Server doesn’t know what it locks, it just locks ‘resources’ (basically strings). It is the job of higher level components like the the access methods of the storage engine to present the ‘resource’ to the lock manager and ask for the desired lock. When locking rows in a heap or a b-tree the storage engine will synthesize a ‘resource’ from the record identifier. Since these resources have a limited length, the storage engine has to reduce the effective length of a key to the maximum length is allowed to present to the lock manager, and that means that the record’s key will be reduced to 6 bytes. This is achieved by hashing the key into a 6 byte hash value. Nothing spectacular here.

But if you have a table with a key of length 50 bytes and its reduced to 6 bytes, you may hit a collision. So how likely is this to happen?

On 6 bytes there are 281,474,976,710,656 distinct possible values. Its a pretty big number? Not that big actually. If we meet at a party and I say ‘I bet somebody in the room shares your birthday’ would you bet against me? You probably should 🙂 What if I change my question to ‘I bet there are two people in this room that share the birthday!’? Now I will probably take your money. This is called a ‘meet-in-the-middle‘ attack in cryptography and basically it says that you get a 50% collision probability at half the hash length. So the SQL %%lockres%% hash will produce two records with same hash, with a 50% probability, out of the table, any table, of only 16,777,215 record. That suddenly doesn’t look like a cosmic constant, does it? And keep in mind that this is the absolutely maximum value, where the key has a perfectly random distribution. In reality keys are not that random after all. Take Jame’s example: a datetime (8 bytes), a country code (1 byte), a group code (2 bytes) and a random id (4 bytes). From these 15 bytes quite a few are actually constant: eg. every date between 2000 and 2010 has the first 4 bytes identical (0x00) and the 5th byte only has two possible values (0x08 or 0x09). If from the other codes (country, group) we use only 50% of the possible values, then in effect we use, generously, just 10 bytes of the 15 bytes of the key. This means the table has a 50% collision probability at only about 11 million records. considering that he was doing a ‘paltry’ 900 million records upload, no wonder he got collisions…

Have you met  ? Say hello to my BOM

May 21st, 2009

I recently had to look at a problem where a programmer writing a tool to parse some XML produced by a service written by me was complaining that my service serves invalid XML. All my documents seemed mangled and started with . I couldn’t help but smile. So what is this mysterious sequence ? Well, lets check the ISO-8859-1 character encoding, commonly known as ‘Latin alphabet’ and often confused with the Windows code page 1252: ï is the character corresponding to 0xEF (decimal 239), » is 0xBB (decimal 187) and ¿ is 0xBF (decimal 191). So the mysterious sequence is 0xEFBBBF. Does it look familiar now? It should, this is the UTF-8 Byte Order Mark. Moral: if you consume and parse XML, make sure you consume it as XML, not as text. All XML libraries I know of correctly understand and parse the BOM. The only problems I’ve seen are from hand written ‘parsers’ that treat XML as a string (and most often fail to accommodate namespaces too…).

Read/Write deadlock

May 16th, 2009

How does a simple SELECT deadlock with an UPDATE? Surprisingly, they can deadlock even on well tuned systems that does not do spurious table scans. The answer is very simple: when the read and the write use two distinct access paths to reach the same key and they use them in reverse order. Lets consider a simple example: we have a table with a clustered index and an non-clustered index. The reader (T1) seeks a key in the non-clustered index and then needs to look up the clustered index to retrieve an additional column required by the SELECT projection list. The writer (T2) is updating the clustered index and then needs to perform an index maintenance operation on the non-clustered index. So T1 holds an S lock for the key K on the non-clustered index and wants an S lock on the same key K on the clustered index. T2 has an X lock on the key K on the clustered index and wants an X lock on same key K on the non-clustered index. Deadlock, T1 will be chosen as a victim. So you see, there are no complex queries involved, no suboptimal scan operations, no lock escalation nor page locks involved. Simple, correctly written queries may deadlock if doing read/write operations on the same key on a table with two indexes. Lets show this in an example:

Read the rest of this entry »

CLR Memory Leak

January 19th, 2009

On a recent client engagement I had to investigate what appeared a to be a memory leak in a managed application. The program was running for about a week now and it appeared to slowly degrade in performance over time. Although it appeared healthy at only about 245 MB of memory used, I decided to investigate. The fastest way in my opinion to track down leaks in a running environment is to attach Windbg and use SOS:

Read the rest of this entry »

Replacing Service Certificates that are near expiration

November 26th, 2008

Service Broker services use certificates for authenticating message origin and for encrypting messages. I have explained in detail how this authentication works in my earlier post Conversation Authentication. The certificates used for service authentication are most times self-signed certificates created directly by SQL Server using CREATE CERTIFICATE and they expire by default one year after creation. When these certificates expire they have to be replaced and this article is goal is to help you do thisreplacement with no impact on production systems.

Identifying the certificates used by services

Here is a recap of the criteria Service Broker uses to pick a certificate to represent an service identity:

Read the rest of this entry »

Event ID 833: I/O requests taking longer than 15 seconds

October 28th, 2008

The error 833 is usually associated with hardware or system driver problems and the typical recommendation is to replace the hardware or update the drivers and firmware used. However there is a common scenario that leads to this problem when your hardware is fine and sound.

Read the rest of this entry »

Certificate Not Yet Valid

August 25th, 2008

So you are running into a blocking issue, try to find a solution and you gets back a reply like ‘Sorry, I cannot repro your problem’. You go back and try again, and sure the problem is still there. You dig deeper, go to the Advanced search options on Google, but still cannot find the answer. Desperately, you try again and, surprise, the problem has vanished. Sounds familiar?

I know of one such problem that appears again and again when discussing Service Broker, Database Mirroring and, to a lesser extent, SQL Server 2005 Cryptographic facilities. I want to show you what the problem is, what’s causing it and even drill a bit into how something so basic has slipped into the product.

Read the rest of this entry »

Error Handling and Activation

August 13th, 2008

I have previously talked here about the queue monitors and the role they play in launching activated procedures. If you recall, I’ve mentioned that the Queue Monitors will enter the NOTIFIED state after they launch a stored procedure and not launch again the procedure until the Queue Monitor notices that RECEIVE statements are being issued against the queue. In an older post I have also talked about how difficult is to get error handling right, and in particular cast and convention errors. This may seem a trivial problem but in the Service Broker programs it is actually a serious problem because of the frequent conversation to and from XML. These two separate issues can actually ‘cooperate’ into a somehow surprising behavior. Namely if the activated procedure hits an error before it completes the RECEIVE statement, the Queue Monitor will stay in NOTIFIED state and won’t activate again the procedure. Although this looks similar to the typical poison message behavior when the queue is automatically disabled, this is a different issue. And unlike the poison message case, in the case when the Queue Monitor is stranded in NOTIFIED state you can not get a notification that the queue is no longer functional.

This post continues with an example showing how a relatively safe activated procedure can end up in this situation.

Read the rest of this entry »

Understanding Queue Monitors

August 3rd, 2008

A major class of Service Broker applications have nothing to do with the distributed application capabilities SSB was designed for. Instead they leverage SSB’s capability to run a stored procedure outside the context of the user connection. This capability enables to schedule execution and don’t wait for it to finish, or to schedule multiple procedures to execute in paralel. Often the developers that use these capabilities don’t care for the reliable message delivery SSB offers and many see the complex infrastructure of message types, contracts and services as a mere hurdle. Nonetheless sooner or later they have to troubleshoot an issue and then is when many of you find my blog and my articles. Troubleshooting activated procedure is not difficult, but one simply has to know where to look and what to look for.

In this entry I want to shed some light in the soul that drives the activation process: the Queue Monitors.

Read the rest of this entry »