Services running under domain account should add dnscache service as dependencyAugust 27th, 2013
Recently I had to investigate a deadlocked Windows Server 2012 machine. Any attempt to start or stop a service on this machine would freeze in an infinite wait. I did not know about the “Analyze Wait Chain” feature in the Task Manager (new since Windows 7), but turns out is quite a life saver. This feature uses the Wait Chain Traversal debugging API and is, on the record, Matt Pietrek’s favourite Windows 7 feature. Simply using the Task Manager I was able to see than many programs were waiting on the “Services and Controller app” service, which is the SCM (the Service Control Manager):
The services.exe with PID 588 is the SCM “Services and Controller app” service. All but one blocked threads in the SCM were waiting on the thread 1072 and thread 1072 was waiting on the LSA service. I took a dump of the PID 588 process and these blocked threads have this short stack:
ntdll!ZwWaitForSingleObject ntdll!RtlpWaitOnCriticalSection ntdll!RtlpEnterCriticalSectionContended services!ScStartServiceAndDependencies services!RStartServiceW rpcrt4!Invoke ...
I'm not a Windows insider, but just by making an educated guess on the function names I can tell that this is a thread servicing an RPC request to start a service, and waiting for a Criticial Section. Knowing from the Task Manager wait chain that the blocking thread is thread 1072 (we could use other means to find the current Critical Section owner, but why bother?). Thread id 1072 stack shows that is waiting on LSA to logon a user:
... rpcrt4!NdrClientCall3 sspicli!SspirLogonUser sspicli!SspipLogonUser sspicli!LsaLogonUser sspicli!L32pLogonUser sspicli!LogonUserExExW services!ScLogonService services!ScLogonAndStartImage services!ScStartService services!ScStartMarkedServices services!ScStartServiceAndDependencies services!RStartServiceW rpcrt4!Invoke ...
Taking a dump on the lsass.exe process shows that there is only one thread active:
... rpcrt4!NdrClientCall2 sechost!RStartServiceW sechost!StartServiceW dnsapi!StartDnsServiceOnDemand dnsapi!Query_PrivateExW dnsapi!Query_Shim dnsapi!DnsQuery_UTF8 netlogon!NetpSrvOpen netlogon!NetpDcGetDcNext netlogon!NetpDcGetNameSiteIp netlogon!NetpDcGetNameIp netlogon!NetpDcGetName netlogon!DsIGetDcName netlogon!DsrGetDcNameEx2 kerberos!KerbGetKdcBinding kerberos!KerbMakeSocketCallEx kerberos!KerbMakeSocketCall kerberos!KerbGetAuthenticationTicketEx kerberos!KerbGetTicketGrantingTicket kerberos!LsaApLogonUserEx2 lsasrv!NegLogonUserEx2Worker lsasrv!NegLogonUserEx2 lsasrv!LsapCallAuthPackageForLogon lsasrv!LsapAuApiDispatchLogonUser lsasrv!SspiExLogonUser sspisrv!SspirLogonUser rpcrt4!Invoke ...
So there it is, the LSA is attempting to start the DNS Client service (aka. dnscache) in order to respond to a request to logon an user. This call goes back into SCM and blocks on the Critical Section held by thread 1072. This is a circular wait list, hence a deadlock. Our server was stuck in this state for several days before I did this investigation.
From analyzing the problem is clear that any service that runs under a domain account can run into this problem during machine startup. The domain account is required because is the fact that is a domain account that triggers the LSA to query the DNS to find the domain controller for the service account domain.
I haven't seen this problem before, so is probably quite rare. Several ducks have to align for this to happen. However, if it happens it is difficult to investigate, it does not time out and heal itself, and it mysteriously vanishes after a reboot. My recommendation is to make any service running as a domain account explicitly dependent on the dnscache service. This way the SCM will ensure the dnscache service is started before asking the LSA to create a logon token for the service attempting to start, thus eliminating the possibility of a deadlock. The downside is that stopping the dnscache service will stop your service too, which may be undesirable in some situations.