Services running under domain account should add dnscache service as dependency
August 27th, 2013
Recently I had to investigate a deadlocked Windows Server 2012 machine. Any attempt to start or stop a service on this machine would freeze in an infinite wait. I did not know about the “Analyze Wait Chain” feature in the Task Manager (new since Windows 7), but it turns out to be quite a life saver. This feature uses the Wait Chain Traversal debugging API and is, on the record, Matt Pietrek’s favourite Windows 7 feature. Simply using the Task Manager I was able to see that many programs were waiting on the “Services and Controller app” service, which is the SCM (the Service Control Manager).
The services.exe process with PID 588 is the SCM, the “Services and Controller app” service. All but one of the blocked threads in the SCM were waiting on thread 1072, and thread 1072 was waiting on the LSA service. I took a dump of the PID 588 process, and these blocked threads have this short stack:
ntdll!ZwWaitForSingleObject
ntdll!RtlpWaitOnCriticalSection
ntdll!RtlpEnterCriticalSectionContended
services!ScStartServiceAndDependencies
services!RStartServiceW
rpcrt4!Invoke
...
I'm not a Windows insider, but just by making an educated guess from the function names I can tell that this is a thread servicing an RPC request to start a service, and waiting for a Critical Section. We know from the Task Manager wait chain that the blocking thread is thread 1072 (we could use other means to find the current Critical Section owner, but why bother?). The stack of thread 1072 shows that it is waiting on the LSA to log on a user:
...
rpcrt4!NdrClientCall3
sspicli!SspirLogonUser
sspicli!SspipLogonUser
sspicli!LsaLogonUser
sspicli!L32pLogonUser
sspicli!LogonUserExExW
services!ScLogonService
services!ScLogonAndStartImage
services!ScStartService
services!ScStartMarkedServices
services!ScStartServiceAndDependencies
services!RStartServiceW
rpcrt4!Invoke
...
Taking a dump of the lsass.exe process shows that there is only one active thread:
...
rpcrt4!NdrClientCall2
sechost!RStartServiceW
sechost!StartServiceW
dnsapi!StartDnsServiceOnDemand
dnsapi!Query_PrivateExW
dnsapi!Query_Shim
dnsapi!DnsQuery_UTF8
netlogon!NetpSrvOpen
netlogon!NetpDcGetDcNext
netlogon!NetpDcGetNameSiteIp
netlogon!NetpDcGetNameIp
netlogon!NetpDcGetName
netlogon!DsIGetDcName
netlogon!DsrGetDcNameEx2
kerberos!KerbGetKdcBinding
kerberos!KerbMakeSocketCallEx
kerberos!KerbMakeSocketCall
kerberos!KerbGetAuthenticationTicketEx
kerberos!KerbGetTicketGrantingTicket
kerberos!LsaApLogonUserEx2
lsasrv!NegLogonUserEx2Worker
lsasrv!NegLogonUserEx2
lsasrv!LsapCallAuthPackageForLogon
lsasrv!LsapAuApiDispatchLogonUser
lsasrv!SspiExLogonUser
sspisrv!SspirLogonUser
rpcrt4!Invoke
...
So there it is: the LSA is attempting to start the DNS Client service (a.k.a. dnscache) in order to respond to a request to log on a user. This call goes back into the SCM and blocks on the Critical Section held by thread 1072. Thread 1072 in turn is waiting on the LSA, so the wait chain is circular, hence a deadlock. Our server was stuck in this state for several days before I did this investigation.
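To make the circular wait concrete, here is a minimal sketch that models the three parties as a wait-for graph and walks the chain until it loops back. The node labels are illustrative names I made up for this sketch (only the thread id 1072 comes from the dumps), and this is a toy model, not how the Wait Chain Traversal API is implemented.

```python
# Wait-for graph: each blocked party maps to the party it is waiting on.
# Labels are hypothetical; only "1072" is taken from the dumps above.
WAITS_ON = {
    "scm_thread_1072": "lsa_logon_thread",   # holds the critical section, waits on LSA logon
    "lsa_logon_thread": "scm_rpc_thread",    # asks the SCM to start dnscache
    "scm_rpc_thread": "scm_thread_1072",     # blocked on the critical section held by 1072
}


def find_wait_cycle(waits_on, start):
    """Follow the wait chain from start; return the cycle as a list, or None."""
    seen = []
    node = start
    while node is not None and node not in seen:
        seen.append(node)
        node = waits_on.get(node)
    if node is None:
        return None  # the chain ends at something that is not blocked
    return seen[seen.index(node):]


cycle = find_wait_cycle(WAITS_ON, "scm_thread_1072")
print(" -> ".join(cycle + [cycle[0]]))
```

Every node in the returned cycle is waiting on the next one, so none of them can ever make progress, which matches what the machine was doing.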
From this analysis it is clear that any service that runs under a domain account can run into this problem during machine startup. The domain account is the required ingredient: it is the fact that the account is a domain account that triggers the LSA to query DNS to find the domain controller for the service account's domain.
I haven't seen this problem before, so it is probably quite rare. Several ducks have to align for this to happen. However, if it happens it is difficult to investigate, it does not time out and heal itself, and it mysteriously vanishes after a reboot. My recommendation is to make any service running as a domain account explicitly dependent on the dnscache service. This way the SCM will ensure the dnscache service is started before asking the LSA to create a logon token for the service attempting to start, thus eliminating the possibility of a deadlock. The downside is that stopping the dnscache service will stop your service too, which may be undesirable in some situations.
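One way to add the dependency is with sc.exe config and its depend= option. As far as I know, depend= replaces the whole dependency list rather than appending to it, so any existing dependencies have to be passed again, separated by forward slashes. The sketch below only builds the argument list. The service name MyService, the existing Tcpip dependency, and the helper function itself are hypothetical and stand in for your own values.

```python
def build_sc_depend_args(service_name, existing_deps, required="Dnscache"):
    """Build the sc.exe argument list that makes service_name depend on required.

    existing_deps is the service's current dependency list; it is kept
    because 'depend=' overwrites the list instead of appending to it.
    """
    deps = list(existing_deps)
    # Service names are case-insensitive, so avoid adding a duplicate entry.
    if not any(d.lower() == required.lower() for d in deps):
        deps.append(required)
    # 'depend=' and its value are passed as two separate arguments.
    return ["sc.exe", "config", service_name, "depend=", "/".join(deps)]


# Hypothetical service that already depends on Tcpip.
args = build_sc_depend_args("MyService", ["Tcpip"])
print(" ".join(args))
```

On the Windows machine itself the list could be handed to subprocess.run(args, check=True) from an elevated prompt. You can inspect the current dependencies beforehand with sc.exe qc MyService.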