0500 Intermittent or Periodic Authentication Issue - microsoft/CSS_SQL_Networking_Tools GitHub Wiki

0500 Intermittent or Periodic Authentication Issue

0500.1 Typical Error Messages

  • Cannot generate SSPI context
  • Login failed for user '(null)'
  • Login failed for user ''
  • Login failed for user 'NT AUTHORITY\ANONYMOUS LOGON'
  • Login failed for user 'JohnDoe'
  • Login failed for user 'Contoso\JohnDoe'
  • Login failed. The login is from an untrusted domain and cannot be used with Windows authentication.

In the context of the Workflow, the word "Client" refers to the immediate client to SQL Server, e.g. in a 3-tier application, the client could be a web server.

0500.2 Moving Parts

The initial goal is to try to isolate which of the moving parts is causing the problem.

Moving Parts

0500.3 Appropriate Expectations

0500.3.1 This issue may take a while to resolve depending on the frequency that the problem occurs, whether it happens to one of many clients vs. all clients vs. application server, and whether it happens more often at set times of day, e.g. busy periods or during backups or reindexing.

0500.3.2 The most common issues are related to SQL Server performance or slow Domain Controller response. IF using NTLM, then LSASS has a bottleneck and limits how many new connections can be processed at once; additional requests get backed up and may timeout. Some causes, such as Antivirus can be difficult to prove, but are common nonetheless and should be investigated even without hard proof, if other avenues of inquiry do not show promise.

0500.4 Pre-Work

0500.4.1 Please perform the initial data collection and narrowing steps: 0100 Initial Data Collection and Scoping Questions. This will help get a macro perspective of the scope of an issue, such as whether the issue affects multiple computers or just one, or whether only those computers in a specific data center are facing issues. This can help focus the troubleshooting steps. It will also make you prepared for discussing the issue with Microsoft Support should you choose to do so.

Review the public troubleshooting documents listed in section 0015 Self-Help Articles.

0500.4.2 Make sure you understand the application architecture. Make a summary in a succinct form, similar to the below description:

  • There are two domains involved: CONTOSO and FABRIKAM.
  • The client (SPARKY.CONTOSO.COM) is Windows 2012.
  • The user (CONTOSO\JOHNDOE) runs EDGE and connects to a web server (_HTTP://WEB01.CONTOSO.COM/Accounting) using Integrated security.
  • The IIS app pool runs as (CONTOSO\WEB_SVC).
  • The web server connects to SQL Server 2014 (SQLProd01.FABRIKAM.COM\Accounting on port 1433) using the SqlClient .NET 4.6.2 Provider and delegates the user credentials to SQL Server via integrated security.
  • The SQL Server service account is FABRIKAM\SQL_SVC_01.

0500.5 Order of Troubleshooting

In general, troubleshooting should be data driven, which may give way to empirical tests in a more focused context. If the issue is very intermittent and network traces will be difficult to capture, then the empirical methods may be applied first.

Since the issue is intermittent, we can assume configuration, such as Kerberos SPNs, is basically correct.

0500.5.1 Are multiple domains or data centers involved?
If multiple domains or data centers are involved, check whether the users in the local domain/data center have a good experience while users in the other domain or data center do not. If that is the case, it could be a communication latency between data centers or between domain controllers. Use PING to check network latency. Use the RUNAS command with various users to test credential validation latency issues. These commands can eliminate SQL Server from the issue and show a more fundamental issue with the networking infrastructure or domain controller performance.

0500.5.2 SQL Server ERRORLOG The SQL Server ERRORLOG may reveal performance issues on SQL Server, such as entries indicating I/O was taking longer than 15 seconds. The SQL Performance team to have a PSSDIAG run and analyzed. You may want to do this, anyway, if the network trace reveals delays with the SQL Server responses.

The ERRORLOG may also include other domain-related errors, such as the following that indicate some sort of Active Directory performance issue:

	SSPI handshake failed with error code 0x80090311 while establishing a connection with integrated security; the connection has been closed.
	SSPI handshake failed with error code 0x80090304 while establishing a connection with integrated security; the connection has been closed.
These codes translate as follows:
	 Error -2146893039 (0x80090311): No authority could be contacted for authentication.
         Error -2146893052 (0x80090304): The Local Security Authority cannot be contacted.

This sort of issue should be troubleshot as an Active Directory issue.

0500.5.3 Examine Client System Event Log
The system event log will have various events, such as KERBEROS, LSA, and NETLOGON events, which indicate the computer was not able to connect to the domain controller for a period of time. To make them easier to find, filter on Error, Warning, and Critical events only. The event times need to be around the time of the outage. If there is a match, this would be an Active Directory issue.

In some cases, this may happen on the SQL Server. Check the logs on that machine, as well.

Source: NETLOGON
Date: 8/12/2012 8:22:16 PM
Event ID: 5719
Task Category: None
Level: Error
Keywords: Classic
User: N/A
Computer: SQLPROD01
Description:
This computer was not able to set up a secure session with a domain controller in domain CONTOSO due to the following: The remote procedure call was cancelled. This may lead to authentication problems. Make sure that this computer is connected to the network. If the problem persists, please contact your domain administrator.

0500.5.4 SQL Connectivity Ring Buffer
The ring buffer is a historical log of connection events on the SQL Server, which means it can be taken after an outage. See Collect the Connectivity Ring Buffer. Many events include login timers that tell you where the time is being spent. Time being spent on the network indicates a possible network or client latency. Time spent in SSL or SSPI APIs indicate potential issues with the Windows security subsystem. Enqueued time indicates a SQL performance issue.

0500.5.5 Collect Network Traces on Client and Server
Collect a Network Trace

0500.5.6 Run SQL Server Network Analyzer
SQL Server Network Analyzer will read your traces [into memory] and generate a report of suspected problems and a CSV file providing connection details that you can easily filter and sort in Excel. This will give you the shape of the issue, e.g. do all failures happen at the same time, in clumps, or are they randomly distributed or regularly distributed. It will also make lining up conversations in the client and server traces a lot easier.

You can also tell from the Login Progress column (NC NR flags), whether the NTLM Challenge/Response packets were used, indicating use of NTLM rather than Kerberos for the connection. You will often see this in Login Failed for 'NT AUTHORITY\ANONYMOUS LOGON' errors.

Once you have identified matched conversations in the client and server trace, open them in NETMON or Wireshark and compare how they end.

0500.5.7 Client is using NTLM logins
If the network trace is showing that the client is making all (or most) of the connections using NTLM credentials, then perform the following checks:

  1. Examine the NTLM Response packets and look at the User and Domain names. If blank, it's a Windows Security/Active Directory issue.
  2. On the SQL Server machine, check the MaxConcurrentApi registry value. Windows member servers only issue up to two concurrent NTLM authentication requests by default. Windows Domain Controllers only support one concurrent authentication request per session with a remote (user) domain controller. To increase:
	HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Netlogon\Parameters
	
	Value Name: MaxConcurrentApi
	Data Type: REG_DWORD
	Value: between 0 and 10. Windows 2008 R2 maximum value is 150
	
	Restart the NETLOGON service.

In PERFMON, you can also see this in the NetLogon counters, specifically NETLOGON Semaphore Waiters and NETLOGON Semaphore Timeouts.

Perfmon Counters

  1. If possible, set an SPN for SQL Server and move to Kerberos authentication, which does not have this bottleneck.

0500.5.8 Both the SQL trace and Client trace agree the issue is on the server
If both network traces show a delay or non-response on the server, then there are some actions we can take:

  • Check for SQL Server performance issues, such as thread starvation or blocking, etc. Long running queries and high MAXDOP settings can also cause thread starvation. The SQL ERRORLOG may show IO operations taking longer than 15 seconds. This is a good indicator of a server performance issue.

  • If the server is low on available threads, you may see a pattern of SYN packets from the client followed by an immediate ACK+RESET from the server. This sequence may happen 3 times before the client gives up.

0500.5.9 Connection Pooling
Connection Pooling (actually, the lack thereof) can be a bad offender in intermittent logion failure issues. Not only can it run the client out of out-bound ports, but it can also overload the server and cause the server to reject incoming connection requests or flood a poorly performing domain controller. The best thing to do is to have the application developer use Connection Pooling in their application. It is ON by default in .NET and in IIS applications, so it's possible it may have been turned it off for some reason. If the application uses custom pooling code, this is highly discouraged as almost all custom pooling implementations we have run into have issues; it is generally better to use the built-in mechanism.

Lack of connection pooling will show as a large number of TIME_WAIT status codes in the NETSTAT output compared to ESTABLISHED connections.

If connection pooling is being used, conversations in the network trace should be quite long. You can use the CSV file generated by SQL Server Network Analyzer to sort and filter by Protocol and Frames. For pooled connections, you often won't see the beginning or end frames if the network capture was less than half an hour. If there are many, many conversations shorter than 30 frames from SYN packet to ACK+FIN packet, this is indicative of non-pooled connections. If these are mixed in with a few longer conversations, suspect background non-pooled connections as discussed in the NETSTAT section above.

The Ephemeral Port report from SQL Network Analyzer will show the number of new connections over the lifetime of the trace. You can judge connection stress by the number of connections per second.

0500.5.10 NETSTAT
Run an Administrative command prompt and then: NETSTAT -abon > c:\ports.txt

The file will contain a list of all inbound ports and all outgoing ports, their numbers, and Process ID and names of applications owning the ports. You can use this to see the worst offenders and whether the port limit is being reached. Divide the number of lines by 2 to get an approximate first estimate of port usage.

0500.5.11 TcpTimedWaitDelay and MaxUserPort
If application is running out of outbound ports and the you cannot make any changes to the application right away, you can decrease the TcpTimedWaitDelay from 240 to as low as 30 seconds, thus allowing outbound ports to recycle faster. For windows 2003 and later, you can also increase the MaxUserPort setting. For Windows Vista and later, you set this via the NETSH command per https://support.microsoft.com/en-us/help/929851/the-default-dynamic-port-range-for-tcp-ip-has-changed-in-windows-vista. This course of action does not eliminate the inefficiencies of not pooling connections or non-pooled background connections and we highly recommend taking advantage of connection pooling where possible.

Miscellaneous

0600.6.1 More on Ephemeral Ports
Running out of ephemeral ports is a relatively common cause of intermittent connection timeouts, especially if you do not see the SYN packet go on the wire.

For incoming requests on the server, ports, such as 80 or 1433, etc., can take up to 64K incoming connections per client IP address and are generally "unlimited" for all practical purposes, though SQL Server does have an absolute upper bound of 32K incoming connections. For outbound connections on the client machine, the number of ports is limited and is shared between all server connections.

For Windows 2003 and XP, the default outbound port range is from port 1025 to 5000 (3975 ports).

For windows Vista, 2008, and later, the default range is from port 49152 to 65535 (2^16 = 16384 ports).

Normally, ports are held for 4 minutes (240 seconds) by the Windows operating system before they are recycled and allowed to be re-used by applications. The reason for this is to deliberate or accidental prevent port spoofing. Because of this delay, on Windows 2003, a client application can make just 17 connections/sec to SQL Server and the outbound port range will be exhausted in just under 4 minutes. For windows Vista, that number rises to 68 connections/sec.

For web applications, such as IIS, there could be one outgoing port to SQL Server per HTTP client if integrated security is being used in the web site. For a busy web server, running out of outbound ports is a very real possibility when the load is high.

0600.6.2 Low Kernel Memory on the SQL Server Machine In SQL Server Management Studio, check Maximum Server Memory in Server Properties pane. The default is to 2147483647MB, which means the server can starve the OS of memory. It's best to set this to about 4GB-8GB less than the physical memory on the machine; less if there are multiple instances or IIS or some other app server also runs on the machine.