Network Device Interferes with Always On Connections - microsoft/CSS_SQL_Networking_Tools GitHub Wiki

Network Device Interferes with Always-On Connections

The Players

| IP Address | Computer Role | Listener Role |
|---|---|---|
| 172.26.25.102 | Client in Datacenter A | |
| 172.26.26.71 | Always-On Listener IP Address in Data Center A | Primary |
| 172.26.121.194 | Always-On Listener IP Address in Data Center B | Secondary |
| 172.26.6.54 | Another Always-On Listener IP Address in Data Center B | Secondary |

Symptom

Several SSIS jobs copied data to various Always-On clusters by connecting to the Listener name. The primary and secondary servers were in separate data centers on separate subnets, so each Listener name had 2 IP addresses associated with it. The connection string used the MultiSubnetFailover=true keyword so that the driver attempted each IP address in parallel, which optimizes the connection speed.
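The parallel connection attempt can be pictured with a minimal Python sketch. This is not the driver's actual code: `try_connect` and `race_connect` are hypothetical names, and unlike a real driver the sketch waits for the slower attempts to finish before returning. It only illustrates that whichever address completes the TCP handshake first is treated as the active one.

```python
import socket
from concurrent.futures import ThreadPoolExecutor, as_completed


def try_connect(address, timeout):
    """Attempt one TCP connection; return the connected socket or raise OSError."""
    return socket.create_connection(address, timeout=timeout)


def race_connect(addresses, timeout=5.0):
    """Connect to every (host, port) in parallel and keep the first socket
    whose TCP handshake completes. Returns (address, socket), or raises
    OSError if no address accepts the connection."""
    winner = None
    with ThreadPoolExecutor(max_workers=len(addresses)) as pool:
        futures = {pool.submit(try_connect, a, timeout): a for a in addresses}
        for future in as_completed(futures):
            try:
                sock = future.result()
            except OSError:
                continue  # refused, unreachable, or timed out
            if winner is None:
                winner = (futures[future], sock)  # first completed handshake wins
            else:
                sock.close()  # a later finisher lost the race
    if winner is None:
        raise OSError("no address completed the TCP handshake")
    return winner
```

Nothing in this race checks who answered: any device that completes the handshake on behalf of an inactive address can win.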

After one weekend, the jobs started failing about 50% of the time with the following error message:

Client unable to establish connection because an error was encountered during handshakes before login. Common causes include client attempting to connect to an unsupported version of SQL Server, server too busy to accept new connections or a resource limitation (memory or maximum allowed connections) on the server.

Restarting the job would normally allow the job to complete.

As a temporary workaround, the Connection Managers in the jobs were configured to connect directly to the Primary node computer name rather than the Listener name. This stabilized the jobs but there would be issues if a cluster needed to be failed over.

Data Collection

Several network traces were taken but the failure was not readily apparent.

A driver BID Trace was collected to see what decisions the driver made during the failure.

BID Trace Analysis

[Figure: BID Trace]

In the BID trace, we can see the TcpConnection::FInit and TcpConnection::FInitForAsync API calls to connect to both the Secondary Listener IP Address (172.26.121.194 and ID 543# (yellow)) and the Primary Listener IP address (172.26.25.71 and ID 544# (green)).

Note: The order of IP addresses is random and depends on the order of IP addresses returned by the DNS API call to resolve the Listener name.

At TcpConnection::CheckCompletedAsyncConnect (blue), we see that ID 543# (yellow), the secondary IP address, "won" the race, and the driver moves forward with logging in to this IP address.

This is an unexpected finding. Only the Listener IP address for the Primary node should be tied to a MAC address (network card). The other IP address(es) should be floating, i.e. they are not tied to a MAC address and therefore cannot respond to the connection request. SYN packets to the secondary should not be responded to. But they were.

TELNET Test

Since the SYN packets were answered by an unknown device rather than by a SQL Server, we decided to eliminate SQL Server from the test entirely and settled on TELNET as the client.

We were able to reproduce the issue with TELNET. In this case, we used another secondary Listener IP address, 172.26.6.54.

The correct behavior should be that TELNET cannot reach the destination IP address.
We did see this for about 1-3 minutes.
Then, TELNET was able to connect. This is the undesired behavior.
We saw that for about 1-3 minutes.
And then the behavior changed and flipped between the two behaviors every couple of minutes.
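The same check can be scripted so that the flipping is logged rather than observed by hand. The following is a minimal sketch, not the tool used in this case; `can_connect`, `count_flips`, and `monitor` are hypothetical helper names, and the probe interval and timeout are arbitrary choices.

```python
import socket
import time


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP handshake with host:port completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def count_flips(results):
    """Count how many times consecutive probe results differ."""
    return sum(1 for prev, cur in zip(results, results[1:]) if prev != cur)


def monitor(host, port, probes, interval=10.0, timeout=3.0):
    """Probe host:port repeatedly, log each outcome, and return the results."""
    results = []
    for i in range(probes):
        ok = can_connect(host, port, timeout)
        state = "CONNECTED" if ok else "unreachable"
        print(f"{time.strftime('%H:%M:%S')} {host}:{port} {state}")
        results.append(ok)
        if i < probes - 1:
            time.sleep(interval)
    return results
```

For example, `monitor("172.26.6.54", 1433, probes=60)` would probe the secondary Listener address for about ten minutes; port 1433 is an assumption, since the port used in the test is not stated here. For an inactive Listener IP address, every probe should report unreachable, so a non-zero `count_flips` result points to something else answering.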

We took a network trace.

NETMON Trace of TELNET Traffic

At the top of the trace, you see the expected behavior: a SYN packet is sent, retransmitted after 3 seconds, and retransmitted again after 6 more seconds before the connection attempt is abandoned. You can see this in frames 961, 1468, and 2593, in the next group of 3 packets, and so on. This behavior is what keeps the driver from choosing the secondary IP address for the connection.
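The 3-second and 6-second gaps are consistent with a retransmission timeout that doubles after each unanswered SYN. As a small worked example, assuming a 3-second initial timeout and two retransmissions (`syn_send_times` is a hypothetical helper, not part of any tool used here):

```python
def syn_send_times(initial_rto=3.0, retransmits=2):
    """Return the offsets (in seconds) at which the SYN and its
    retransmissions are sent, assuming the timeout doubles after
    each unanswered attempt."""
    times = [0.0]
    rto = initial_rto
    for _ in range(retransmits):
        times.append(times[-1] + rto)  # next SYN goes out after the current timeout
        rto *= 2                       # exponential backoff
    return times


print(syn_send_times())  # [0.0, 3.0, 9.0]
```

The three offsets correspond to the groups of three SYN frames seen in the trace.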

Starting at frame 45893 is the undesirable behavior. TELNET sends a SYN packet, and in frame 45894 some device responds with a SYN+ACK packet; in the next packet, the TCP 3-way handshake completes. This is what caused the bad behavior seen in the BID trace: the inactive IP address somehow manages to complete the TCP 3-way handshake, and the driver incorrectly concludes that this is the active IP address.

A second network trace, taken in the secondary's subnet, shows the 3 SYN packets arriving from the client when the desired behavior is happening. These packets have the same IPv4 ID values as in the trace above. When the undesirable behavior is happening, we also see 3 SYN packets, but they have different IPv4 IDs, and their Time To Live value is 5 lower than in the desirable case, indicating the packets originated somewhere in the middle of the network.
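This comparison of two capture points can be expressed as a short sketch. The `Syn` record, the `classify_arrivals` function, and all packet values below are hypothetical illustrations rather than data from the traces; the logic simply flags any SYN whose IPv4 ID was never sent by the client or whose TTL drop does not match the normal hop count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Syn:
    ip_id: int  # IPv4 Identification field
    ttl: int    # Time To Live as seen at the capture point


def classify_arrivals(sent, received, expected_hops):
    """Label each SYN captured in the destination subnet.

    sent: SYNs captured at the client.
    received: SYNs captured in the destination subnet.
    expected_hops: TTL decrease observed when packets are forwarded normally.
    """
    sent_by_id = {s.ip_id: s for s in sent}
    labels = []
    for r in received:
        origin = sent_by_id.get(r.ip_id)
        if origin is not None and origin.ttl - r.ttl == expected_hops:
            labels.append("forwarded unchanged")
        else:
            labels.append("regenerated in transit")
    return labels


# Hypothetical values for illustration only.
sent = [Syn(100, 128), Syn(101, 128), Syn(102, 128)]
good = [Syn(100, 124), Syn(101, 124), Syn(102, 124)]      # same IDs, normal TTL
bad = [Syn(9001, 119), Syn(9002, 119), Syn(9003, 119)]    # new IDs, TTL 5 lower

print(classify_arrivals(sent, good, expected_hops=4))
print(classify_arrivals(sent, bad, expected_hops=4))
```

Matching on the IPv4 ID works here because routers normally forward a packet without changing that field, so a different ID implies a different sender.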

Hypothesis

In a further test, we used TELNET from a client in the secondary's subnet, and it never experienced the bad behavior. From the three tests, we concluded the following was going on:

[Figure: Traffic Diagram]

Resolution

The hypothesized device was identified as an F5 BIG-IP device based on TRACERT results. An F5 technician determined that SYN Cookie Protection was enabled and would cause the behavior that was seen.

Article: K14805 - Configuring SYN cookie protection (11.4.x - 12.x) (f5.com)

The BIG-IP SYN cookie feature protects the system against SYN flood attacks. The use of SYN cookies allows the BIG-IP system to maintain connections when the SYN queue begins to fill up during an attack. When the SYN Check Activation Threshold value is reached, the BIG-IP system responds to SYN requests by sending the SYN-ACK response that contains an encoded secret back to the client. The system then discards the SYN queue entry and waits for the ACK from the client. Certain BIG-IP platforms can perform hardware SYN cookie protection, while other platforms perform software-only SYN cookie protection.

How the network device interacts with the MultiSubnetFailover traffic

With the SYN Cookie Protection enabled, the network device sees the SYN packets to the secondary IP address and they do not get an ACK from the server. This can look like a SYN flood denial of service attack. The network device tracks how much activity there is, and after a while, it decides to respond to the packets in order to short-circuit the "attack". This behavior confuses the driver and it assumes this response is a legitimate response from the primary server.

After a while, the network device resets the counter and then the correct behavior happens, but the counter will still track the traffic and cause the problem all over again.
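The alternating behavior can be illustrated with a toy model. This is not F5's algorithm; `ToySynGuard`, its threshold, and its reset interval are invented for illustration and only mimic the pattern described above: unanswered SYNs accumulate, the device starts answering them itself, and a periodic counter reset briefly restores the correct behavior.

```python
class ToySynGuard:
    """Toy model: answer SYNs on the server's behalf once too many go unanswered."""

    def __init__(self, threshold, reset_after):
        self.threshold = threshold      # unanswered SYNs tolerated per window
        self.reset_after = reset_after  # seconds before the counter is cleared
        self.unanswered = 0
        self.window_start = 0.0

    def on_syn(self, now, server_answers):
        """Return who answers a SYN seen at time `now`: 'server', 'device', or None."""
        if now - self.window_start >= self.reset_after:
            self.unanswered = 0         # counter reset: correct behavior returns
            self.window_start = now
        if server_answers:
            return "server"
        self.unanswered += 1
        if self.unanswered > self.threshold:
            return "device"             # device sends its own SYN-ACK
        return None                     # SYN goes unanswered, as it should


# A client sends a SYN to the inactive Listener IP address every 10 seconds.
guard = ToySynGuard(threshold=3, reset_after=120)
answers = [guard.on_syn(t, server_answers=False) for t in range(0, 240, 10)]
print(answers[:5])  # [None, None, None, 'device', 'device']
```

In this model the first three SYNs of each 120-second window go unanswered and the rest are answered by the device, which reproduces the kind of flipping seen in the TELNET test.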

SYN attack protection should be enabled on edge devices that admit external traffic, not on internal devices.

After disabling this feature, the issue did not recur.