Update on July 22
Stable versions that include the needed fix have been released in all supported branches (v1.16.13, v1.17.9 and v1.18.6). According to the changelog:
Fixes a problem with 63-second or 1-second connection delays with some VXLAN-based network plugins which was first widely noticed in 1.16 (though some users saw it earlier than that, possibly only with specific network plugins). If you were previously using ethtool to disable checksum offload on your primary network interface, you should now be able to stop doing that. [ref]
Update on June 17
After 3 months, the problem has been located and fixed on Kubernetes’ end. Long story short, they decided that neither the Linux kernel nor Flannel required any change; instead, a mark added by kube-proxy caused the kernel to double-NAT the packet, which was then sent with a wrong checksum. A pull request has been merged to master and will soon be backported to the release branches.
To see the whole story, check out the following links:
Recently I noticed some DNS queries in my Kubernetes cluster timing out, causing apps to crash. I looked into the issue, reported it to the kernel network team, and applied a workaround.
Symptom
My Kubernetes cluster is built with the Flannel overlay network using the vxlan backend. The idea is that each node (machine) gets a private IP subnet from which it allocates addresses to pods. When a cross-node packet is sent, it goes to the vxlan virtual interface, is encapsulated in a UDP packet (regardless of the original protocol) and routed to the other node, where another vxlan interface receives it and extracts the inner packet.*
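As a side note, you can see the encapsulation parameters on a node with ip -d link. The VNI and UDP port below match what shows up in the capture later; your values may differ:

ip -d link show flannel.1
# look for the vxlan line, e.g.:
#   vxlan id 1 local 192.3.59.220 dev eth0 srcport 0 0 dstport 8472 ...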
Kubernetes clusters provide the Service resource. One of the several types of Service lets you use a single virtual IP to represent multiple pods, sometimes across nodes. This is implemented by the kube-proxy component, which in my cluster utilizes the IPVS feature in the Linux kernel.
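In IPVS mode you can see the virtual-IP-to-pod mapping directly with ipvsadm. A sketch of what this looks like for the DNS service (every address except the pod IP, which appears in the capture later, is made up for illustration):

sudo ipvsadm -Ln
# Prot LocalAddress:Port Scheduler Flags
#   -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
# UDP  10.96.0.10:53 rr
#   -> 172.19.195.166:53            Masq    1      0          0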
Now, when I make a DNS query on the host (it should be the same from inside containers, but with more hops) using dig against CoreDNS’ service IP, it always times out. It works fine if I query one of the backend pods’ IPs instead.
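To reproduce it, compare the two queries below. The service IP 10.96.0.10 is a common default for the cluster DNS service and just a stand-in for my cluster’s actual value; the pod IP is the one that shows up in the capture below:

# Query via the CoreDNS service IP: times out with no response
dig @10.96.0.10 www.google.com

# Query a backend pod's IP directly: works fine
dig @172.19.195.166 www.google.com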
Diagnosis
I used tcpdump to capture the packets, and noticed that the outer (encapsulating) UDP packet had a bad UDP checksum.
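The capture ran on the node’s physical interface, filtered to UDP port 8472, which Flannel’s vxlan backend uses; a command along these lines (the interface name eth0 is a placeholder for your uplink) produces the output below:

sudo tcpdump -vv -n -i eth0 udp port 8472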
06:22:23.699846 IP (tos 0x0, ttl 64, id 7598, offset 0, flags [none], proto UDP (17), length 133)
192.3.59.220.25362 > 147.135.114.20.8472: [bad udp cksum 0xd2ae -> 0x245b!] OTV, flags [I] (0x08), overlay 0, instance 1
IP (tos 0x0, ttl 63, id 33703, offset 0, flags [none], proto UDP (17), length 83)
172.19.192.0.13169 > 172.19.195.166.53: [udp sum ok] 41922+ [1au] A? www.google.com. ar: . OPT UDPsize=4096 (55)
Further testing on the receiving end showed that the packet was transferred but dropped on the target node. That made it certain that the bad checksum was what caused the DNS query to time out with no response.
A little more Googling showed that this could be caused by “checksum offloading”. That means when the kernel wants to send a packet out on a physical Ethernet card, it can leave the checksum calculation to the card hardware. In this case, if you capture the packet from the kernel, it will show a wrong checksum, since it has yet to be calculated; but the same packet captured on the receiving end will have a different, correct checksum, calculated by the sender’s network card hardware.
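You can list an interface’s current offload settings with ethtool -k (lowercase -k queries them; the capital -K used below changes them). Here eth0 again stands in for whatever your physical NIC is called:

sudo ethtool -k eth0 | grep -i checksum
# rx-checksumming: on
# tx-checksumming: on
#         tx-checksum-ip-generic: on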
Workaround
I tried using ethtool to disable TX (outgoing) checksum offloading on flannel.1 (the vxlan virtual interface), and the query worked again. So my guess is that the driver miscalculated, or forgot to calculate, the checksum with offloading turned on; with it off, the kernel calculates the correct checksum in software before handing the packet to the actual outgoing network card.
To temporarily turn off checksum offloading:
sudo ethtool -K flannel.1 tx-checksum-ip-generic off
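To confirm the setting took effect:

sudo ethtool -k flannel.1 | grep tx-checksum-ip-generic
# tx-checksum-ip-generic: off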
I used a systemd service to do this automatically once the interface appears. Note that the interface is created by flannel after kubelet runs, so you can’t simply execute the command at boot time, e.g. in /etc/rc.local.
You can use the following code to create the service /etc/systemd/system/xiaodu-flannel-tx-off.service, then enable and start it. (The service file can be downloaded using this link.)
sudo tee /etc/systemd/system/xiaodu-flannel-tx-off.service > /dev/null << EOF
[Unit]
Description=Turn off checksum offload on flannel.1
After=sys-devices-virtual-net-flannel.1.device
[Install]
WantedBy=sys-devices-virtual-net-flannel.1.device
[Service]
Type=oneshot
ExecStart=/sbin/ethtool -K flannel.1 tx-checksum-ip-generic off
EOF
sudo systemctl enable xiaodu-flannel-tx-off
sudo systemctl start xiaodu-flannel-tx-off
For systemd >= 245, a TransmitChecksumOffload parameter was added to *.link units. You can read the docs and try it out yourself, or just use the service above.
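For reference, a minimal sketch of what such a .link file might look like, assuming the default interface name; I haven’t tested this myself, so treat it as a starting point:

sudo tee /etc/systemd/network/90-flannel.link > /dev/null << EOF
[Match]
OriginalName=flannel.1

[Link]
TransmitChecksumOffload=no
EOF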
* If you want to learn more about how Flannel and Kubernetes networking work under the hood, I strongly suggest that you read this blog post, which gives a step-by-step demonstration of how a packet is sent from one pod to another.