Kubernetes + Flannel: UDP packets dropped for wrong checksum – Workaround

Recently I noticed some DNS queries in my Kubernetes cluster time out, causing apps to crash. I looked into the issue, reported to the kernel network team and applied the workaround.

Symptom

My Kubernetes cluster is built with Flannel overlay network with vxlan backend. The idea is that each node (machine) gets a private IP subnet to further allocate to pods. When a cross-node packet is to be sent, it was sent to the vxlan virtual interface, encapsulated in a UDP (regardless of the original protocol) packet and routed to the other node, where it is received by another vxlan interface and extracted.*

Kubernetes clusters provide Service resource. One of the many types of Service allows you to use a single virtual IP to represent multiple pods, some times across nodes. This is implemented with kube-proxy component, which utilizes the IPVS feature in Linux Kernel.

Now, when I make a DNS query on the host (it should be the same from inside containers, but with more hops) using dig against CoreDNS’ service IP, it always times out. It works fine if I query one of the backend pod’s IP instead.

Diagnosis

I used tcpdump to capture the packet, and noticed that the encapsulated UDP packet had a bad UDP checksum.

06:22:23.699846 IP (tos 0x0, ttl 64, id 7598, offset 0, flags [none], proto UDP (17), length 133)
    192.3.59.220.25362 > 147.135.114.20.8472: [bad udp cksum 0xd2ae -> 0x245b!] OTV, flags [I] (0x08), overlay 0, instance 1
IP (tos 0x0, ttl 63, id 33703, offset 0, flags [none], proto UDP (17), length 83)
    172.19.192.0.13169 > 172.19.195.166.53: [udp sum ok] 41922+ [1au] A? www.google.com. ar: . OPT UDPsize=4096 (55)

Further test on the receiving end shows that the packet is transferred, but dropped on the target node. That makes it certain that the checksum is what caused the DNS query to time out with no response.

A little more Googling shows that this could be caused by “Checksum offloading“. That means if the kernel wants to send a packet out on a physical ethernet card, it can leave the checksum calculation to the card hardware. In this case, if you capture the packet from kernel, it will show a wrong checksum, since it has yet to be calculated; but, the same packet captured on the receiving end will have a different and correct checksum, calculated by the sender’s network card hardware.

Workaround

I tried to use ethtool to disable TX (outgoing) checksum offloading on flannel.1 (vxlan virtual interface), and the query works again. So my guess is the kernel driver miscalculated / forgot to calculate the checksum with offloading turned on; when it’s off, it used kernel code to calculate the correct checksum before sending it to the actual outgoing network card.

To temporarily turn off checksum offloading:

sudo ethtool -K flannel.1 tx-checksum-ip-generic off

I used a systemd service to automatically do this after the interface appears. Note that the interface was created by flannel after kubelet is run, so you can’t simply execute it at boot time, e.g. in /etc/rc.local.

You can use the following code to create the service /etc/systemd/system/xiaodu-flannel-tx-off.service, then enable and start it. (The service file can be downloaded using this link.)

sudo tee /etc/systemd/system/xiaodu-flannel-tx-off.service > /dev/null << EOF
[Unit]
Description=Turn off checksum offload on flannel.1
After=sys-devices-virtual-net-flannel.1.device

[Install]
WantedBy=sys-devices-virtual-net-flannel.1.device

[Service]
Type=oneshot
ExecStart=/sbin/ethtool -K flannel.1 tx-checksum-ip-generic off
EOF
sudo systemctl enable xiaodu-flannel-tx-off
sudo systemctl start xiaodu-flannel-tx-off

For systemd >= 245, they added TransmitChecksumOffload parameter to *.link unit. You can read the docs and try it out yourself, or just use the service above.

* If you want to learn more about how Flannel and Kubernetes networking works under the hood, I strongly suggest that you read this blog post, which gives an step-by-step demonstration of how a packet is sent from one pod to another.

发表评论

电子邮件地址不会被公开。 必填项已用*标注