Kubernetes + Flannel: UDP packets dropped for wrong checksum – Workaround

Update on July 22

Stable versions have been released in all supported branches (v1.16.13, v1.17.9 and v1.18.6) that include the fix needed. According to the change log:

Fixes a problem with 63-second or 1-second connection delays with some VXLAN-based network plugins which was first widely noticed in 1.16 (though some users saw it earlier than that, possibly only with specific network plugins). If you were previously using ethtool to disable checksum offload on your primary network interface, you should now be able to stop doing that. [ref]

Update on June 17

After 3 months, the problem has been located and solved from Kubernetes’ end. Long story short, they decided that neither Linux kernel nor Flannel required any change, instead a mark added by kube-proxy caused the kernel to double-NAT the packet, and in turn sent with a wrong checksum. A pull request has been merged to master, and soon to be backported to release branches.

To see the whole story, check out the following links:


Recently I noticed some DNS queries in my Kubernetes cluster time out, causing apps to crash. I looked into the issue, reported to the kernel network team and applied the workaround.

Symptom

My Kubernetes cluster is built with Flannel overlay network with vxlan backend. The idea is that each node (machine) gets a private IP subnet to further allocate to pods. When a cross-node packet is to be sent, it was sent to the vxlan virtual interface, encapsulated in a UDP (regardless of the original protocol) packet and routed to the other node, where it is received by another vxlan interface and extracted.*

Kubernetes clusters provide Service resource. One of the many types of Service allows you to use a single virtual IP to represent multiple pods, some times across nodes. This is implemented with kube-proxy component, which utilizes the IPVS feature in Linux Kernel.

Now, when I make a DNS query on the host (it should be the same from inside containers, but with more hops) using dig against CoreDNS’ service IP, it always times out. It works fine if I query one of the backend pod’s IP instead.

Diagnosis

I used tcpdump to capture the packet, and noticed that the encapsulated UDP packet had a bad UDP checksum.

06:22:23.699846 IP (tos 0x0, ttl 64, id 7598, offset 0, flags [none], proto UDP (17), length 133)
    192.3.59.220.25362 > 147.135.114.20.8472: [bad udp cksum 0xd2ae -> 0x245b!] OTV, flags [I] (0x08), overlay 0, instance 1
IP (tos 0x0, ttl 63, id 33703, offset 0, flags [none], proto UDP (17), length 83)
    172.19.192.0.13169 > 172.19.195.166.53: [udp sum ok] 41922+ [1au] A? www.google.com. ar: . OPT UDPsize=4096 (55)

Further test on the receiving end shows that the packet is transferred, but dropped on the target node. That makes it certain that the checksum is what caused the DNS query to time out with no response.

A little more Googling shows that this could be caused by “Checksum offloading“. That means if the kernel wants to send a packet out on a physical ethernet card, it can leave the checksum calculation to the card hardware. In this case, if you capture the packet from kernel, it will show a wrong checksum, since it has yet to be calculated; but, the same packet captured on the receiving end will have a different and correct checksum, calculated by the sender’s network card hardware.

Workaround

I tried to use ethtool to disable TX (outgoing) checksum offloading on flannel.1 (vxlan virtual interface), and the query works again. So my guess is the kernel driver miscalculated / forgot to calculate the checksum with offloading turned on; when it’s off, it used kernel code to calculate the correct checksum before sending it to the actual outgoing network card.

To temporarily turn off checksum offloading:

sudo ethtool -K flannel.1 tx-checksum-ip-generic off

I used a systemd service to automatically do this after the interface appears. Note that the interface was created by flannel after kubelet is run, so you can’t simply execute it at boot time, e.g. in /etc/rc.local.

You can use the following code to create the service /etc/systemd/system/xiaodu-flannel-tx-off.service, then enable and start it. (The service file can be downloaded using this link.)

sudo tee /etc/systemd/system/xiaodu-flannel-tx-off.service > /dev/null << EOF
[Unit]
Description=Turn off checksum offload on flannel.1
After=sys-devices-virtual-net-flannel.1.device

[Install]
WantedBy=sys-devices-virtual-net-flannel.1.device

[Service]
Type=oneshot
ExecStart=/sbin/ethtool -K flannel.1 tx-checksum-ip-generic off
EOF
sudo systemctl enable xiaodu-flannel-tx-off
sudo systemctl start xiaodu-flannel-tx-off

For systemd >= 245, they added TransmitChecksumOffload parameter to *.link unit. You can read the docs and try it out yourself, or just use the service above.

* If you want to learn more about how Flannel and Kubernetes networking works under the hood, I strongly suggest that you read this blog post, which gives an step-by-step demonstration of how a packet is sent from one pod to another.

Solution: After switching to SyntaxHighlighter Evolved, all the codes are scrambled

tl;dr

If you switched from another syntax highlighting plugin to SyntaxHighlighter Evolved, and all your codes are scrambled, try running the following code as a single-file plugin.

function xiaodu_syntaxhighlighter_fix() {
	return 2;
}
add_filter('syntaxhighlighter_pre_getcodeformat', 'xiaodu_syntaxhighlighter_fix');

The easiest way is to go to Plugins – Plugin Editor, and paste the code at the bottom of any enabled plugin, maybe Hello Dolly.

Long version

After almost three years I finally started working on this blog again.

One of the first things I noticed is that the syntax highlighting plugin I used, Crayon Syntax Highlighter, is dead. Well, to its credit, the server-rendered markups it generated were… fine when I started this blog, but nowadays they just look sickening to me, especially when compared to the neat client rendering solutions.

So I went to the store and downloaded the most popular choice, SyntaxHighlighter Evolved, which uses the JavaScript library SyntaxHighlighter to perform client-side highlighting. After installing it and converting all my old code tags to their markup format, I found the highlighting to be working, but all the C++ and HTML looked screwed up.

Scrambled code
Scrambled code

As you can see, all the “<“, “>” and “&” in the code are now showing up as their HTML entities – “&lt;”, “&rt;” and “&amp;”. That is not cool, so I looked into the problem.

Looking under the hood

First thing we need to know is how the code is stored in the post. When I click on the “Text” tab in the post editor (yep, the old one… I haven’t adapted to the new blocks yet,) I found that the characters are displaying correctly.

Code in post editor
Code in post editor

Then I looked further into MySQL, and the code is stored encoded, which is fine – it can be stored in the database however it fits, as long as the final output is correct… which it isn’t.

Code in MySQL
Code in MySQL

Pinpointing the plugin code

Now that I know that the plugin did an extra encoding, I looked for “htmlspecialchars” in the plugin’s GitHub repository, and found this piece of code:

	// This function determines what version of SyntaxHighlighter was used when the post was written
	// This is because the code was stored differently for different versions of SyntaxHighlighter
	function get_code_format( $post ) {
		if ( false !== $this->codeformat )
			return $this->codeformat;
		if ( empty($post) )
			$post = new stdClass();
		if ( null !== $version = apply_filters( 'syntaxhighlighter_pre_getcodeformat', null, $post ) )
			return $version;
		$version = ( empty($post->ID) || get_post_meta( $post->ID, '_syntaxhighlighter_encoded', true ) || get_post_meta( $post->ID, 'syntaxhighlighter_encoded', true ) ) ? 2 : 1;
		return apply_filters( 'syntaxhighlighter_getcodeformat', $version, $post );
	}
	// Adds a post meta saying that HTML entities are encoded (for backwards compatibility)
	function mark_as_encoded( $post_ID, $post ) {
		if ( false == $this->encoded || 'revision' == $post->post_type )
			return;
		delete_post_meta( $post_ID, 'syntaxhighlighter_encoded' ); // Previously used
		add_post_meta( $post_ID, '_syntaxhighlighter_encoded', true, true );
	}

Apparently years ago they changed how codes are stored in the post. Now if you write and save a new post with their plugin installed, they will save the code already encoded, and insert a post meta “_syntaxhighlighter_encoded = True” to mark the post as the “new (encoded) format”.

But if you are like me who used other plugins when initially posting the code and later switched to Evolved, you are in bad luck, as they consider your post by default the “old format,” and will encode the code again in the final output.

Solution

The apparent solutions it to make the plugin think that all my posts are in the new format. I could add the same metadata to each of the posts, but luckily there is a easier way: Use the filter “syntaxhighlighter_pre_getcodeformat” (line 8 in the code above) they provided to override the result.

So I used the plugin code at the beginning to hook it. The hook function simply returns 2, which means all my posts, with or without the metadata, will be considered the already-encoded new format, so they will not be doubly encoded.

It’s been years, is that all you have to say?

OK, fair enough. So this blog may look the same, but the tech underneath it is constantly changing.

For example, all my appliances are now hosted on my own bare-metal (as opposed to cloud-vendored like GKE) globally-distributed Kubernetes cluster. Also, I have been hiding behind CloudFlare for years to avoid the haters, but now they are mostly gone (or grown up ?,) so I have been thinking of new ways to distribute my content.

All of these new stuff are exciting and worth sharing, and I will write about them soon™.

记一次杯具与恢复的全过程(安卓修改屏幕分辨率)

我手里有一个三星的安卓平板(SM-P601),分辨率是 2560×1600,像素密度 320,经常被某些傻X APP 认为不是移动设备。今天刚发现可以通过一个 wm 命令来临时修改分辨率,于是手贱改了个 640×360,然后悲剧发生了:因为我是在 Terminal Emulator 里 Root 修改的,没有开 ADB,然后屏幕小到看不全,也没法打 wm reset,重启也不管用。借助万能的谷歌,终于坎坷的恢复回去了。下面记录一下各种坑……

第一关:怎么启用 ADB?

在网上搜了一下,果然有蛋疼的老外跟我一样蠢,在没开 ADB 的情况下修改了分辨率,还改不回去了,例如 XDA 的这位仁兄,然后他找出了如何通过 Recovery 强制启用 ADB 的方法:

我用的是 TWRP,首先长按电源键重启,并按住 Home + 音量加进入 Recovery(不同设备组合键也不一样)。然后在 Mount 中挂载 System,再进入 Advanced – Terminal,并使用 vi /system/build.prop 加入以下行:

persist.service.adb.enable=1                                                    
persist.service.debuggable=1
persist.sys.usb.config=mtp,adb

如果不会使用 vi 或者觉得麻烦,可以用 File Manager 把 /system/build.prop 拷到 /sdcard,然后利用 TWRP 的 MTP 把它复制到电脑上修改,然后再拷贝回去。

重启系统后,电脑上使用 adb devices 可以看到设备了,然而后面显示 unauthorized,无法进入 adb shell……于是进入第二关。(如果你的设备版本较低无需 ADB 授权,或以前授权过这台电脑,可以直接跳到最后。)

第二关:怎么授权 ADB?

还是没办法进行屏幕操作。一般连上 ADB 之后都会有个“是否允许 USB 调试”的提示,点允许就授权好了。于是又放狗一搜,果然也有 StackOverflow 的老外解决了这个问题。

首先找到你电脑上的 adbkey.pub 公钥,位置可能在 $ANDROID_SDK_HOME 环境变量对应的位置(如果安装过 SDK)、C:\Users\用户名\.android(Windows 默认)或 ~/.android(其它系统默认)。找到之后,改名为 adb_keys,并利用 TWRP 的 MTP 传输到手机的 /sdcard(也就是内部存储)根目录。

然后回到 TWRP 的 Advanced – File Manager,找到 /sdcard/adb_keys 并复制到 /data/misc/adb 目录下。重启手机即可连接 ADB。

如何正确地修改分辨率?

在系统中进入 adb shell,在修改分辨率时要保证屏幕是关闭(锁屏)状态,所以最好不要在 Terminal Emulator 命令行中修改。

要恢复默认的分辨率,根据系统版本不同对应的命令是:

  • am display-size reset(< Android 4.3)
  • wm size reset(≥ Android 4.3)

然后如果要修改分辨率或密度(density),使用的命令是:

  • am display-size 1080x720
  • am display-density 240
  • wm size 1080x720
  • wm density 240

ngx_pagespeed module for Ubuntu’s stock nginx

I recently decided to build ngx_pagespeed for Ubuntu’s stock nginx, since nginx supports dynamic modules (as Apache did long ago) since 1.9.11. Being “dynamic” means that third-party modules don’t have to be built into the main nginx binary, instead can be separately built and dynamically load at runtime, as long as the module and main nginx are compatible, meaning their configuration parameters are similar enough.

(中文版请参见另一篇文章:为 Ubuntu 官方源的 nginx 单独编译 ngx_pagespeed 模块

I have not seen anyone else build any third-party modules separately. nginx.org official repository provides several modules as packages, but I believe they are built along with the binary they shipped. A probable reason is that even though modules built separately can be loaded, they are not guaranteed to work smoothly. I saw that nginx 1.11.5 is trying to improve compatibility of dynamic modules, but whether that works remains to be tested.

I used the latest Debian (Ubuntu) packaging tools and formats to build the package, and Launchpad PPA to host the apt repository.

You can use the apt commands shown on the page to enable the PPA and install the packages. The following is an example:

sudo add-apt-repository ppa:du9l/ngx-pagespeed
sudo apt-get update

To use this module, first install the right flavored package, then add the line:
load_module modules/ngx_pagespeed.so;
… to your /etc/nginx/nginx.conf file before reloading nginx.

Please remember that this is an experimental feature, and you should NOT use it in production. More information can be found in the PPA description below.

PPA Description

This repository contains “ngx_pagespeed” dynamic module for Ubuntu’s stock nginx packages, including all flavors available in the official repository.

* WARNING: Building dynamic modules alone is EXPERIMENTAL. It is NOT guaranteed to work by the nginx authors. Even though the module can be loaded and has been tested on my own server, I still don’t recommend using it in PRODUCTION environments. USE AT YOUR OWN RISK. *

The package names are “ngx-pagespeed-nginx-FLAVOR” where FLAVOR is core / light / full / extras, which should match your nginx flavor. Check with:
$ dpkg -l | grep nginx

The versions follow ngx_pagespeed’s latest stable versions and Ubuntu’s REL-updates (e.g. xenial-updates) nginx versions. For example, the first version available in this PPA is built with ngx_pagespeed 1.11.33.4 and xenial-updates’ nginx 1.10.0-0ubuntu0.16.04.4 versions.

To use this module, first install the right flavored package, and then add the line:
load_module modules/ngx_pagespeed.so;
… to your /etc/nginx/nginx.conf file before reloading nginx.

* NOTE: This package is linked against the pre-built “psol” binaries provided by Google, so only i386 and amd64 systems are supported for now. In the future I will update the package to build psol itself. *

 

为 Ubuntu 官方源的 nginx 单独编译 ngx_pagespeed 模块

我最近尝试为 Ubuntu 源里的 nginx 编译 ngx_pagespeed 模块。nginx 从 1.9.11 开始支持动态模块,即第三方模块不再需要编译进 nginx 主程序,而是可以动态加载,条件是编译 nginx 和模块的参数基本相同。

English version: ngx_pagespeed module for Ubuntu’s stock nginx

目前网上好像还没有提供动态模块的先例,只有 nginx.org 官方源中有提供几个模块,但应该也是跟主程序一起编译的。原因可能是这样编译的模块虽然可以加载,但不一定可以稳定运行。据我观察 nginx 在 1.11.5 版本之后有增强模块兼容性的举动,具体是否好用还不好说。

在编译模块软件包的过程中,我使用的都是 Debian (Ubuntu) 的最新编译、打包工具和格式,同时使用 Launchpad PPA 来提供 apt 源。

可以使用上面 PPA 源中的命令来启用源、安装软件包。常用的命令应该是:

sudo add-apt-repository ppa:du9l/ngx-pagespeed
sudo apt-get update

使用模块前,先安装对应于 nginx 版本,然后在 /etc/nginx/nginx.conf 配置文件中添加下面一行:
load_module modules/ngx_pagespeed.so;
最后重载(reload)nginx 配置即可。

请注意,单独编译模块是实验性的特性,请不要在生产环境中使用该模块。更多信息请参见的 PPA 的介绍。

(以下是之前文章的备份)

打包 Debian/Ubuntu 软件的感想

最近正在尝试为 Ubuntu 打包一个 .deb 软件,然后看了一遍 Ubuntu 和 Debian 的打包文档,感觉简直是复杂到爆了。同样是全功能的软件包管理器,Arch Linux 的 pacman 只需要一个 PKGBUILD 加一些可选的额外文件(例如安装脚本)就能搞定,deb 系统需要一大堆脚本来辅助“简化”打包工作,学习曲线实在太陡了。经过这一番折腾,也让我明白了平时一个简单的 apt-get 背后有多少人默默的努力……

大概记录一下为一个软件打包、发布到 Launchpad PPA 的全过程。根据具体软件可能有些步骤不同,主要参考文档是 Debian 维护人员手册(有中文)。

  1. 创建目录并下载源代码。通常建议创建一个专门的目录(例如 ~/package),下载源代码 *.tar.gz 文件,并解压缩成单独的源代码目录(~/package/hello-2.10)。
  2. 初始化 debian 目录。在刚才的源代码目录中,使用 dh_make 工具创建 debian 目录的基础结构(~/package/hello-2.10/debian),例如 dh_make -f ../hello-2.10.tar.gz
  3. 可选:将源代码针对 Debian/Ubuntu 进行必要的修改,使用 quilt 工具将修改补丁管理在 debian/patches 目录中。
  4. 按照需要修改 debian 目录中的控制文件,其中最重要的三个是:
    1. control – 管理软件包信息,包括依赖包和介绍等。
    2. rules – 管理软件包配置(configure)和构建(build)的具体步骤。使用的是 GNU Make 的格式,可以根据需要修改特定的 target。这里是水最深的,每个 target 默认调用一些 dh_xxx 的脚本,又可以通过指定 override_dh_* 名称的 target 来覆盖这些脚本的默认操作等等。
    3. changelog – 软件包版本更新信息,构建工具通过它来确定要构建的版本号。
  5. 在源代码目录中执行构造命令。这里有很多种命令可以用,例如 dpkg-buildpackage、debuild、pbuilder 等。大概都是一个高层包装(wrap)另一个底层的关系。
  6. 编译成功后,需要用自己的 PGP 密钥签名 .dsc 和 .changes 文件。debuild 之类的工具可以代劳,也可以手动用 debsign 工具。
  7. 把软件包上传到源。例如 Launchpad PPA 就要求使用 dput 工具上传,而且只传源代码包,它们用自己的服务器编译并发布 deb 二进制包。

简单来说就是这些步骤,其实每一步用到的工具都很复杂,我研究了两天才顺利打好我需要的软件包,准备测试之后再发出来。绝对黑科技,敬请期待。