I have some VPS servers with good connections and rather limited hardware resources, which cannot be used as an edge node to handle Kubernetes traffic by itself. For instance, a VPS server with CN2-GIA connection is perfect for Mainland Chinese visitors, but it comes only with a shy 512 MB of memory, which is a little tight for running Docker, Kubelet and Nginx Ingress Controller.
So I am trying to set up an forwarding mechanism to proxy the traffic between visitors and another more powerful Kubernetes node. A simple solution might be to run a TCP port proxy, such as nginx, haproxy, socat, or even iptables’ MASQUERADE feature from the Linux kernel. The problems with this method are:
The original visitor IP will be lost and replaced with the edge’s IP.
Surely you can use the PROXY Protocol to wrap and forward the client IPs, but that would require additional setup. For instance, Nginx Ingress Controller allows you accept only traffic with or without PROXY Protocol, but not both.
With the exception of kernel forwarding, traffic must be handled by the user space, which can increase the load on an underpowered node.
Traffic is forwarded between nodes as-is, with no additional encryption.
This will be problematic if the two nodes are not connected with a private network, and the traffic forwarded is in a plaintext protocol.
I experimented with a new solution, which overcomes all the problems mentioned above. Here is an overall illustration of the setup:
The following are the incoming and outgoing traffic flows, along with the commands used to set up forwarding:
Incoming traffic
When a client from [Client.IP] hits [Slim.Edge.Public.IP]:80, the traffic will be received by the edge server’s public network interface.
Edge server changes the destination address to [Powerful.Node.Private.IP]:80 using the kernel’s DNAT mechanism. This happens in the pre-routing stage. iptables -t nat -A PREROUTING -p tcp -m multiport --dports 80,443 -d [SLIM.EDGE.PUBLIC.IP] -j DNAT --to-destination [POWERFUL.NODE.PRIVATE.IP]
Note that unlike other proxy setups, there is no masquerading (iptables’ MASQUERADE action) in place, because we want to preserve the original source IP.
The kernel will pass on the incoming traffic to edge server’s WireGuard interface. By default, forwarding traffic is not allowed, and a forwarding rule has to be added explicitly. iptables -t filter -A FORWARD -o [SLIM.EDGE.WIREGUARD.INTERFACE] -j ACCEPT
Here [SLIM.EDGE.WIREGUARD.INTERFACE] is the slim node’s WireGuard interface, such as wg0.
Also remember to enable the net.ipv4.ip_forward toggle. This is pretty routing, so I won’t get into details.
Powerful node will receive the traffic on their WireGuard interface, and since the ingress controller (or HTTP server, or whatever daemon you are trying to forward the traffic to) is listening on the host interface with a wildcard address (0.0.0.0), the forwarded packets will be handled just like direct ones, with the client’s source IP addresses in tact.
Outgoing traffic
The powerful node will probably reply with a packet from [Powerful.Node.Private.IP]:80 to [Client.IP]:(source-port).
Use an IP rule to make sure that traffic coming out of the private interface, regardless of the destination address, will be sent to the WireGuard interface. ip rule add from [POWERFUL.NODE.PRIVATE.IP] table 1234 prio 5678
Note that the routing table ID (1234) should be set in the WireGuard configuration (Table = 1234) in order for the wg-quick script to create and fill this routing table. The priority number (5678) can be anywhere between 1 and the main rule (usually 32766).
When setting up the WireGuard tunnel, make sure to put AllowedIPs = 0.0.0.0/0 (or the IPv6 counterpart ::/0) in the edge node’s peer definition.
You can examine the generated routing table by running: ip route show table 1234, and the output should be similar to: default dev [POWERFUL.NODE.WIREGUARD.INTERFACE] scope link
The traffic will be received by the edge node, which it needs to change the source IP from [Powerful.Node.Private.IP] back to its own public IP, using the kernel’s SNAT mechanism. iptables -t nat -A POSTROUTING -p tcp -s [POWERFUL.NODE.PRIVATE.IP] -j SNAT --to-source [SLIM.EDGE.PUBLIC.IP]
From the edge node’s perspective, this packet is also a forwarded one, so another forwarding rule should be added. iptables -t filter -A FORWARD -i wg-pe-ba-la -j ACCEPT
With this setup, I can achieve the goal of receiving traffic using the slim edge node, while avoiding the issues mentioned above: The client IP is preserved during the whole process, the traffic forwarding happens entirely in the kernel space, and traffic between servers are encrypted.
However, the encryption is totally optional, so the WireGuard tunnel between servers can be easily replaced with other layer-3 tunnels like IPIP or GRE.
I have some servers that don’t come with native IPv6 connectivity, which means that in order to use the next generation protocol, they need to be tunneled by other IPv6-capable nodes over IPv4.
In the past I have exclusively gone for the Tunnel Broker service provided by Hurricane Electric. I loved their service not only because it is free and easy to set up, but also for the reasonably good quality of their tunnel, since HE is a well-known transit provider. But recently, one of my servers which I use as an Internet exit has been suffering when it tries to make connections to IPv6-enabled websites. The symptom is simple – I can ping6 some addresses but not others, and the frequency is getting higher. So, I decided to set up a private tunnel endpoint using one of my own IPv6-enabled servers.
Prerequisites
I want to mimic the Tunnel Broker service as much as possible, because it is known to work. The current service provides tunnel users with the following stuff:
“Server IPv4 Address”: The remote IPv4 tunnel endpoint, like 66.220.*.*.
“Client IPv6 Address”: An IPv6 address representing the host connecting to the tunnel, like 2001:470:c:*::2.
“Server IPv6 Address”: An IPv6 address representing the tunnel server, also used as the IPv6 gateway of the client host, like 2001:470:c:*::1.
“Routed IPv6 Prefixes”: /64 or /48 subnets given to the tunnel operator to provide IPv6 connectivity to other internal networks through the tunnel.
I made one of my IPv6-connected servers the designated tunnel server. In order to be used as such, it has the following to provide:
A public IPv4 address: I will use this as the “Server IPv4 Address”, which means my client host will connect to this endpoint over IPv4.
Three routable IPv6 addresses: The specs actually say I have 10, and I believe they would route the whole /64 to me if I set it up right. But for this particular use case, 3 is enough: *::1 is the address of the tunnel server, *::2 and *::3 as Server and Client IPv6 Address respectively.
Since I don’t have any other subnets to make routable, I don’t need to provide another routable IPv6 prefix.
Connecting tunnel client and server
We first need to make the client and server hosts communicable using their IPv6 addresses. The protocol used by Tunnel Broker, and thus my new tunnel, is Simple Internet Transition (SIT). It is supported by Linux kernel natively and quite easy to set up. In fact, the Tunnel Broker service provides users with sample client configurations depending on their preferred network management tools. Here is an example using iproute2:
modprobe ipv6
ip tunnel add sit-ipv6 mode sit remote [SERVER-IPV4] local [CLIENT-IPV4] ttl 255
ip link set sit-ipv6 up
ip addr add [CLIENT-IPV6]/127 dev sit-ipv6
ip route add ::/0 dev sit-ipv6
ip -f inet6 addr
For my configuration, the client IPv6 address is *::3, and the netmask is set to /127 to include both ends’ addresses. If one wants to persist the configuration, they can use the method provided by their operating systems. Here is the example client configuration using Netplan (used at least by Ubuntu 18.04):
The thing about SIT tunnels is that they are symmetrical, so in order to set up the server end, one need to make the following changes:
Switch the server and client IPv4 addresses, so that the one after “local” is the IPv4 address of the configured machine.
Replace CLIENT-IPV6 with SERVER-IPV6 as the interface’s IPv6 endpoint.
Remove the route / gateway definition, since the server already has an external IPv6 gateway.
By now, both the tunnel server and client hosts should be able to reach each other with their brand new IPv6 addresses. This can be verified by running ping6 [SERVER-IPV6] on the client side, and vice versa.
Forwarding Tunneled Traffic
In order for the tunneled host to actually reach the global Internet, the tunnel server has to route IPv6 traffic from and to the host.
Forwarding Outgoing Traffic
Since [SERVER-IPV6] is configured to be the IPv6 gateway on the client host, all its traffic with a remote IPv6 destination address will be sent over the tunnel to the server side. By default, a server will not take the role of routing that traffic – it will only receive traffic destined to itself. To make it also forward traffic to the next hop, we need to enable packet forwarding in the kernel parameters. This can be done by running the following as root:
This can be persisted across reboots by appending net.ipv6.conf.[SERVER-TUNNEL-INTERFACE].forwarding=1 to /etc/sysctl.conf. Note that if you have firewalls like ip6tables, you may need to configure its forwarding rules, or change the default forwarding policy to ACCEPT.
Accepting Incoming Traffic
When there is traffic coming in for the tunnel server, but has the destination address of the client host, the tunnel server’s gateway will attempt to use “Neighbor Solicitation Message” to verify its reachability. But the client host’s IPv6 address is absent on all interfaces of the server host, so it will not reply said message, causing the incoming traffic to be dropped.
In order for the tunnel server to respond to the solicitation message with a “Neighbor Advertisement Message”, we need to configure a NDP proxy for the server’s external interface. The first step is to enable NDP proxy in the Linux kernel:
This parameter can be persisted in the same way as shown in the last section. Then we have to explicitly enable NDP proxy for the client IPv6 address. Using iproute2 this can be done as:
ip -6 neigh add proxy [CLIENT-IPV6] dev [SERVER-EXTERNAL-INTERFACE]
This line means that when the external router wants to reach the client IPv6 address on the interface, the server will respond with its own address. Then, when the traffic destined for the client host arrives, the server will forward it to the tunnel interface, since we configured a /127 subnet above to include IPv6 addresses of both ends. This can be shown by observing the routing table from running ip -6 route on the server.
The command also needs to be persisted, so that client hosts will not lose connectivity after the server reboots. The way of persistence varies by the network management tool used by the server. For ifupdown the command can be written in /etc/network/interfaces; If the server is using Netplan, the location where this command goes should probably be /etc/networkd-dispatcher/routable.d, since Netplan doesn’t come with native hook support.
Summary
I would like to revisit the route an outgoing packet will go through. Let’s say a process on the client host wants to access 2001:4860:4860::8888:
According to the routing table on the client end, the traffic should be forwarded to the gateway, SERVER-IPV6.
Then it will notice that the SERVER-IPV6 address belongs to the /127 subnet on sit-ipv6 interface.
When the packet is forwarded to the sit-ipv6 tunnel interface, it will be encapsulated with an IPv4 header, and sent to the SERVER-IPV4 address. This could be across the IPv4 Internet, or a private connection if there is one.
The encapsulated packet will be received by the sit-ipv6 interface on the server’s end, and unpacked to its original IPv6 form.
Since the IPv6 destination is an external one, and we have enabled forwarding on the server, it will be routed to the external gateway according to the routing table.
When the remote server replies, the packet goes the exact opposite way back to the client host.
In this article I will show some different CDN implementations along with cases where each of them fails to bring the best performance. I am not a researcher in this area, so some of the points are based on my personal experiences.
1. Why people use CDN
Usually when people visit a website, their browsers first query the IP address from their ISP’s recursive DNS server, which in turn query the domain’s authoritative DNS. Then they will be connected to whatever IP address returned by the DNS which is usually distant (geographically far and has high latency) from them.
That is why people use CDN to solve this problem. By putting edge caching servers in different areas of the world (or in the target country) close to end-users, they can speed up the load time of their websites and improve user experience.
2. Traditional Geo-DNS setup
The most common and simple solution is to use a “Geo-DNS” service to point the same domain to different edge servers with different IP addresses (this is important). In this case, when people query the IP address of a domain, the authoritative Geo-DNS can point them to the nearest edge server, based on the users’ IP, their ISPs’ recursive DNS servers’ IP, or users’ IP provided by recursive DNS via EDNS Client Subnet (EDNS0) if supported.
This works fine in an ideal network setup, but it can easily fall apart. Here are some cases that I have come across:
(a) Unicast DNS
Sometimes Geo-DNS providers don’t use Anycast, instead provide Unicast IP addresses for different regions. The recursive DNS has no way of telling which one is closer to them, so it queries a random one, which can result in slow DNS resolution at first visit.
Example: DNSPod and CloudXNS (Popular Unicasted Geo-DNS providers in China)
(b) Bad GeoIP database
Some Geo-DNS providers don’t update their GeoIP database frequent enough, or just don’t have enough data.
Example:
(1) Amazon AWS CloudFront and Akamai don’t have servers in China for obvious reasons, but Chinese visitors are not consistently directed to nearest (Hong Kong, South Korea, Japan) locations. Sometimes a query from China can get a response of European locations, which results in ~500 ms latency.
(2) Some Geo-DNS providers in China, most notably Aliyun DNS. When both “Domestic” and “Global” records are set, they may direct Chinese users to “Global” servers.
(c) DNS different from network exit
Sometimes people may use recursive DNS servers in the network different from their actual network exit.
Example:
(1) In my university, we have mixed network exits, one in CERNET (AS4538) and one in TieTong (AS9394). Our recursive DNS has a CERNET address, so most Geo-DNS providers gives CERNET or (if the website doesn’t have CERNET servers) other networks’ addresses, for instance ChinaNet (AS4134). But our network exit is configured to use TieTong by default, so for most websites we are visiting ChinaNet servers with a TieTong network, even if they also have TieTong servers.
A more extreme case is that, in some networks they have different routing policies for TCP and UDP (which is a violation of OSI model), so when you do DNS query in UDP you have network A’s address, and when you actually connect to TCP port 80 you have network B. Magical? But true.
(2) Sometimes recursive DNS providers and/or Geo-DNS providers don’t support EDNS0. As long as either end doesn’t support it, it will not work. For instance, if user of open recursive DNS service “114DNS” (Anycasted across several Chinese networks) has a network that is not present in 114DNS’ Anycast network, and the authoritative Geo-DNS doesn’t support EDNS0, it will return the IP in the same network of 114DNS’ node, but different from the network of the user.
3. TCP Anycast setup
Some modern CDN providers use TCP Anycast technique, which means they provide a single IP address for their edge servers in multiple locations, and visitors are directed to the nearest location, decided by how they broadcast their routing tables to other networks.
Such providers include CloudFlare and MaxCDN, which use a single Anycasted IP for their edge servers across the planet. Verizon EdgeCast use a slightly different method where they provide several Anycasted IPs, each represent a geographical zone (Asia-Pacific, North America, South America, Europe).
A unified Anycasted IP solves many of the problems mentioned above, and it’s becoming harder and harder to defeat them. But here they comes:
(1) Magical routing policy (again)
Current Chinese IPv6 implementation has only one international exit (AS23911), which has two exit points: a default one in Los Angeles by HE.net (AS6939) and a premium one in Hong Kong (HKIX). When I resolve EdgeCast’s IPv6 address, I get one in 2606:2800:147::/48 network, which is Anycasted in Asia. But when I trace route to this address, the packet goes from China to Los Angeles and back to Asia, resulting in ~400 ms latency. Even if people use an Anycasted recursive DNS (like Google’s), since it has servers in Hong Kong, the result is the same. By querying the domain at OpenDNS (which doesn’t have Asian server) I get the IP in 2606:2800:11f::/48 network, which is Anycasted in North America, and the latency is only ~200 ms (same as the network exit’s).
This only happens with EdgeCast’s “Continent-based Anycast” network. CloudFlare is not affected. But it has another kind of problem.
(2) Artificial routing deterioration
CloudFlare has edge servers everywhere, including Hong Kong, Taipei, Japan, South Korea, etc. which are all very close to Chinese users. But the major Chinese ISPs’ international exit routing policy directs CloudFlare traffic to Los Angeles (ChinaNet) and San Jose (China Unicom), where they are directed to the nearest edge servers in <3 hops. They did the same thing for Softlayer’s Hong Kong locations, for some magical reasons: maybe price, maybe [censored] ;). The latency from both ISP to CloudFlare’s US west locations are 200~300 ms, where with TieTong (which use Hong Kong as international exit) the value is <100 ms.
This is obviously not CloudFlare’s fault, because they cannot control the routing policy from another AS to themselves (unless they pay the other system to do so). If your ISP is doing this, switch to another ISP; If the whole country is doing this, maybe switch to another country?
Summary
To sum it up, when you and your customers’ networks don’t have any of the quirks above, a simple Anycasted Geo-DNS solution works fine – you don’t even need a commercial CDN service. But the real networks are hard, and so far a global TCP Anycast solution is the best we can do.