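We previously discussed Redis containerization in "Redis 容器化,是不是个'软柿子'" (roughly, "Is containerizing Redis a soft persimmon?", that is, an easy target). The industry has shared plenty of hands-on experience on the topic, and KubeBlocks has adapted Redis Cluster and offers a corresponding solution. What problems did KubeBlocks run into while containerizing Redis, and how were they solved? This article gives the "persimmon" a squeeze.
Background
Redis Cluster is the distributed deployment mode of Redis. It spreads data across multiple nodes to provide high availability and scalability: large datasets are sharded over the nodes, and shards can be migrated between nodes while the cluster stays online.
Redis Cluster manages data placement with hash slots. The key space is divided into a fixed number of hash slots (16384), and each slot is assigned to one node, so every node serves a subset of the slots. Clients connect directly to any node, with no proxy in between.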
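To make the slot mapping concrete, here is a minimal Python sketch of how a key is mapped to one of the 16384 slots: CRC16 (the XMODEM variant, which Python exposes as binascii.crc_hqx with an initial value of 0) modulo 16384, with the {...} hash-tag rule. The function names hash_tag and key_slot are illustrative and not part of any Redis client library.

```python
import binascii

SLOT_COUNT = 16384  # fixed number of hash slots in Redis Cluster


def hash_tag(key: bytes) -> bytes:
    """Return the part of the key that is hashed, honoring {...} hash tags."""
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        # Only a non-empty tag counts; "{}" falls back to the whole key.
        if end > start + 1:
            return key[start + 1:end]
    return key


def key_slot(key: str) -> int:
    """Map a key to its hash slot: CRC16 (XMODEM) of the key, modulo 16384."""
    return binascii.crc_hqx(hash_tag(key.encode()), 0) % SLOT_COUNT


if __name__ == "__main__":
    for k in ("user:1000", "{user:1000}.following", "{user:1000}.followers"):
        print(k, key_slot(k))
```

Keys that share a hash tag land in the same slot, which is what lets multi-key operations work against a single node.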
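In a typical deployment, the architecture consists of the Redis Cluster on the backend and a smart client on the application side.
Redis Cluster provides the following features:
Sharding and data migration: when nodes join or leave the cluster, hash slots and their keys can be migrated to other nodes online (resharding is triggered by the operator or by tooling such as redis-cli --cluster rebalance), which keeps the data evenly distributed.
High availability: Redis Cluster uses master-replica replication, and each master can have several replicas. When a master fails, one of its replicas is promoted automatically.
Load balancing: keys are spread across the masters by hash slot, so requests are spread as well. Clients can connect to any node; a node that does not own the requested slot does not proxy the request but answers with a MOVED or ASK redirection, and the smart client then talks to the owning node directly.
By distributing data across multiple nodes and providing automatic failover and load distribution, Redis Cluster lets applications handle large datasets and highly concurrent access. It is commonly used where high performance and scalability are required, such as caching, session storage, and real-time counters.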
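Reproducing the problem
Many KubeBlocks customers have a strong need for Redis Cluster, so we adapted Redis Cluster on KubeBlocks. Along the way we found that Redis Cluster has compatibility problems with some networking setups in Kubernetes container environments.
The steps to reproduce are as follows:
1. Install KubeBlocks 0.9.0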
slc@slcmac kbcli % ./bin/kbcli kubeblocks list-versions --devel
VERSION RELEASE-NOTES
0.9.0-beta.8 https://github.com/apecloud/kubeblocks/releases/tag/v0.9.0-beta.8
0.9.0-beta.7 https://github.com/apecloud/kubeblocks/releases/tag/v0.9.0-beta.7
slc@slcmac kbcli % kbcli kubeblocks install --version="0.9.0-beta.8"
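2. Install the redis-cluster addon
A redis addon is installed by default, but because of the network adaptation issue described in this article, the default addon does not yet support Redis Cluster properly.
# Disable the default addon first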
slc@slcmac addons % kbcli addon disable redis
# Install the latest addon from the branch
slc@slcmac addons % git clone git@github.com:apecloud/kubeblocks-addons.git
slc@slcmac addons % cd kubeblocks-addons/addons/redis
slc@slcmac addons % helm dependency build && cd ..
slc@slcmac addons % helm install redis ./redis
slc@slcmac addons % helm list
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
redis default 1 2024-04-15 21:29:37.953119 +0800 CST deployed redis-0.9.0 7.0.6
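To make the problem easier to reproduce, we slightly modified some of the addon's configuration and steps before running helm install redis.
3. Create the Redis Cluster
The instance is created in NodePort mode with 3 masters and 3 replicas.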
slc@slcmac addons % helm install redisc ./redis-cluster --set mode=cluster --set nodePortEnabled=true --set redisCluster.shardCount=3
slc@slcmac addons % kg pods | grep -v job
NAME READY STATUS RESTARTS AGE
redisc-shard-hxx-1 3/3 Running 0 14m
redisc-shard-hxx-0 3/3 Running 0 14m
redisc-shard-xwz-0 3/3 Running 0 14m
redisc-shard-xwz-1 3/3 Running 0 14m
redisc-shard-5g8-0 3/3 Running 0 14m
redisc-shard-5g8-1 3/3 Running 0 14m
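All the pods for the 3 masters and 3 replicas are created successfully, but the relationships between the cluster nodes have not been established yet.
Announce ip/port/bus-port on the three masters: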
redisc-shard-5g8-0
kubectl exec -it redisc-shard-5g8-0 -c redis-cluster -- redis-cli -a O3605v7HsS config set cluster-announce-ip 172.18.0.2
kubectl exec -it redisc-shard-5g8-0 -c redis-cluster -- redis-cli -a O3605v7HsS config set cluster-announce-port 30039
kubectl exec -it redisc-shard-5g8-0 -c redis-cluster -- redis-cli -a O3605v7HsS config set cluster-announce-bus-port 32461
redisc-shard-hxx-0
kubectl exec -it redisc-shard-hxx-0 -c redis-cluster -- redis-cli -a O3605v7HsS config set cluster-announce-ip 172.18.0.2
kubectl exec -it redisc-shard-hxx-0 -c redis-cluster -- redis-cli -a O3605v7HsS config set cluster-announce-port 30182
kubectl exec -it redisc-shard-hxx-0 -c redis-cluster -- redis-cli -a O3605v7HsS config set cluster-announce-bus-port 31879
redisc-shard-xwz-0
kubectl exec -it redisc-shard-xwz-0 -c redis-cluster -- redis-cli -a O3605v7HsS config set cluster-announce-ip 172.18.0.2
kubectl exec -it redisc-shard-xwz-0 -c redis-cluster -- redis-cli -a O3605v7HsS config set cluster-announce-port 31993
kubectl exec -it redisc-shard-xwz-0 -c redis-cluster -- redis-cli -a O3605v7HsS config set cluster-announce-bus-port 30105
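The nine commands above follow one pattern per pod, so they can be generated. The sketch below is a hypothetical helper (announce_commands is not part of kbcli or KubeBlocks) that builds the same kubectl exec / redis-cli config set invocations from a node IP and the two NodePorts. It assumes the container is named redis-cluster as in this walkthrough, and the password is passed with -a exactly as above.

```python
import shlex
from typing import List


def announce_commands(pod: str, node_ip: str, port: int, bus_port: int,
                      password: str,
                      container: str = "redis-cluster") -> List[List[str]]:
    """Build the three `config set cluster-announce-*` commands for one pod."""
    settings = {
        "cluster-announce-ip": node_ip,
        "cluster-announce-port": str(port),
        "cluster-announce-bus-port": str(bus_port),
    }
    base = ["kubectl", "exec", pod, "-c", container, "--",
            "redis-cli", "-a", password]
    # One argv list per setting, in the same order as the manual steps.
    return [base + ["config", "set", name, value]
            for name, value in settings.items()]


if __name__ == "__main__":
    # Placeholder password; the values mirror master-1 in this article.
    for argv in announce_commands("redisc-shard-5g8-0", "172.18.0.2",
                                  30039, 32461, "PASSWORD"):
        print(shlex.join(argv))
```

Returning argv lists rather than shell strings keeps the helper safe to hand to subprocess.run without shell quoting concerns.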
Create Slot:
kubectl exec -it redisc-shard-5g8-0 -c redis-cluster -- redis-cli -a O3605v7HsS cluster ADDSLOTSRANGE 0 5461
kubectl exec -it redisc-shard-hxx-0 -c redis-cluster -- redis-cli -a O3605v7HsS cluster ADDSLOTSRANGE 5462 10922
kubectl exec -it redisc-shard-xwz-0 -c redis-cluster -- redis-cli -a O3605v7HsS cluster ADDSLOTSRANGE 10923 16383
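The three ranges passed to ADDSLOTSRANGE above are an even split of the 16384 slots, with the first shard taking the one leftover slot. A minimal sketch of that split, using an illustrative function name slot_ranges:

```python
from typing import List, Tuple

TOTAL_SLOTS = 16384


def slot_ranges(shards: int, total: int = TOTAL_SLOTS) -> List[Tuple[int, int]]:
    """Split slots 0..total-1 into contiguous, inclusive ranges, one per shard."""
    if shards <= 0:
        raise ValueError("shards must be positive")
    base, extra = divmod(total, shards)
    ranges = []
    start = 0
    for i in range(shards):
        # The first `extra` shards each take one additional slot.
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


if __name__ == "__main__":
    print(slot_ranges(3))  # [(0, 5461), (5462, 10922), (10923, 16383)]
```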
Cluster Meet:
Log in to one of the master nodes:
slc@slcmac redis % kubectl exec -it redisc-shard-5g8-0 -c redis-cluster -- /bin/bash
root@redisc-shard-5g8-0:/# redis-cli -a O3605v7HsS
127.0.0.1:6379> cluster nodes
ff935854b7626a7e4374598857d5fbe998297799 172.18.0.2:30039@32461 myself,master - 0 0 0 connected 0-5461
Only the node itself is listed, so it still has to meet the other two nodes explicitly:
slc@slcmac redis % kubectl exec -it redisc-shard-5g8-0 -c redis-cluster -- redis-cli -a O3605v7HsS cluster meet 172.18.0.2 30182 31879
OK
slc@slcmac redis % kubectl exec -it redisc-shard-5g8-0 -c redis-cluster -- redis-cli -a O3605v7HsS cluster meet 172.18.0.2 31993 30105
OK
Check the cluster topology again:
slc@slcmac redis % kubectl exec -it redisc-shard-5g8-0 -c redis-cluster -- /bin/bash
root@redisc-shard-5g8-0:/# redis-cli -a O3605v7HsS
127.0.0.1:6379> cluster nodes
ff935854b7626a7e4374598857d5fbe998297799 172.18.0.2:30039@32461 myself,master - 0 1713324462000 0 connected 0-5461
e4d9b914e7ee7c4fd399bdf3dd1c98f7a0a1791b 172.18.0.2:31993@30105 master - 0 1713324462989 2 connected 10923-16383
a54e8fa9474c620154f4c1abc9628116deb3dc7e 172.18.0.2:30182@31879 master - 0 1713324463091 1 connected 5462-10922
At this point a cluster of 3 masters has been established.
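Since the rest of the article keeps comparing cluster nodes output, it helps to read that output as structured data. The sketch below parses the fields visible in the listings above (node ID, ip:port@bus-port, flags, master ID, slots). ClusterNode and parse_cluster_nodes are illustrative names; the fallback of port + 10000 when no bus port is printed is an assumption about older output formats.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ClusterNode:
    node_id: str
    ip: str            # may be empty, e.g. ":6379@16379" before any announce
    port: int
    bus_port: int
    flags: List[str]
    master_id: Optional[str]
    slots: List[str]


def parse_cluster_nodes(text: str) -> List[ClusterNode]:
    """Parse the text returned by `cluster nodes` into ClusterNode records."""
    nodes = []
    for line in text.strip().splitlines():
        fields = line.split()
        if len(fields) < 8:
            continue  # not a node line
        node_id, addr, flags, master = fields[:4]
        endpoint, _, rest = addr.partition("@")
        bus = rest.split(",")[0]  # newer versions may append ",hostname"
        ip, _, port = endpoint.rpartition(":")
        nodes.append(ClusterNode(
            node_id=node_id,
            ip=ip,
            port=int(port),
            bus_port=int(bus) if bus else int(port) + 10000,
            flags=flags.split(","),
            master_id=None if master == "-" else master,
            slots=fields[8:],
        ))
    return nodes
```

For the three-master listing above, this yields three records whose flags contain master and whose slots cover 0-5461, 5462-10922 and 10923-16383.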
4. Join headless slave
We use the pod redisc-shard-5g8-1 as the replica of master redisc-shard-5g8-0.
Check the connections on the replica. They are clean, with no connections to any master yet:
root@redisc-shard-5g8-1:/# netstat -anop | grep redis
tcp 0 0 0.0.0.0:16379 0.0.0.0:* LISTEN 1/redis-server *:63 off (0.00/0/0)
tcp 0 0 0.0.0.0:6379 0.0.0.0:* LISTEN 1/redis-server *:63 off (0.00/0/0)
tcp 0 0 127.0.0.1:6379 127.0.0.1:46948 ESTABLISHED 1/redis-server *:63 keepalive (123.22/0/0)
tcp6 0 0 :::16379 :::* LISTEN 1/redis-server *:63 off (0.00/0/0)
tcp6 0 0 :::6379 :::* LISTEN 1/redis-server *:63 off (0.00/0/0)
The replica's headless address is redisc-shard-5g8-1.redisc-shard-5g8-headless:6379. The complete join command is:
slc@slcmac redis % kubectl exec -it redisc-shard-5g8-1 -c redis-cluster -- /bin/bash
root@redisc-shard-5g8-1:/# redis-cli -a O3605v7HsS --cluster add-node redisc-shard-5g8-1.redisc-shard-5g8-headless:6379 172.18.0.2:30039 --cluster-slave --cluster-master-id ff935854b7626a7e4374598857d5fbe998297799
>>> Adding node redisc-shard-5g8-1.redisc-shard-5g8-headless:6379 to cluster 172.18.0.2:30039
>>> Performing Cluster Check (using node 172.18.0.2:30039)
M: ff935854b7626a7e4374598857d5fbe998297799 172.18.0.2:30039
slots:[0-5461] (5462 slots) master
M: e4d9b914e7ee7c4fd399bdf3dd1c98f7a0a1791b 172.18.0.2:31993
slots:[10923-16383] (5461 slots) master
M: a54e8fa9474c620154f4c1abc9628116deb3dc7e 172.18.0.2:30182
slots:[5462-10922] (5461 slots) master
[OK] All nodes agree about slots configuration.
>>> Check for open slots...
>>> Check slots coverage...
[OK] All 16384 slots covered.
>>> Send CLUSTER MEET to node redisc-shard-5g8-1.redisc-shard-5g8-headless:6379 to make it join the cluster.
Waiting for the cluster to join
>>> Configure node as replica of 172.18.0.2:30039.
[OK] New node added correctly.
172.18.0.2:30039 is the announced ip/port of the master.
Check the connections:
root@redisc-shard-5g8-1:/# netstat -anop | grep redis
tcp 0 0 0.0.0.0:16379 0.0.0.0:* LISTEN 1/redis-server *:63 off (0.00/0/0)
tcp 0 0 0.0.0.0:6379 0.0.0.0:* LISTEN 1/redis-server *:63 off (0.00/0/0)
tcp 0 0 10.42.0.237:48424 172.18.0.2:31879 ESTABLISHED 1/redis-server *:63 off (0.00/0/0) // master-2 announced bus port
tcp 0 0 10.42.0.237:36154 172.18.0.2:32461 ESTABLISHED 1/redis-server *:63 off (0.00/0/0) // master-1 announced bus port
tcp 0 0 10.42.0.237:33504 172.18.0.2:30039 ESTABLISHED 1/redis-server *:63 keepalive (285.22/0/0) // master-1 announced port
tcp 0 0 127.0.0.1:6379 127.0.0.1:46948 ESTABLISHED 1/redis-server *:63 keepalive (279.99/0/0) // local redis-cli
tcp 0 0 10.42.0.237:58576 172.18.0.2:30105 ESTABLISHED 1/redis-server *:63 off (0.00/0/0) // master-3 announced bus port
tcp6 0 0 :::16379 :::* LISTEN 1/redis-server *:63 off (0.00/0/0)
tcp6 0 0 :::6379 :::* LISTEN 1/redis-server *:63 off (0.00/0/0)
The slave has established connections to the announced bus ports of all 3 masters, plus one extra connection to its own master.
The cluster topology seen from the replica is correct:
root@redisc-shard-5g8-1:/# redis-cli -a O3605v7HsS
127.0.0.1:6379> cluster nodes
ff935854b7626a7e4374598857d5fbe998297799 172.18.0.2:30039@32461 master - 0 1713327060494 0 connected 0-5461
3a136cd50eb3f2c0dcc3844a0de63d5e44b462d7 :6379@16379 myself,slave ff935854b7626a7e4374598857d5fbe998297799 0 0 0 connected
e4d9b914e7ee7c4fd399bdf3dd1c98f7a0a1791b 172.18.0.2:31993@30105 master - 0 1713327060696 2 connected 10923-16383
a54e8fa9474c620154f4c1abc9628116deb3dc7e 172.18.0.2:30182@31879 master - 0 1713327060605 1 connected 5462-10922
On the master, however, the newly added replica is missing from the topology:
root@redisc-shard-5g8-0:/# redis-cli -a O3605v7HsS
127.0.0.1:6379> cluster nodes
ff935854b7626a7e4374598857d5fbe998297799 172.18.0.2:30039@32461 myself,master - 0 1713327106000 0 connected 0-5461
e4d9b914e7ee7c4fd399bdf3dd1c98f7a0a1791b 172.18.0.2:31993@30105 master - 0 1713327107004 2 connected 10923-16383
a54e8fa9474c620154f4c1abc9628116deb3dc7e 172.18.0.2:30182@31879 master - 0 1713327107106 1 connected 5462-10922
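The mismatch between the replica's view and the master's view above can be detected mechanically by comparing the node IDs each side reports. A minimal sketch, with illustrative helper names node_ids and missing_from:

```python
from typing import Set


def node_ids(cluster_nodes_output: str) -> Set[str]:
    """Collect the node IDs (first column) from `cluster nodes` output."""
    return {
        line.split()[0]
        for line in cluster_nodes_output.strip().splitlines()
        if line.strip()
    }


def missing_from(view_a: str, view_b: str) -> Set[str]:
    """Return the node IDs that appear in view_a but not in view_b."""
    return node_ids(view_a) - node_ids(view_b)
```

Fed the replica's output as view_a and the master's output as view_b, the result is the replica's own ID, 3a136cd50eb3f2c0dcc3844a0de63d5e44b462d7: the replica knows about the cluster, but the master does not know about the replica.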
During add-node, cluster meet reported success, yet the master never saw the replica. Looking through /data/running.log on the master turns up the following errors:
root@redisc-shard-5g8-0:/data# grep 16379 running.log
1:M 17 Apr 2024 04:05:37.610 - Connection with Node 30e6d55c687bfc08e4a2fcd2ef586ba5458d801f at 10.42.0.1:16379 failed: Connection refused
(repeated 10 times in total)
30e6d55c687bfc08e4a2fcd2ef586ba5458d801f at 10.42.0.1:16379 failed: Connection refused
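So this cluster meet actually failed. Why?
Troubleshooting
1. The mysterious IP
The default Redis Cluster bus port is 16379 = 6379 + 10000, and Redis Cluster falls back to it when no bus port is announced explicitly. So the likely sequence is: after receiving the meet request, the master tries to connect back to the peer on the default bus port (16379) and keeps failing. However, the replica's pod IP (10.42.0.237) is not the IP in the error message (10.42.0.1). Why would the master connect back to a different IP?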
slc@slcmac redis % kg pods -A -o wide | grep redisc-shard-5g8-1
default redisc-shard-5g8-1 3/3 Running 0 72m 10.42.0.237 k3d-k3s-default-server-0
Digging further, 10.42.0.1 turns out to be the address of the cni0 interface in k3d (the Kubernetes distribution used in our development environment):
slc@slcmac redis % docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
8f8958df3298 moby/buildkit:buildx-stable-1 "buildkitd --allow-i…" 6 weeks ago Up 6 weeks buildx_buildkit_project-v3-builder0
f8f349b2faab ghcr.io/k3d-io/k3d-proxy:5.4.6 "/bin/sh -c nginx-pr…" 6 months ago Up 3 months 80/tcp, 0.0.0.0:57830->6443/tcp k3d-k3s-default-serverlb
3e291f02144a rancher/k3s:v1.24.4-k3s1 "/bin/k3d-entrypoint…" 6 months ago Up 3 months k3d-k3s-default-server-0
slc@slcmac redis % docker exec -it 3e291f02144a /bin/sh
/ # ifconfig
cni0 Link encap:Ethernet HWaddr 32:22:34:47:9D:BF
inet addr:10.42.0.1 Bcast:10.42.0.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1450 Metric:1
RX packets:219424018 errors:0 dropped:0 overruns:0 frame:0
TX packets:238722923 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:33805804056 (31.4 GiB) TX bytes:199941577234 (186.2 GiB)
eth0 Link encap:Ethernet HWaddr 02:42:AC:12:00:02
inet addr:172.18.0.2 Bcast:172.18.255.255 Mask:255.255.0.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:74602028 errors:0 dropped:0 overruns:0 frame:0
TX packets:68167266 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:39814942542 (37.0 GiB) TX bytes:17167663962 (15.9 GiB)
slc@slcmac redis % kg node -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
k3d-k3s-default-server-0 Ready control-plane,master 183d v1.24.4+k3s1 172.18.0.2 <none> K3s dev 5.10.104-linuxkit containerd://1.6.6-k3s1
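In other words, 10.42.* is the default pod CIDR range of k3d, and 172.18.0.2 is the address of the only k3d node, which is why every NodePort address seen so far is 172.18.0.2.
2. The elusive link
It turns out that the connections carrying the gossip protocol (local ephemeral port -> peer NodePort, forwarded to the peer's bus port 16379) are NATed on the way to the target. Using tcpdump we pinned down one gossip session. Although the CNI rewrites the addresses, the TCP timestamp options (TS val and ecr) let us reconstruct the session end to end. Below is the gossip link between master-1 and master-2, whose connection is already established:
Connections on master-1 (redisc-shard-5g8-0):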
root@redisc-shard-5g8-0:/data# netstat -anop | grep redis
tcp 0 0 0.0.0.0:6379 0.0.0.0:* LISTEN 1/redis-server *:63 off (0.00/0/0)
tcp 0 0 0.0.0.0:16379 0.0.0.0:* LISTEN 1/redis-server *:63 off (0.00/0/0)
tcp 0 0 127.0.0.1:6379 127.0.0.1:46798 ESTABLISHED 1/redis-server *:63 keepalive (117.47/0/0)
tcp 0 0 10.42.0.236:58412 172.18.0.2:31879 ESTABLISHED 1/redis-server *:63 off (0.00/0/0) // peer is the master-2 NodePort
tcp 0 0 10.42.0.236:6379 10.42.0.1:45255 ESTABLISHED 1/redis-server *:63 keepalive (118.11/0/0)
tcp 0 0 10.42.0.236:36528 172.18.0.2:30105 ESTABLISHED 1/redis-server *:63 off (0.00/0/0)
tcp 0 0 10.42.0.236:16379 10.42.0.1:16471 ESTABLISHED 1/redis-server *:63 keepalive (1.20/0/0)
tcp 0 0 10.42.0.236:16379 10.42.0.1:30788 ESTABLISHED 1/redis-server *:63 keepalive (0.08/0/0)
tcp 0 0 10.42.0.236:16379 10.42.0.1:20521 ESTABLISHED 1/redis-server *:63 keepalive (1.42/0/0)
tcp6 0 0 :::6379 :::* LISTEN 1/redis-server *:63 off (0.00/0/0)
tcp6 0 0 :::16379 :::* LISTEN 1/redis-server *:63 off (0.00/0/0)
Connections on master-2 (redisc-shard-hxx-0):
root@redisc-shard-hxx-0:/# netstat -anop | grep redis
tcp 0 0 0.0.0.0:16379 0.0.0.0:* LISTEN 1/redis-server *:63 off (0.00/0/0)
tcp 0 0 0.0.0.0:6379 0.0.0.0:* LISTEN 1/redis-server *:63 off (0.00/0/0)
tcp 0 0 10.42.0.232:16379 10.42.0.1:24780 ESTABLISHED 1/redis-server *:63 keepalive (0.72/0/0) // master-1's address after NAT
tcp 0 0 10.42.0.232:41974 172.18.0.2:30105 ESTABLISHED 1/redis-server *:63 off (0.00/0/0)
tcp 0 0 10.42.0.232:16379 10.42.0.1:6717 ESTABLISHED 1/redis-server *:63 keepalive (1.34/0/0)
tcp 0 0 10.42.0.232:16379 10.42.0.1:24130 ESTABLISHED 1/redis-server *:63 keepalive (0.33/0/0)
tcp 0 0 10.42.0.232:33306 172.18.0.2:32461 ESTABLISHED 1/redis-server *:63 off (0.00/0/0)
tcp 0 0 127.0.0.1:6379 127.0.0.1:46626 ESTABLISHED 1/redis-server *:63 keepalive (24.56/0/0)
tcp6 0 0 :::16379 :::* LISTEN 1/redis-server *:63 off (0.00/0/0)
tcp6 0 0 :::6379 :::* LISTEN 1/redis-server *:63 off (0.00/0/0)
Mapping between the two connections:
# Capture on master-1 (redisc-shard-5g8-0), filtering on NodePort 31879 (master-2, redisc-shard-hxx-0):
05:40:04.817984 IP redisc-shard-5g8-0.redisc-shard-5g8-headless.default.svc.cluster.local.58412 > k3d-k3s-default-server-0.31879: Flags [P.], seq 6976:9336, ack 7081, win 10027, options [nop,nop,TS val 4191410578 ecr 867568717], length 2360
05:40:04.818428 IP k3d-k3s-default-server-0.31879 > redisc-shard-5g8-0.redisc-shard-5g8-headless.default.svc.cluster.local.58412: Flags [.], ack 9336, win 498, options [nop,nop,TS val 867569232 ecr 4191410578], length 0
05:40:04.819269 IP k3d-k3s-default-server-0.31879 > redisc-shard-5g8-0.redisc-shard-5g8-headless.default.svc.cluster.local.58412: Flags [P.], seq 7081:9441, ack 9336, win 501, options [nop,nop,TS val 867569233 ecr 4191410578], length 2360
05:40:04.819309 IP redisc-shard-5g8-0.redisc-shard-5g8-headless.default.svc.cluster.local.58412 > k3d-k3s-default-server-0.31879: Flags [.], ack 9441, win 10026, options [nop,nop,TS val 4191410580 ecr 867569233], length 0
# Capture on master-2 (redisc-shard-hxx-0), filtering on port 24780 (master-1, redisc-shard-5g8-0, after NAT):
05:40:04.818178 IP 10.42.0.1.24780 > redisc-shard-hxx-0.redisc-shard-hxx-headless.default.svc.cluster.local.16379: Flags [P.], seq 32624:34984, ack 32937, win 10027, options [nop,nop,TS val 4191410578 ecr 867568717], length 2360
05:40:04.818371 IP redisc-shard-hxx-0.redisc-shard-hxx-headless.default.svc.cluster.local.16379 > 10.42.0.1.24780: Flags [.], ack 34984, win 498, options [nop,nop,TS val 867569232 ecr 4191410578], length 0
05:40:04.819239 IP redisc-shard-hxx-0.redisc-shard-hxx-headless.default.svc.cluster.local.16379 > 10.42.0.1.24780: Flags [P.], seq 32937:35297, ack 34984, win 501, options [nop,nop,TS val 867569233 ecr 4191410578], length 2360
05:40:04.819327 IP 10.42.0.1.24780 > redisc-shard-hxx-0.redisc-shard-hxx-headless.default.svc.cluster.local.16379: Flags [.], ack 35297, win 10026, options [nop,nop,TS val 4191410580 ecr 867569233], length 0
As the captures show, every packet sent from a pod to a NodePort arrives at the receiving pod with its source NATed to the cni0 address 10.42.0.1.
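The manual matching of the two captures relies on the fact that NAT rewrites addresses and ports but leaves the TCP timestamp options untouched. The sketch below automates that idea for tcpdump text output in the format shown above: it keys each packet by (TS val, ecr, length) and pairs up the packets seen on both sides. The names parse_capture and correlate are illustrative, and the approach assumes those triples are unique within the capture window, which holds for the excerpt above but is not guaranteed in general.

```python
import re
from typing import Dict, List, Tuple

Flow = Tuple[str, str]        # (source, destination) as printed by tcpdump
Key = Tuple[int, int, int]    # (TS val, TS ecr, payload length)

LINE_RE = re.compile(
    r"IP (?P<src>\S+) > (?P<dst>\S+): Flags \[[^\]]*\].*"
    r"TS val (?P<val>\d+) ecr (?P<ecr>\d+)\], length (?P<length>\d+)"
)


def parse_capture(capture: str) -> Dict[Key, Flow]:
    """Index the packets of one tcpdump text capture by their timestamp key."""
    packets: Dict[Key, Flow] = {}
    for line in capture.splitlines():
        m = LINE_RE.search(line)
        if m:
            key = (int(m["val"]), int(m["ecr"]), int(m["length"]))
            packets[key] = (m["src"], m["dst"])
    return packets


def correlate(side_a: str, side_b: str) -> List[Tuple[Flow, Flow]]:
    """Pair packets observed on both sides of the NAT by their timestamp key."""
    a, b = parse_capture(side_a), parse_capture(side_b)
    return [(a[k], b[k]) for k in sorted(a.keys() & b.keys())]
```

Applied to the two captures above, every packet on the 58412 -> 31879 flow seen by master-1 pairs with a packet on the 10.42.0.1.24780 -> 16379 flow seen by master-2, which is the mapping the article derives by hand.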
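3. The truth comes out
The reason the meet failed is now clear. slave-1, without any announce settings, uses its pod IP (10.42.0.237) to meet master-1. By the time the meet packet reaches the master-1 pod, its source has been NATed to 10.42.0.1. master-1 then connects back to slave-1 using the default bus port 16379 and the source IP taken from the packet (10.42.0.1). Since 10.42.0.1 is not a Redis pod, no redis-server process listens on 16379 there, and the connection is refused.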
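The failure can be summarized as a small model of how the receiving node picks the address to connect back to. This is a deliberately simplified sketch of the behavior described in this article, not Redis source code; Sender and callback_address are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

DEFAULT_PORT = 6379
BUS_PORT_OFFSET = 10000  # default bus port = data port + 10000


@dataclass
class Sender:
    """What the node sending MEET knows and advertises about itself."""
    pod_ip: str
    announce_ip: Optional[str] = None
    announce_bus_port: Optional[int] = None


def callback_address(sender: Sender, observed_source_ip: str) -> Tuple[str, int]:
    """Address the receiver connects back to (simplified model).

    With announce settings the receiver uses them; without, it falls back to
    the source IP it observed on the wire and the default bus port.
    """
    ip = sender.announce_ip or observed_source_ip
    bus_port = sender.announce_bus_port or DEFAULT_PORT + BUS_PORT_OFFSET
    return ip, bus_port


if __name__ == "__main__":
    cni0 = "10.42.0.1"  # source address after the CNI's NAT
    # No announce: the master dials the cni0 address, where nothing listens.
    print(callback_address(Sender("10.42.0.237"), cni0))
    # With announce: the master dials the NodePort of slave-1.
    print(callback_address(Sender("10.42.0.237", "172.18.0.2", 31153), cni0))
```

The first call returns 10.42.0.1:16379, the address from the error log; the second returns the announced NodePort address, which is reachable.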
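Fixing the problem
1. slave-1 announce & remeet
Once the cause is known, the fix is straightforward.
For this failed-meet scenario, have slave-1 announce its ip/port/bus-port and then join again; the master will then connect back using the announced address.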
slc@slcmac redis % kubectl exec -it redisc-shard-5g8-1 -c redis-cluster -- redis-cli -a O3605v7HsS config set cluster-announce-ip 172.18.0.2
slc@slcmac redis % kubectl exec -it redisc-shard-5g8-1 -c redis-cluster -- redis-cli -a O3605v7HsS config set cluster-announce-port 31309
slc@slcmac redis % kubectl exec -it redisc-shard-5g8-1 -c redis-cluster -- redis-cli -a O3605v7HsS config set cluster-announce-bus-port 31153
# Run cluster nodes on redisc-shard-5g8-1: the newly announced address and ports are in use
127.0.0.1:6379> cluster nodes
ff935854b7626a7e4374598857d5fbe998297799 172.18.0.2:30039@32461 master - 0 1713334354116 0 connected 0-5461
# before the announce this was :6379@16379
3a136cd50eb3f2c0dcc3844a0de63d5e44b462d7 172.18.0.2:31309@31153 myself,slave ff935854b7626a7e4374598857d5fbe998297799 0 0 0 connected
e4d9b914e7ee7c4fd399bdf3dd1c98f7a0a1791b 172.18.0.2:31993@30105 master - 0 1713334354325 2 connected 10923-16383
a54e8fa9474c620154f4c1abc9628116deb3dc7e 172.18.0.2:30182@31879 master - 0 1713334354532 1 connected 5462-10922
# Meet master-1 again
127.0.0.1:6379> cluster meet 172.18.0.2 30039 32461
OK
On master-1 we can see the difference before and after the meet:
root@redisc-shard-5g8-0:/data# redis-cli -a O3605v7HsS
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
127.0.0.1:6379> cluster nodes
ff935854b7626a7e4374598857d5fbe998297799 172.18.0.2:30039@32461 myself,master - 0 1713334463000 0 connected 0-5461
e4d9b914e7ee7c4fd399bdf3dd1c98f7a0a1791b 172.18.0.2:31993@30105 master - 0 1713334463613 2 connected 10923-16383
a54e8fa9474c620154f4c1abc9628116deb3dc7e 172.18.0.2:30182@31879 master - 0 1713334463613 1 connected 5462-10922
127.0.0.1:6379> cluster nodes
ff935854b7626a7e4374598857d5fbe998297799 172.18.0.2:30039@32461 myself,master - 0 1713334506000 0 connected 0-5461
3a136cd50eb3f2c0dcc3844a0de63d5e44b462d7 172.18.0.2:31309@31153 slave ff935854b7626a7e4374598857d5fbe998297799 0 1713334506133 0 connected
e4d9b914e7ee7c4fd399bdf3dd1c98f7a0a1791b 172.18.0.2:31993@30105 master - 0 1713334506133 2 connected 10923-16383
a54e8fa9474c620154f4c1abc9628116deb3dc7e 172.18.0.2:30182@31879 master - 0 1713334506233 1 connected 5462-10922
master-1 now has one more gossip connection, to slave-1:
root@redisc-shard-5g8-0:/data# netstat -anop | grep redis
tcp 0 0 0.0.0.0:6379 0.0.0.0:* LISTEN 1/redis-server *:63 off (0.00/0/0)
tcp 0 0 0.0.0.0:16379 0.0.0.0:* LISTEN 1/redis-server *:63 off (0.00/0/0)
tcp 0 0 127.0.0.1:6379 127.0.0.1:46798 ESTABLISHED 1/redis-server *:63 keepalive (22.34/0/0)
tcp 0 0 10.42.0.236:58412 172.18.0.2:31879 ESTABLISHED 1/redis-server *:63 off (0.00/0/0)
tcp 0 0 10.42.0.236:6379 10.42.0.1:45255 ESTABLISHED 1/redis-server *:63 keepalive (22.15/0/0)
tcp 0 0 10.42.0.236:43732 172.18.0.2:31153 ESTABLISHED 1/redis-server *:63 off (0.00/0/0) // to slave-1 nodeport
tcp 0 0 10.42.0.236:36528 172.18.0.2:30105 ESTABLISHED 1/redis-server *:63 off (0.00/0/0)
tcp 0 0 10.42.0.236:16379 10.42.0.1:16471 ESTABLISHED 1/redis-server *:63 keepalive (1.17/0/0)
tcp 0 0 10.42.0.236:16379 10.42.0.1:30788 ESTABLISHED 1/redis-server *:63 keepalive (0.97/0/0)
tcp 0 0 10.42.0.236:16379 10.42.0.1:20521 ESTABLISHED 1/redis-server *:63 keepalive (1.48/0/0)
tcp6 0 0 :::6379 :::* LISTEN 1/redis-server *:63 off (0.00/0/0)
tcp6 0 0 :::16379 :::* LISTEN 1/redis-server *:63 off (0.00/0/0)
slave-1 now has three more gossip connections, from master-1/2/3:
root@redisc-shard-5g8-1:/# netstat -anop | grep redis
tcp 0 0 0.0.0.0:16379 0.0.0.0:* LISTEN 1/redis-server *:63 off (0.00/0/0)
tcp 0 0 0.0.0.0:6379 0.0.0.0:* LISTEN 1/redis-server *:63 off (0.00/0/0)
tcp 0 0 10.42.0.237:48424 172.18.0.2:31879 ESTABLISHED 1/redis-server *:63 off (0.00/0/0)
tcp 0 0 10.42.0.237:16379 10.42.0.1:35577 ESTABLISHED 1/redis-server *:63 keepalive (1.11/0/0) // from NAT master
tcp 0 0 10.42.0.237:36154 172.18.0.2:32461 ESTABLISHED 1/redis-server *:63 off (0.00/0/0)
tcp 0 0 10.42.0.237:16379 10.42.0.1:32078 ESTABLISHED 1/redis-server *:63 keepalive (0.15/0/0) // from NAT master
tcp 0 0 10.42.0.237:33504 172.18.0.2:30039 ESTABLISHED 1/redis-server *:63 keepalive (0.00/0/0)
tcp 0 0 127.0.0.1:6379 127.0.0.1:46948 ESTABLISHED 1/redis-server *:63 keepalive (0.00/0/0)
tcp 0 0 10.42.0.237:58576 172.18.0.2:30105 ESTABLISHED 1/redis-server *:63 off (0.00/0/0)
tcp 0 0 10.42.0.237:16379 10.42.0.1:35265 ESTABLISHED 1/redis-server *:63 keepalive (1.22/0/0) // from NAT master
tcp6 0 0 :::16379 :::* LISTEN 1/redis-server *:63 off (0.00/0/0)
tcp6 0 0 :::6379 :::* LISTEN 1/redis-server *:63 off (0.00/0/0)
These three connections are also masters connecting through slave-1's NodePort; on the pod their source shows up NATed to the cni0 address.
2. slave-2 announce & meet
Announce ip/port/bus-port:
slc@slcmac redis % kubectl exec -it redisc-shard-hxx-1 -c redis-cluster -- redis-cli -a O3605v7HsS config set cluster-announce-ip 172.18.0.2
slc@slcmac redis % kubectl exec -it redisc-shard-hxx-1 -c redis-cluster -- redis-cli -a O3605v7HsS config set cluster-announce-port 30662
slc@slcmac redis % kubectl exec -it redisc-shard-hxx-1 -c redis-cluster -- redis-cli -a O3605v7HsS config set cluster-announce-bus-port 30960
slc@slcmac redis % kubectl exec -it redisc-shard-hxx-1 -c redis-cluster -- /bin/bash
Add-node slave-2 (this step includes the meet):
redis-cli -a O3605v7HsS --cluster add-node 172.18.0.2:30662 172.18.0.2:30182 --cluster-slave --cluster-master-id a54e8fa9474c620154f4c1abc9628116deb3dc7e
Cluster topology seen from slave-2:
127.0.0.1:6379> cluster nodes
3a136cd50eb3f2c0dcc3844a0de63d5e44b462d7 172.18.0.2:31309@31153 slave ff935854b7626a7e4374598857d5fbe998297799 0 1713335442641 0 connected
a54e8fa9474c620154f4c1abc9628116deb3dc7e 172.18.0.2:30182@31879 master - 0 1713335442328 1 connected 5462-10922
e4d9b914e7ee7c4fd399bdf3dd1c98f7a0a1791b 172.18.0.2:31993@30105 master - 0 1713335442328 2 connected 10923-16383
4d497f9b4ff459b8c65f50afa6621e122e1d8470 172.18.0.2:30662@30960 myself,slave a54e8fa9474c620154f4c1abc9628116deb3dc7e 0 1713335442000 1 connected
ff935854b7626a7e4374598857d5fbe998297799 172.18.0.2:30039@32461 master - 0 1713335442641 0 connected 0-5461
Cluster topology seen from master-2:
127.0.0.1:6379> cluster nodes
e4d9b914e7ee7c4fd399bdf3dd1c98f7a0a1791b 172.18.0.2:31993@30105 master - 0 1713335448690 2 connected 10923-16383
ff935854b7626a7e4374598857d5fbe998297799 172.18.0.2:30039@32461 master - 0 1713335448892 0 connected 0-5461
a54e8fa9474c620154f4c1abc9628116deb3dc7e 172.18.0.2:30182@31879 myself,master - 0 1713335448000 1 connected 5462-10922
4d497f9b4ff459b8c65f50afa6621e122e1d8470 172.18.0.2:30662@30960 slave a54e8fa9474c620154f4c1abc9628116deb3dc7e 0 1713335448998 1 connected
3a136cd50eb3f2c0dcc3844a0de63d5e44b462d7 172.18.0.2:31309@31153 slave ff935854b7626a7e4374598857d5fbe998297799 0 1713335448794 0 connected
3. slave-3 announce & meet
Announce first, then add-node:
slc@slcmac redis % kubectl exec -it redisc-shard-xwz-1 -c redis-cluster -- redis-cli -a O3605v7HsS config set cluster-announce-ip 172.18.0.2
slc@slcmac redis % kubectl exec -it redisc-shard-xwz-1 -c redis-cluster -- redis-cli -a O3605v7HsS config set cluster-announce-port 30110
slc@slcmac redis % kubectl exec -it redisc-shard-xwz-1 -c redis-cluster -- redis-cli -a O3605v7HsS config set cluster-announce-bus-port 30971
slc@slcmac redis % kubectl exec -it redisc-shard-xwz-1 -c redis-cluster -- /bin/bash
root@redisc-shard-xwz-1:/# redis-cli -a O3605v7HsS --cluster add-node 172.18.0.2:30110 172.18.0.2:31993 --cluster-slave --cluster-master-id e4d9b914e7ee7c4fd399bdf3dd1c98f7a0a1791b
>>> Adding node 172.18.0.2:30110 to cluster 172.18.0.2:31993
>>> Performing Cluster Check (using node 172.18.0.2:31993)
M: e4d9b914e7ee7c4fd399bdf3dd1c98f7a0a1791b 172.18.0.2:31993
slots:[10923-16383] (5461 slots) master
M: ff935854b7626a7e4374598857d5fbe998297799 172.18.0.2:30039
slots:[0-5461] (5462 slots) master
1 additional replica(s)
S: 3a136cd50eb3f2c0dcc3844a0de63d5e44b462d7 172.18.0.2:31309
slots: (0 slots) slave
replicates ff935854b7626a7e4374598857d5fbe998297799
M: a54e8fa9474c620154f4c1abc9628116deb3dc7e 172.18.0.2:30182
slots:[5462-10922] (5461 slots) master
1 additional replica(s)
S: 4d497f9b4ff459b8c65f50afa6621e122e1d8470 172.18.0.2:30662
slots: (0 slots) slave
replicates a54e8fa9474c620154f4c1abc9628116deb3dc7e
[OK] All nodes agree about slots configuration.
>>> Check for open slots...
>>> Check slots coverage...
[OK] All 16384 slots covered.
>>> Send CLUSTER MEET to node 172.18.0.2:30110 to make it join the cluster.
Waiting for the cluster to join
>>> Configure node as replica of 172.18.0.2:31993.
[OK] New node added correctly.
Cluster topology seen from any master:
127.0.0.1:6379> cluster nodes
e4d9b914e7ee7c4fd399bdf3dd1c98f7a0a1791b 172.18.0.2:31993@30105 master - 0 1713335724101 2 connected 10923-16383
ff935854b7626a7e4374598857d5fbe998297799 172.18.0.2:30039@32461 master - 0 1713335724101 0 connected 0-5461
a54e8fa9474c620154f4c1abc9628116deb3dc7e 172.18.0.2:30182@31879 myself,master - 0 1713335724000 1 connected 5462-10922
4d497f9b4ff459b8c65f50afa6621e122e1d8470 172.18.0.2:30662@30960 slave a54e8fa9474c620154f4c1abc9628116deb3dc7e 0 1713335724404 1 connected
3a136cd50eb3f2c0dcc3844a0de63d5e44b462d7 172.18.0.2:31309@31153 slave ff935854b7626a7e4374598857d5fbe998297799 0 1713335724510 0 connected
161ff6ea42047be45d986ed8ba4505afd07096d9 172.18.0.2:30110@30971 slave e4d9b914e7ee7c4fd399bdf3dd1c98f7a0a1791b 0 1713335724101 2 connected
The cluster is now in its complete form of 3 masters and 3 replicas.
About CNI
1. k3s + Flannel + NodePort/Pod
k3s/k3d uses flannel as the default CNI. As analyzed above, flannel exhibits the NAT mapping problem.
2. k3s + Calico + NodePort
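We also tested k3s + Calico, where Calico builds the pod network over VXLAN. The NAT problem is still present with NodePort. Suppose the NodePort in use is 10.128.0.52:32135: in the inbound direction, traffic arriving at the local port 16379 via the NodePort (10.128.0.52) still has its source rewritten to the address of the vxlan.calico device on that node (192.168.238.0).
Network connections on one of the slaves: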
root@redisc-shard-ffv-1:/# netstat -anop | grep redis
tcp 0 0 0.0.0.0:16379 0.0.0.0:* LISTEN 1/redis-server *:63 off (0.00/0/0)
tcp 0 0 0.0.0.0:6379 0.0.0.0:* LISTEN 1/redis-server *:63 off (0.00/0/0)
tcp 0 0 192.168.32.136:41800 10.128.0.52:32135 ESTABLISHED 1/redis-server *:63 off (0.00/0/0)
tcp 0 0 192.168.32.136:45578 10.128.0.52:31952 ESTABLISHED 1/redis-server *:63 keepalive (277.76/0/0) // to a remote NodePort
tcp 0 0 127.0.0.1:6379 127.0.0.1:45998 ESTABLISHED 1/redis-server *:63 keepalive (185.62/0/0)
tcp 0 0 192.168.32.136:53280 10.128.0.52:32675 ESTABLISHED 1/redis-server *:63 off (0.00/0/0)
tcp 0 0 192.168.32.136:16379 192.168.238.0:8740 ESTABLISHED 1/redis-server *:63 keepalive (8.79/0/0) // from a remote NodePort, after NAT
tcp 0 0 192.168.32.136:16379 192.168.238.0:9617 ESTABLISHED 1/redis-server *:63 keepalive (1.70/0/0)
tcp 0 0 192.168.32.136:34040 10.128.0.52:31454 ESTABLISHED 1/redis-server *:63 off (0.00/0/0)
tcp 0 0 192.168.32.136:16379 192.168.238.0:18110 ESTABLISHED 1/redis-server *:63 keepalive (1.82/0/0)
tcp 0 0 192.168.32.136:39006 10.128.0.52:30390 ESTABLISHED 1/redis-server *:63 off (0.00/0/0)
tcp 0 0 192.168.32.136:16379 192.168.238.0:32651 ESTABLISHED 1/redis-server *:63 keepalive (1.57/0/0)
tcp 0 0 192.168.32.136:54986 10.128.0.52:30459 ESTABLISHED 1/redis-server *:63 off (0.00/0/0)
tcp 0 0 192.168.32.136:16379 192.168.238.0:43310 ESTABLISHED 1/redis-server *:63 keepalive (1.83/0/0)
tcp6 0 0 :::16379 :::* LISTEN 1/redis-server *:63 off (0.00/0/0)
tcp6 0 0 :::6379 :::* LISTEN 1/redis-server *:63 off (0.00/0/0)
Node 10.128.0.52 has two devices:
ens4: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1460
inet 10.128.0.52 netmask 255.255.255.255 broadcast 0.0.0.0
inet6 fe80::4001:aff:fe80:34 prefixlen 64 scopeid 0x20<link>
ether 42:01:0a:80:00:34 txqueuelen 1000 (Ethernet)
RX packets 3228477 bytes 3975395572 (3.9 GB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 3025699 bytes 2382110168 (2.3 GB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
vxlan.calico: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1410
inet 192.168.238.0 netmask 255.255.255.255 broadcast 0.0.0.0
inet6 fe80::64b2:cdff:fe99:7f96 prefixlen 64 scopeid 0x20<link>
ether 66:b2:cd:99:7f:96 txqueuelen 1000 (Ethernet)
RX packets 587707 bytes 714235654 (714.2 MB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 810205 bytes 682665081 (682.6 MB)
TX errors 0 dropped 31 overruns 0 carrier 0 collisions 0
If the Node used for the NodePort is the host where the pod runs, Calico does not NAT the traffic.
slc@cluster-1:~$ kubectl exec -it redisc-shard-ffv-1 -c redis-cluster -- redis-cli -a O3605v7HsS config set cluster-announce-ip 10.128.0.54 // set the announced ip to the IP of the pod's local Node
OK
slc@cluster-1:~$ kubectl exec -it redisc-shard-ffv-1 -c redis-cluster -- /bin/bash
root@redisc-shard-ffv-1:/# netstat -anop | grep redis
tcp 0 0 0.0.0.0:16379 0.0.0.0:* LISTEN 1/redis-server *:63 off (0.00/0/0)
tcp 0 0 0.0.0.0:6379 0.0.0.0:* LISTEN 1/redis-server *:63 off (0.00/0/0)
tcp 0 0 192.168.32.136:16379 10.128.0.54:44757 ESTABLISHED 1/redis-server *:63 keepalive (6.92/0/0)
tcp 0 0 192.168.32.136:41800 10.128.0.52:32135 ESTABLISHED 1/redis-server *:63 off (0.00/0/0)
tcp 0 0 192.168.32.136:16379 10.128.0.54:16772 ESTABLISHED 1/redis-server *:63 keepalive (0.64/0/0)
tcp 0 0 192.168.32.136:45578 10.128.0.52:31952 ESTABLISHED 1/redis-server *:63 keepalive (70.79/0/0)
tcp 0 0 127.0.0.1:6379 127.0.0.1:45998 ESTABLISHED 1/redis-server *:63 keepalive (0.00/0/0)
tcp 0 0 192.168.32.136:53280 10.128.0.52:32675 ESTABLISHED 1/redis-server *:63 off (0.00/0/0)
tcp 0 0 192.168.32.136:16379 10.128.0.54:16440 ESTABLISHED 1/redis-server *:63 keepalive (8.62/0/0)
tcp 0 0 192.168.32.136:34040 10.128.0.52:31454 ESTABLISHED 1/redis-server *:63 off (0.00/0/0)
tcp 0 0 192.168.32.136:16379 10.128.0.54:28655 ESTABLISHED 1/redis-server *:63 keepalive (0.14/0/0)
tcp 0 0 192.168.32.136:39006 10.128.0.52:30390 ESTABLISHED 1/redis-server *:63 off (0.00/0/0)
tcp 0 0 192.168.32.136:54986 10.128.0.52:30459 ESTABLISHED 1/redis-server *:63 off (0.00/0/0)
tcp 0 0 192.168.32.136:16379 10.128.0.54:29959 ESTABLISHED 1/redis-server *:63 keepalive (8.62/0/0)
tcp6 0 0 :::16379 :::* LISTEN 1/redis-server *:63 off (0.00/0/0)
tcp6 0 0 :::6379 :::* LISTEN 1/redis-server *:63 off (0.00/0/0)
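So in the Calico VXLAN setup, whether NodePort traffic is SNATed depends on the source Node: traffic entering through the pod's local Node is not SNATed, while traffic entering through a remote Node is. Because we announce explicitly, Redis Cluster meet still works either way.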
3. k3s + Calico + Pod
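With pod IPs only, Redis Cluster meets normally and the cluster topology is correct.
Summary
In some Kubernetes environments, depending on the CNI implementation, traffic from pods and NodePorts may go through NAT. Other members of the cluster cannot connect back to the NATed ip and port, so the meet fails.
Because of this, creating a Redis Cluster on Kubernetes requires one of the following: use host network; use NodePort and explicitly announce ip/port/bus-port; or, for a pure pod network without explicit announce, make sure no NAT happens, which depends on the CNI implementation.
Redis Cluster shares one set of IP addresses between internal and external communication. Once an announce ip is set, it overrides the pod ip for all subsequent communication, so the internal gossip traffic also goes through the announce network. This is unnecessary overhead, so our suggestion for the future is to separate the internal protocol links from the data links used by external applications.
Yet even if pod ip and announce ip are used separately, with internal traffic on the pod network and client-facing data links on the announce network, the CNI NAT problem remains: because Redis Cluster connects back to its peers, a NATed address cannot be connected back to directly. Solving this requires extending the Redis Cluster communication protocol. Ideally: 1) internal communication uses the pod network, needs connect-back, and carries the original pod ip as the source ip, so that the source ip can be recovered even after NAT; 2) external communication uses the announce network, which can be NodePort or LoadBalancer, needs no connect-back, and is unaffected by NAT. Internal communication can of course also go through NodePort or LoadBalancer, provided it likewise carries the original source ip (the announce ip is effectively a kind of source ip). This is the solution KubeBlocks currently uses.
NodePort introduces another problem: when a Node goes down, the announce ip of the affected cluster nodes has to be updated. This is not easy to implement and requires close cooperation between the operator and the HA components.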