
Redis Cluster on K8s: The Big Reveal

Author: 小猿姐
  • 2024-05-07

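We previously discussed containerizing Redis in 《Redis 容器化,是不是个“软柿子”》 (roughly: "Is containerizing Redis a soft persimmon, i.e. a pushover?"). There is no shortage of practical write-ups in the industry, and KubeBlocks has also adapted Redis Cluster and offers a corresponding solution. What problems did KubeBlocks run into while containerizing Redis, and how were they solved? In this article we give that "persimmon" a squeeze together.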


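Background

Redis Cluster is the distributed solution for the Redis database. It spreads data across multiple nodes to provide high availability and scalability: large datasets are sharded across nodes, and the cluster keeps track of which node owns which shard and supports migrating shards between nodes.


Redis Cluster manages data placement with hash slots. The key space is divided into a fixed number of hash slots (16384), and each slot can be assigned to a different node. Each node serves the data in the slots it owns. Clients connect directly to any node, with no intermediate proxy required.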
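
To make the slot mapping concrete, here is a minimal sketch of how a key is mapped to one of the 16384 slots. It assumes the rule from the Redis Cluster specification (CRC16 of the key, modulo 16384, with `{hash tag}` support); the function names `hash_tag` and `key_slot` are ours and only serve the illustration.

```python
import binascii

HASH_SLOTS = 16384  # fixed number of hash slots in Redis Cluster


def hash_tag(key: bytes) -> bytes:
    """Return the part of the key that is hashed, honoring {hash tags}."""
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        # Only a non-empty tag between the first '{' and the next '}' counts.
        if end != -1 and end != start + 1:
            return key[start + 1:end]
    return key


def key_slot(key: str) -> int:
    """Map a key to its hash slot: CRC16(key) mod 16384."""
    data = hash_tag(key.encode("utf-8"))
    # binascii.crc_hqx with an initial value of 0 is the CRC16-CCITT (XMODEM)
    # variant that the Redis Cluster specification names.
    return binascii.crc_hqx(data, 0) % HASH_SLOTS


# "foo" lands in slot 12182, the value CLUSTER KEYSLOT foo reports.
for k in ("foo", "user:{42}:name", "user:{42}:email"):
    print(k, key_slot(k))
```

Keys that share a hash tag (here `{42}`) land in the same slot, which is what makes multi-key operations possible in a cluster.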


In an application deployment, the overall architecture generally consists of the Redis Cluster on the backend and a smart client on the application side.
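
As a rough illustration of what "smart" means here: when a node does not own the slot of a requested key, it answers with a `MOVED` (or `ASK`) redirection error, and the client is expected to retry against the node named in it. The sketch below only parses such a reply; the `Redirect` type and `parse_redirect` function are hypothetical names, not part of any Redis client library.

```python
from typing import NamedTuple, Optional


class Redirect(NamedTuple):
    kind: str  # "MOVED" or "ASK"
    slot: int  # hash slot the key belongs to
    host: str  # node that should be contacted instead
    port: int


def parse_redirect(error: str) -> Optional[Redirect]:
    """Parse a redirection error such as 'MOVED 3999 127.0.0.1:6381'."""
    parts = error.lstrip("-").split()
    if len(parts) != 3 or parts[0] not in ("MOVED", "ASK"):
        return None  # some other error, not a redirection
    host, _, port = parts[2].rpartition(":")
    return Redirect(parts[0], int(parts[1]), host, int(port))


print(parse_redirect("MOVED 3999 127.0.0.1:6381"))
```

Note that the address in the redirection is whatever the target node advertises to the cluster, which is why the announced ip/port discussed later in this article matters to clients as well.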


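Redis Cluster provides the following features:


  1. Automatic sharding and data migration: keys are mapped to slots automatically, and when nodes join or leave the cluster, slots (together with their data) can be migrated online to the right nodes to keep the data evenly distributed. The migration is driven by a resharding or rebalancing operation rather than happening spontaneously.

  2. High availability: Redis Cluster uses master-replica replication, and each master can have several replicas. When a master fails, one of its replicas can take over automatically.

  3. Load balancing: because the slots are spread across the masters, requests are spread across the nodes too. A client can connect to any node; a node that does not own the requested slot does not proxy the request but replies with a MOVED or ASK redirection, which the smart client follows to the right node.


By distributing data across multiple nodes and providing automatic failover together with this request distribution, Redis Cluster lets applications handle large datasets and highly concurrent access. It is a powerful distributed solution, commonly used where high performance and scalability are needed, such as caching, session storage, and real-time counting.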

Reproducing the problem

Many KubeBlocks customers have a strong need for Redis Cluster, so we adapted Redis Cluster on top of KubeBlocks. During that work we found that Redis Cluster has compatibility problems with certain networking behaviors in k8s container environments.


The steps to reproduce are as follows:

1. Install KubeBlocks 0.9.0

    slc@slcmac kbcli % ./bin/kbcli kubeblocks list-versions --devel
    VERSION         RELEASE-NOTES
    0.9.0-beta.8    https://github.com/apecloud/kubeblocks/releases/tag/v0.9.0-beta.8
    0.9.0-beta.7    https://github.com/apecloud/kubeblocks/releases/tag/v0.9.0-beta.7
    slc@slcmac kbcli % kbcli kubeblocks install --version="0.9.0-beta.8"

2. Install the redis-cluster addon

The redis addon is installed by default, but because of the network adaptation issue described in this article, the default addon's support for Redis Cluster is still problematic.


    # First disable the default addon
    slc@slcmac addons % kbcli addon disable redis
    # Install the latest addon from the branch
    slc@slcmac addons % git clone git@github.com:apecloud/kubeblocks-addons.git
    slc@slcmac addons % cd kubeblocks-addons/addons/redis
    slc@slcmac addons % helm dependency build && cd ..
    slc@slcmac addons % helm install redis ./redis
    slc@slcmac addons % helm list
    NAME          NAMESPACE        REVISION        UPDATED                                     STATUS          CHART                      APP VERSION
    redis         default          1               2024-04-15 21:29:37.953119 +0800 CST        deployed        redis-0.9.0                7.0.6


To make the problem easier to reproduce, we slightly modified some of the addon's configuration and steps before running helm install redis.


3. Create the Redis Cluster

The instance is created in NodePort mode, with 3 masters and 3 replicas. (kg in the transcripts appears to be a shell alias for kubectl get.)


    slc@slcmac addons % helm install redisc ./redis-cluster --set mode=cluster --set nodePortEnabled=true --set redisCluster.shardCount=3
    slc@slcmac addons % kg pods | grep -v job
    NAME                                           READY   STATUS    RESTARTS   AGE
    redisc-shard-hxx-1                             3/3     Running   0          14m
    redisc-shard-hxx-0                             3/3     Running   0          14m
    redisc-shard-xwz-0                             3/3     Running   0          14m
    redisc-shard-xwz-1                             3/3     Running   0          14m
    redisc-shard-5g8-0                             3/3     Running   0          14m
    redisc-shard-5g8-1                             3/3     Running   0          14m


All pods of the 3 master/replica pairs are created successfully, but the relationships between the cluster nodes have not been established yet.


Announce ip/port/bus-port:


    # redisc-shard-5g8-0
    kubectl exec -it redisc-shard-5g8-0 -c redis-cluster -- redis-cli -a O3605v7HsS config set cluster-announce-ip 172.18.0.2
    kubectl exec -it redisc-shard-5g8-0 -c redis-cluster -- redis-cli -a O3605v7HsS config set cluster-announce-port 30039
    kubectl exec -it redisc-shard-5g8-0 -c redis-cluster -- redis-cli -a O3605v7HsS config set cluster-announce-bus-port 32461
    # redisc-shard-hxx-0
    kubectl exec -it redisc-shard-hxx-0 -c redis-cluster -- redis-cli -a O3605v7HsS config set cluster-announce-ip 172.18.0.2
    kubectl exec -it redisc-shard-hxx-0 -c redis-cluster -- redis-cli -a O3605v7HsS config set cluster-announce-port 30182
    kubectl exec -it redisc-shard-hxx-0 -c redis-cluster -- redis-cli -a O3605v7HsS config set cluster-announce-bus-port 31879
    # redisc-shard-xwz-0
    kubectl exec -it redisc-shard-xwz-0 -c redis-cluster -- redis-cli -a O3605v7HsS config set cluster-announce-ip 172.18.0.2
    kubectl exec -it redisc-shard-xwz-0 -c redis-cluster -- redis-cli -a O3605v7HsS config set cluster-announce-port 31993
    kubectl exec -it redisc-shard-xwz-0 -c redis-cluster -- redis-cli -a O3605v7HsS config set cluster-announce-bus-port 30105


Create Slot:


    kubectl exec -it redisc-shard-5g8-0 -c redis-cluster -- redis-cli -a O3605v7HsS cluster ADDSLOTSRANGE 0 5461
    kubectl exec -it redisc-shard-hxx-0 -c redis-cluster -- redis-cli -a O3605v7HsS cluster ADDSLOTSRANGE 5462 10922
    kubectl exec -it redisc-shard-xwz-0 -c redis-cluster -- redis-cli -a O3605v7HsS cluster ADDSLOTSRANGE 10923 16383
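
The three ranges passed to ADDSLOTSRANGE come from dividing the 16384 slots as evenly as possible across the shards. A small sketch of that arithmetic follows; `split_slots` is a hypothetical helper name, and giving the remainder to the first shards is an assumption that happens to match the ranges used here.

```python
from typing import List, Tuple


def split_slots(shards: int, total: int = 16384) -> List[Tuple[int, int]]:
    """Split slots 0..total-1 into contiguous, inclusive ranges, one per shard."""
    base, extra = divmod(total, shards)
    ranges = []
    start = 0
    for i in range(shards):
        # The first `extra` shards each take one additional slot.
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


for first, last in split_slots(3):
    print(f"cluster ADDSLOTSRANGE {first} {last}")
```

For 3 shards this yields 0-5461, 5462-10922 and 10923-16383, i.e. 5462, 5461 and 5461 slots.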


Cluster Meet:


    # Log in to one of the master nodes
    slc@slcmac redis % kubectl exec -it redisc-shard-5g8-0 -c redis-cluster -- /bin/bash
    root@redisc-shard-5g8-0:/# redis-cli -a O3605v7HsS
    127.0.0.1:6379> cluster nodes
    ff935854b7626a7e4374598857d5fbe998297799 172.18.0.2:30039@32461 myself,master - 0 0 0 connected 0-5461
    # Only the node itself is listed, so we still have to meet the other two nodes explicitly
    slc@slcmac redis %  kubectl exec -it redisc-shard-5g8-0 -c redis-cluster -- redis-cli -a O3605v7HsS cluster meet 172.18.0.2 30182 31879
    OK
    slc@slcmac redis %  kubectl exec -it redisc-shard-5g8-0 -c redis-cluster -- redis-cli -a O3605v7HsS cluster meet 172.18.0.2 31993 30105
    OK
    # Check the cluster topology again
    slc@slcmac redis % kubectl exec -it redisc-shard-5g8-0 -c redis-cluster -- /bin/bash
    root@redisc-shard-5g8-0:/# redis-cli -a O3605v7HsS
    127.0.0.1:6379> cluster nodes
    ff935854b7626a7e4374598857d5fbe998297799 172.18.0.2:30039@32461 myself,master - 0 1713324462000 0 connected 0-5461
    e4d9b914e7ee7c4fd399bdf3dd1c98f7a0a1791b 172.18.0.2:31993@30105 master - 0 1713324462989 2 connected 10923-16383
    a54e8fa9474c620154f4c1abc9628116deb3dc7e 172.18.0.2:30182@31879 master - 0 1713324463091 1 connected 5462-10922


At this point a cluster of 3 master nodes has been established successfully.

4. Join a headless slave

We use the pod redisc-shard-5g8-1 as the replica of master redisc-shard-5g8-0.


The connections on the replica are clean, with no connections to any other master:


    # Connections on the replica:
    root@redisc-shard-5g8-1:/# netstat -anop | grep redis
    tcp        0      0 0.0.0.0:16379           0.0.0.0:*               LISTEN      1/redis-server *:63  off (0.00/0/0)
    tcp        0      0 0.0.0.0:6379            0.0.0.0:*               LISTEN      1/redis-server *:63  off (0.00/0/0)
    tcp        0      0 127.0.0.1:6379          127.0.0.1:46948         ESTABLISHED 1/redis-server *:63  keepalive (123.22/0/0)
    tcp6       0      0 :::16379                :::*                    LISTEN      1/redis-server *:63  off (0.00/0/0)
    tcp6       0      0 :::6379                 :::*                    LISTEN      1/redis-server *:63  off (0.00/0/0)


The replica's headless address is redisc-shard-5g8-1.redisc-shard-5g8-headless:6379. The complete join command is:


    slc@slcmac redis % kubectl exec -it redisc-shard-5g8-1 -c redis-cluster -- /bin/bash
    root@redisc-shard-5g8-1:/# redis-cli -a O3605v7HsS --cluster add-node redisc-shard-5g8-1.redisc-shard-5g8-headless:6379 172.18.0.2:30039 --cluster-slave --cluster-master-id ff935854b7626a7e4374598857d5fbe998297799
    >>> Adding node redisc-shard-5g8-1.redisc-shard-5g8-headless:6379 to cluster 172.18.0.2:30039
    >>> Performing Cluster Check (using node 172.18.0.2:30039)
    M: ff935854b7626a7e4374598857d5fbe998297799 172.18.0.2:30039
       slots:[0-5461] (5462 slots) master
    M: e4d9b914e7ee7c4fd399bdf3dd1c98f7a0a1791b 172.18.0.2:31993
       slots:[10923-16383] (5461 slots) master
    M: a54e8fa9474c620154f4c1abc9628116deb3dc7e 172.18.0.2:30182
       slots:[5462-10922] (5461 slots) master
    [OK] All nodes agree about slots configuration.
    >>> Check for open slots...
    >>> Check slots coverage...
    [OK] All 16384 slots covered.
    >>> Send CLUSTER MEET to node redisc-shard-5g8-1.redisc-shard-5g8-headless:6379 to make it join the cluster.
    Waiting for the cluster to join

    >>> Configure node as replica of 172.18.0.2:30039.
    [OK] New node added correctly.


172.18.0.2:30039 is the announced ip/port of the master node.


Check the connections:


    root@redisc-shard-5g8-1:/# netstat -anop | grep redis
    tcp        0      0 0.0.0.0:16379           0.0.0.0:*               LISTEN      1/redis-server *:63  off (0.00/0/0)
    tcp        0      0 0.0.0.0:6379            0.0.0.0:*               LISTEN      1/redis-server *:63  off (0.00/0/0)
    tcp        0      0 10.42.0.237:48424       172.18.0.2:31879        ESTABLISHED 1/redis-server *:63  off (0.00/0/0) // master-2 announced bus port
    tcp        0      0 10.42.0.237:36154       172.18.0.2:32461        ESTABLISHED 1/redis-server *:63  off (0.00/0/0) // master-1 announced bus port
    tcp        0      0 10.42.0.237:33504       172.18.0.2:30039        ESTABLISHED 1/redis-server *:63  keepalive (285.22/0/0) // master-1 announced port
    tcp        0      0 127.0.0.1:6379          127.0.0.1:46948         ESTABLISHED 1/redis-server *:63  keepalive (279.99/0/0) // local redis-cli
    tcp        0      0 10.42.0.237:58576       172.18.0.2:30105        ESTABLISHED 1/redis-server *:63  off (0.00/0/0) // master-3 announced bus port
    tcp6       0      0 :::16379                :::*                    LISTEN      1/redis-server *:63  off (0.00/0/0)
    tcp6       0      0 :::6379                 :::*                    LISTEN      1/redis-server *:63  off (0.00/0/0)


The slave has established connections to the announced bus ports of all 3 masters, plus one additional connection to its own master on that master's announced data port.


On the replica, the cluster topology looks correct:


    root@redisc-shard-5g8-1:/# redis-cli -a O3605v7HsS
    127.0.0.1:6379> cluster nodes
    ff935854b7626a7e4374598857d5fbe998297799 172.18.0.2:30039@32461 master - 0 1713327060494 0 connected 0-5461
    3a136cd50eb3f2c0dcc3844a0de63d5e44b462d7 :6379@16379 myself,slave ff935854b7626a7e4374598857d5fbe998297799 0 0 0 connected
    e4d9b914e7ee7c4fd399bdf3dd1c98f7a0a1791b 172.18.0.2:31993@30105 master - 0 1713327060696 2 connected 10923-16383
    a54e8fa9474c620154f4c1abc9628116deb3dc7e 172.18.0.2:30182@31879 master - 0 1713327060605 1 connected 5462-10922


On the master, the newly added replica is missing from the topology:


    root@redisc-shard-5g8-0:/# redis-cli -a O3605v7HsS
    127.0.0.1:6379> cluster nodes
    ff935854b7626a7e4374598857d5fbe998297799 172.18.0.2:30039@32461 myself,master - 0 1713327106000 0 connected 0-5461
    e4d9b914e7ee7c4fd399bdf3dd1c98f7a0a1791b 172.18.0.2:31993@30105 master - 0 1713327107004 2 connected 10923-16383
    a54e8fa9474c620154f4c1abc9628116deb3dc7e 172.18.0.2:30182@31879 master - 0 1713327107106 1 connected 5462-10922
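
Comparing two `cluster nodes` outputs by eye gets tedious as the cluster grows. The following sketch parses the two views captured above and reports which node IDs one view knows about and the other does not. The `Node` type and both function names are hypothetical, and the parser only covers the leading fields of each line (ID, `ip:port@bus-port`, flags, master ID).

```python
from typing import Dict, List, NamedTuple, Tuple


class Node(NamedTuple):
    node_id: str
    host: str  # empty when the node has not learned or announced its own ip
    port: int
    bus_port: int
    flags: Tuple[str, ...]
    master_id: str  # "-" for masters


def parse_cluster_nodes(text: str) -> Dict[str, Node]:
    """Parse CLUSTER NODES output into a dict keyed by node ID."""
    nodes = {}
    for line in text.strip().splitlines():
        fields = line.split()
        if len(fields) < 8:
            continue  # not a node line
        node_id, addr, flags, master_id = fields[:4]
        endpoint, _, rest = addr.partition("@")
        bus_port = rest.split(",")[0]  # newer versions may append ",hostname"
        host, _, port = endpoint.rpartition(":")
        nodes[node_id] = Node(node_id, host, int(port), int(bus_port),
                              tuple(flags.split(",")), master_id)
    return nodes


def missing_nodes(view_a: Dict[str, Node], view_b: Dict[str, Node]) -> List[str]:
    """Node IDs present in view_a but unknown to view_b."""
    return sorted(set(view_a) - set(view_b))


REPLICA_VIEW = """
ff935854b7626a7e4374598857d5fbe998297799 172.18.0.2:30039@32461 master - 0 1713327060494 0 connected 0-5461
3a136cd50eb3f2c0dcc3844a0de63d5e44b462d7 :6379@16379 myself,slave ff935854b7626a7e4374598857d5fbe998297799 0 0 0 connected
e4d9b914e7ee7c4fd399bdf3dd1c98f7a0a1791b 172.18.0.2:31993@30105 master - 0 1713327060696 2 connected 10923-16383
a54e8fa9474c620154f4c1abc9628116deb3dc7e 172.18.0.2:30182@31879 master - 0 1713327060605 1 connected 5462-10922
"""

MASTER_VIEW = """
ff935854b7626a7e4374598857d5fbe998297799 172.18.0.2:30039@32461 myself,master - 0 1713327106000 0 connected 0-5461
e4d9b914e7ee7c4fd399bdf3dd1c98f7a0a1791b 172.18.0.2:31993@30105 master - 0 1713327107004 2 connected 10923-16383
a54e8fa9474c620154f4c1abc9628116deb3dc7e 172.18.0.2:30182@31879 master - 0 1713327107106 1 connected 5462-10922
"""

replica = parse_cluster_nodes(REPLICA_VIEW)
master = parse_cluster_nodes(MASTER_VIEW)
print("unknown to the master:", missing_nodes(replica, master))
```

The parsed replica entry also makes the root of the problem visible: its own address is `:6379@16379`, i.e. no announced ip and the default ports.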


During the add-node run above, cluster meet was reported as successful, yet the master never saw the replica. Looking through /data/running.log, we found the following errors:


    root@redisc-shard-5g8-0:/data# grep 16379 running.log
    1:M 17 Apr 2024 04:05:37.610 - Connection with Node 30e6d55c687bfc08e4a2fcd2ef586ba5458d801f at 10.42.0.1:16379 failed: Connection refused
    (the same message is repeated 10 times in total)


So this cluster meet actually failed. Why?

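Troubleshooting

1. The mysterious ip

Redis Cluster's default bus port is 16379 = 6379 + 10000. If no bus port is explicitly announced, Redis Cluster uses that port. So the problem appears to be that, after receiving the meet request, the master tried to call back the peer on its default bus port (16379) but could never connect. However, the replica's pod ip (10.42.0.237) is not the ip shown in the error message (10.42.0.1). Why would the master call back a different ip?


    slc@slcmac redis %  kg pods -A -o wide | grep redisc-shard-5g8-1
    default       redisc-shard-5g8-1                             3/3     Running     0              72m    10.42.0.237   k3d-k3s-default-server-0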


Digging further, we found that 10.42.0.1 is the address of the cni0 interface in k3d (the k8s distribution used in our development environment):


    slc@slcmac redis % docker ps
    CONTAINER ID   IMAGE                            COMMAND                  CREATED        STATUS        PORTS                             NAMES
    8f8958df3298   moby/buildkit:buildx-stable-1    "buildkitd --allow-i…"   6 weeks ago    Up 6 weeks                                      buildx_buildkit_project-v3-builder0
    f8f349b2faab   ghcr.io/k3d-io/k3d-proxy:5.4.6   "/bin/sh -c nginx-pr…"   6 months ago   Up 3 months   80/tcp, 0.0.0.0:57830->6443/tcp   k3d-k3s-default-serverlb
    3e291f02144a   rancher/k3s:v1.24.4-k3s1         "/bin/k3d-entrypoint…"   6 months ago   Up 3 months                                     k3d-k3s-default-server-0
    slc@slcmac redis % docker exec -it 3e291f02144a /bin/sh
    / # ifconfig
    cni0      Link encap:Ethernet  HWaddr 32:22:34:47:9D:BF
              inet addr:10.42.0.1  Bcast:10.42.0.255  Mask:255.255.255.0
              UP BROADCAST RUNNING MULTICAST  MTU:1450  Metric:1
              RX packets:219424018 errors:0 dropped:0 overruns:0 frame:0
              TX packets:238722923 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:1000
              RX bytes:33805804056 (31.4 GiB)  TX bytes:199941577234 (186.2 GiB)

    eth0      Link encap:Ethernet  HWaddr 02:42:AC:12:00:02
              inet addr:172.18.0.2  Bcast:172.18.255.255  Mask:255.255.0.0
              UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
              RX packets:74602028 errors:0 dropped:0 overruns:0 frame:0
              TX packets:68167266 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:0
              RX bytes:39814942542 (37.0 GiB)  TX bytes:17167663962 (15.9 GiB)
    slc@slcmac redis % kg node -o wide
    NAME                       STATUS   ROLES                  AGE    VERSION        INTERNAL-IP   EXTERNAL-IP   OS-IMAGE   KERNEL-VERSION      CONTAINER-RUNTIME
    k3d-k3s-default-server-0   Ready    control-plane,master   183d   v1.24.4+k3s1   172.18.0.2    <none>        K3s dev    5.10.104-linuxkit   containerd://1.6.6-k3s1


In other words, 10.42.* is k3d's default pod CIDR, and 172.18.0.2 is the address of k3d's only node (which is why every NodePort address we see is 172.18.0.2).

2. The elusive link

It turns out that the connection carrying the gossip protocol (from the local node to the peer's NodePort) undergoes NAT before it reaches the target. Using tcpdump, we traced one gossip session. Although the CNI applied NAT to it, the TS val and ecr values (TCP timestamp options) let us reconstruct it end to end. Below we reconstruct the gossip link between master-1 and master-2, whose connection is already established:


Connections on master-1, redisc-shard-5g8-0:


    root@redisc-shard-5g8-0:/data# netstat -anop | grep redis
    tcp        0      0 0.0.0.0:6379            0.0.0.0:*               LISTEN      1/redis-server *:63  off (0.00/0/0)
    tcp        0      0 0.0.0.0:16379           0.0.0.0:*               LISTEN      1/redis-server *:63  off (0.00/0/0)
    tcp        0      0 127.0.0.1:6379          127.0.0.1:46798         ESTABLISHED 1/redis-server *:63  keepalive (117.47/0/0)
    tcp        0      0 10.42.0.236:58412       172.18.0.2:31879        ESTABLISHED 1/redis-server *:63  off (0.00/0/0) // peer is master-2's NodePort
    tcp        0      0 10.42.0.236:6379        10.42.0.1:45255         ESTABLISHED 1/redis-server *:63  keepalive (118.11/0/0)
    tcp        0      0 10.42.0.236:36528       172.18.0.2:30105        ESTABLISHED 1/redis-server *:63  off (0.00/0/0)
    tcp        0      0 10.42.0.236:16379       10.42.0.1:16471         ESTABLISHED 1/redis-server *:63  keepalive (1.20/0/0)
    tcp        0      0 10.42.0.236:16379       10.42.0.1:30788         ESTABLISHED 1/redis-server *:63  keepalive (0.08/0/0)
    tcp        0      0 10.42.0.236:16379       10.42.0.1:20521         ESTABLISHED 1/redis-server *:63  keepalive (1.42/0/0)
    tcp6       0      0 :::6379                 :::*                    LISTEN      1/redis-server *:63  off (0.00/0/0)
    tcp6       0      0 :::16379                :::*                    LISTEN      1/redis-server *:63  off (0.00/0/0)


Connections on master-2, redisc-shard-hxx-0:


    root@redisc-shard-hxx-0:/# netstat -anop | grep redis
    tcp        0      0 0.0.0.0:16379           0.0.0.0:*               LISTEN      1/redis-server *:63  off (0.00/0/0)
    tcp        0      0 0.0.0.0:6379            0.0.0.0:*               LISTEN      1/redis-server *:63  off (0.00/0/0)
    tcp        0      0 10.42.0.232:16379       10.42.0.1:24780         ESTABLISHED 1/redis-server *:63  keepalive (0.72/0/0) // master-1's address after NAT
    tcp        0      0 10.42.0.232:41974       172.18.0.2:30105        ESTABLISHED 1/redis-server *:63  off (0.00/0/0)
    tcp        0      0 10.42.0.232:16379       10.42.0.1:6717          ESTABLISHED 1/redis-server *:63  keepalive (1.34/0/0)
    tcp        0      0 10.42.0.232:16379       10.42.0.1:24130         ESTABLISHED 1/redis-server *:63  keepalive (0.33/0/0)
    tcp        0      0 10.42.0.232:33306       172.18.0.2:32461        ESTABLISHED 1/redis-server *:63  off (0.00/0/0)
    tcp        0      0 127.0.0.1:6379          127.0.0.1:46626         ESTABLISHED 1/redis-server *:63  keepalive (24.56/0/0)
    tcp6       0      0 :::16379                :::*                    LISTEN      1/redis-server *:63  off (0.00/0/0)
    tcp6       0      0 :::6379                 :::*                    LISTEN      1/redis-server *:63  off (0.00/0/0)


Mapping between the two connections:


    # On master-1 (redisc-shard-5g8-0), capture on NodePort 31879 (master-2, redisc-shard-hxx-0):
    05:40:04.817984 IP redisc-shard-5g8-0.redisc-shard-5g8-headless.default.svc.cluster.local.58412 > k3d-k3s-default-server-0.31879: Flags [P.], seq 6976:9336, ack 7081, win 10027, options [nop,nop,TS val 4191410578 ecr 867568717], length 2360
    05:40:04.818428 IP k3d-k3s-default-server-0.31879 > redisc-shard-5g8-0.redisc-shard-5g8-headless.default.svc.cluster.local.58412: Flags [.], ack 9336, win 498, options [nop,nop,TS val 867569232 ecr 4191410578], length 0
    05:40:04.819269 IP k3d-k3s-default-server-0.31879 > redisc-shard-5g8-0.redisc-shard-5g8-headless.default.svc.cluster.local.58412: Flags [P.], seq 7081:9441, ack 9336, win 501, options [nop,nop,TS val 867569233 ecr 4191410578], length 2360
    05:40:04.819309 IP redisc-shard-5g8-0.redisc-shard-5g8-headless.default.svc.cluster.local.58412 > k3d-k3s-default-server-0.31879: Flags [.], ack 9441, win 10026, options [nop,nop,TS val 4191410580 ecr 867569233], length 0

    # On master-2 (redisc-shard-hxx-0), capture on local port 24780 (master-1, redisc-shard-5g8-0):
    05:40:04.818178 IP 10.42.0.1.24780 > redisc-shard-hxx-0.redisc-shard-hxx-headless.default.svc.cluster.local.16379: Flags [P.], seq 32624:34984, ack 32937, win 10027, options [nop,nop,TS val 4191410578 ecr 867568717], length 2360
    05:40:04.818371 IP redisc-shard-hxx-0.redisc-shard-hxx-headless.default.svc.cluster.local.16379 > 10.42.0.1.24780: Flags [.], ack 34984, win 498, options [nop,nop,TS val 867569232 ecr 4191410578], length 0
    05:40:04.819239 IP redisc-shard-hxx-0.redisc-shard-hxx-headless.default.svc.cluster.local.16379 > 10.42.0.1.24780: Flags [P.], seq 32937:35297, ack 34984, win 501, options [nop,nop,TS val 867569233 ecr 4191410578], length 2360
    05:40:04.819327 IP 10.42.0.1.24780 > redisc-shard-hxx-0.redisc-shard-hxx-headless.default.svc.cluster.local.16379: Flags [.], ack 35297, win 10026, options [nop,nop,TS val 4191410580 ecr 867569233], length 0


As the captures show, every packet sent from a pod to a NodePort arrives at the receiving pod with its source address NAT'ed to the cni0 address 10.42.0.1.
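
The matching done by eye above can be sketched in a few lines: packets from the two capture points are paired whenever their flags, TS val, ecr and length agree. This is only an illustration under the assumption, supported by these captures, that NAT rewrites addresses and ports but leaves the TCP timestamp options untouched; the names `Packet`, `parse_capture` and `correlate` are hypothetical, and the two sample lines are the first packet of each capture.

```python
import re
from typing import List, NamedTuple, Set, Tuple

# Fields of a tcpdump TCP line that survive address translation.
PACKET_RE = re.compile(
    r"IP (?P<src>\S+) > (?P<dst>\S+): Flags \[(?P<flags>[^\]]+)\].*"
    r"TS val (?P<tsval>\d+) ecr (?P<tsecr>\d+)\], length (?P<length>\d+)"
)


class Packet(NamedTuple):
    src: str
    dst: str
    fingerprint: Tuple[str, int, int, int]  # flags, TS val, TS ecr, length


def parse_capture(text: str) -> List[Packet]:
    """Extract endpoints and a NAT-independent fingerprint from tcpdump lines."""
    packets = []
    for line in text.splitlines():
        m = PACKET_RE.search(line)
        if m:
            fp = (m["flags"], int(m["tsval"]), int(m["tsecr"]), int(m["length"]))
            packets.append(Packet(m["src"], m["dst"], fp))
    return packets


def correlate(a: List[Packet], b: List[Packet]) -> Set[Tuple[str, str]]:
    """Pair endpoints from two captures whose packets share a fingerprint."""
    by_fp = {p.fingerprint: p for p in b}
    pairs = set()
    for p in a:
        q = by_fp.get(p.fingerprint)
        if q is not None:
            pairs.add((p.src, q.src))
            pairs.add((p.dst, q.dst))
    return pairs


ON_MASTER_1 = (
    "05:40:04.817984 IP redisc-shard-5g8-0.redisc-shard-5g8-headless.default.svc.cluster.local.58412"
    " > k3d-k3s-default-server-0.31879: Flags [P.], seq 6976:9336, ack 7081, win 10027,"
    " options [nop,nop,TS val 4191410578 ecr 867568717], length 2360"
)
ON_MASTER_2 = (
    "05:40:04.818178 IP 10.42.0.1.24780"
    " > redisc-shard-hxx-0.redisc-shard-hxx-headless.default.svc.cluster.local.16379: Flags [P.],"
    " seq 32624:34984, ack 32937, win 10027,"
    " options [nop,nop,TS val 4191410578 ecr 867568717], length 2360"
)

for left, right in sorted(correlate(parse_capture(ON_MASTER_1), parse_capture(ON_MASTER_2))):
    print(left, "<=>", right)
```

Pure ACKs with identical timestamps could collide in a busier capture, so a real tool would also want to compare packet times or payload bytes.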

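3. The truth comes out

The reason the meet failed is now fairly clear. Without any announce settings, slave-1 uses its pod ip (10.42.0.237) to meet master-1. By the time the meet packet reaches master-1's pod, its source has been NAT'ed to 10.42.0.1. master-1 then calls back slave-1 using the default bus port 16379 and the source ip taken from the packet (10.42.0.1). Since 10.42.0.1 is not actually a Redis pod, no redis-server process is listening on port 16379 there, and the connection is refused.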
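
The failure can be condensed into a tiny model of how the callback address is chosen. This is a simplification of the behavior described in this article, not Redis source code, and `callback_address` is a hypothetical name: announced values win when present, otherwise the ip observed on the incoming connection and the default bus port (data port + 10000) are used.

```python
from typing import Optional, Tuple

BUS_PORT_OFFSET = 10000  # default bus port = data port + 10000


def callback_address(observed_source_ip: str,
                     announced_ip: Optional[str] = None,
                     announced_bus_port: Optional[int] = None,
                     data_port: int = 6379) -> Tuple[str, int]:
    """Address a node dials back after receiving a MEET from a new peer."""
    ip = announced_ip or observed_source_ip
    bus_port = announced_bus_port or data_port + BUS_PORT_OFFSET
    return ip, bus_port


# Without announce: the NAT'ed source ip and the default bus port are used.
print(callback_address("10.42.0.1"))
# With announce: the NodePort address is used, regardless of NAT.
print(callback_address("10.42.0.1", "172.18.0.2", 31153))
```

The first call reproduces the unreachable 10.42.0.1:16379 from the log; the second shows why announcing ip and bus port sidesteps the NAT'ed source address.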

Fixing the problem

1. slave-1 announce & remeet

Once the cause is known, the fix is fairly straightforward.


For this kind of failed meet, we can have slave-1 announce its ip/port/bus-port and then join again on its own initiative. The callback then uses the announced ip to connect.


    slc@slcmac redis % kubectl exec -it redisc-shard-5g8-1 -c redis-cluster -- redis-cli -a O3605v7HsS config set cluster-announce-ip 172.18.0.2
    slc@slcmac redis % kubectl exec -it redisc-shard-5g8-1 -c redis-cluster -- redis-cli -a O3605v7HsS config set cluster-announce-port 31309
    slc@slcmac redis % kubectl exec -it redisc-shard-5g8-1 -c redis-cluster -- redis-cli -a O3605v7HsS config set cluster-announce-bus-port 31153

    # Run cluster nodes on redisc-shard-5g8-1: the newly announced address and ports are in use
    127.0.0.1:6379> cluster nodes
    ff935854b7626a7e4374598857d5fbe998297799 172.18.0.2:30039@32461 master - 0 1713334354116 0 connected 0-5461
    # before the announce this entry was :6379@16379
    3a136cd50eb3f2c0dcc3844a0de63d5e44b462d7 172.18.0.2:31309@31153 myself,slave ff935854b7626a7e4374598857d5fbe998297799 0 0 0 connected
    e4d9b914e7ee7c4fd399bdf3dd1c98f7a0a1791b 172.18.0.2:31993@30105 master - 0 1713334354325 2 connected 10923-16383
    a54e8fa9474c620154f4c1abc9628116deb3dc7e 172.18.0.2:30182@31879 master - 0 1713334354532 1 connected 5462-10922

    # Meet master-1 again
    127.0.0.1:6379> cluster meet 172.18.0.2 30039 32461
    OK


On master-1 we can see the difference before and after the meet:


    root@redisc-shard-5g8-0:/data# redis-cli -a O3605v7HsS
    Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
    127.0.0.1:6379> cluster nodes
    ff935854b7626a7e4374598857d5fbe998297799 172.18.0.2:30039@32461 myself,master - 0 1713334463000 0 connected 0-5461
    e4d9b914e7ee7c4fd399bdf3dd1c98f7a0a1791b 172.18.0.2:31993@30105 master - 0 1713334463613 2 connected 10923-16383
    a54e8fa9474c620154f4c1abc9628116deb3dc7e 172.18.0.2:30182@31879 master - 0 1713334463613 1 connected 5462-10922
    127.0.0.1:6379> cluster nodes
    ff935854b7626a7e4374598857d5fbe998297799 172.18.0.2:30039@32461 myself,master - 0 1713334506000 0 connected 0-5461
    3a136cd50eb3f2c0dcc3844a0de63d5e44b462d7 172.18.0.2:31309@31153 slave ff935854b7626a7e4374598857d5fbe998297799 0 1713334506133 0 connected
    e4d9b914e7ee7c4fd399bdf3dd1c98f7a0a1791b 172.18.0.2:31993@30105 master - 0 1713334506133 2 connected 10923-16383
    a54e8fa9474c620154f4c1abc9628116deb3dc7e 172.18.0.2:30182@31879 master - 0 1713334506233 1 connected 5462-10922


On master-1 there is now one additional gossip connection, to slave-1:


    root@redisc-shard-5g8-0:/data# netstat -anop | grep redis
    tcp        0      0 0.0.0.0:6379            0.0.0.0:*               LISTEN      1/redis-server *:63  off (0.00/0/0)
    tcp        0      0 0.0.0.0:16379           0.0.0.0:*               LISTEN      1/redis-server *:63  off (0.00/0/0)
    tcp        0      0 127.0.0.1:6379          127.0.0.1:46798         ESTABLISHED 1/redis-server *:63  keepalive (22.34/0/0)
    tcp        0      0 10.42.0.236:58412       172.18.0.2:31879        ESTABLISHED 1/redis-server *:63  off (0.00/0/0)
    tcp        0      0 10.42.0.236:6379        10.42.0.1:45255         ESTABLISHED 1/redis-server *:63  keepalive (22.15/0/0)
    tcp        0      0 10.42.0.236:43732       172.18.0.2:31153        ESTABLISHED 1/redis-server *:63  off (0.00/0/0) // to slave-1 nodeport
    tcp        0      0 10.42.0.236:36528       172.18.0.2:30105        ESTABLISHED 1/redis-server *:63  off (0.00/0/0)
    tcp        0      0 10.42.0.236:16379       10.42.0.1:16471         ESTABLISHED 1/redis-server *:63  keepalive (1.17/0/0)
    tcp        0      0 10.42.0.236:16379       10.42.0.1:30788         ESTABLISHED 1/redis-server *:63  keepalive (0.97/0/0)
    tcp        0      0 10.42.0.236:16379       10.42.0.1:20521         ESTABLISHED 1/redis-server *:63  keepalive (1.48/0/0)
    tcp6       0      0 :::6379                 :::*                    LISTEN      1/redis-server *:63  off (0.00/0/0)
    tcp6       0      0 :::16379                :::*                    LISTEN      1/redis-server *:63  off (0.00/0/0)


On slave-1 there are now three additional gossip connections, coming from master-1/2/3:


    root@redisc-shard-5g8-1:/# netstat -anop | grep redis
    tcp        0      0 0.0.0.0:16379           0.0.0.0:*               LISTEN      1/redis-server *:63  off (0.00/0/0)
    tcp        0      0 0.0.0.0:6379            0.0.0.0:*               LISTEN      1/redis-server *:63  off (0.00/0/0)
    tcp        0      0 10.42.0.237:48424       172.18.0.2:31879        ESTABLISHED 1/redis-server *:63  off (0.00/0/0)
    tcp        0      0 10.42.0.237:16379       10.42.0.1:35577         ESTABLISHED 1/redis-server *:63  keepalive (1.11/0/0) // from NAT master
    tcp        0      0 10.42.0.237:36154       172.18.0.2:32461        ESTABLISHED 1/redis-server *:63  off (0.00/0/0)
    tcp        0      0 10.42.0.237:16379       10.42.0.1:32078         ESTABLISHED 1/redis-server *:63  keepalive (0.15/0/0) // from NAT master
    tcp        0      0 10.42.0.237:33504       172.18.0.2:30039        ESTABLISHED 1/redis-server *:63  keepalive (0.00/0/0)
    tcp        0      0 127.0.0.1:6379          127.0.0.1:46948         ESTABLISHED 1/redis-server *:63  keepalive (0.00/0/0)
    tcp        0      0 10.42.0.237:58576       172.18.0.2:30105        ESTABLISHED 1/redis-server *:63  off (0.00/0/0)
    tcp        0      0 10.42.0.237:16379       10.42.0.1:35265         ESTABLISHED 1/redis-server *:63  keepalive (1.22/0/0) // from NAT master
    tcp6       0      0 :::16379                :::*                    LISTEN      1/redis-server *:63  off (0.00/0/0)
    tcp6       0      0 :::6379                 :::*                    LISTEN      1/redis-server *:63  off (0.00/0/0)


These three connections are the masters connecting successfully through slave-1's NodePort; inside the pod their source addresses again appear NAT'ed to the cni0 address.
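
The same three settings have to be applied to every node before it joins, so it is natural to generate the commands rather than type them. Here is a minimal sketch that renders the kubectl invocations used in this walkthrough; `announce_commands` is a hypothetical helper, the container name defaults to the `redis-cluster` seen in the transcripts, and authentication (`-a` in the transcripts) is deliberately left out.

```python
import shlex
from typing import List


def announce_commands(pod: str, node_ip: str, node_port: int, bus_node_port: int,
                      container: str = "redis-cluster") -> List[str]:
    """Render the three CONFIG SET commands that announce a NodePort address."""
    settings = [
        ("cluster-announce-ip", node_ip),
        ("cluster-announce-port", str(node_port)),
        ("cluster-announce-bus-port", str(bus_node_port)),
    ]
    prefix = ["kubectl", "exec", "-it", pod, "-c", container, "--", "redis-cli"]
    # shlex.join quotes arguments where needed, so the output is safe to paste.
    return [shlex.join(prefix + ["config", "set", name, value])
            for name, value in settings]


for command in announce_commands("redisc-shard-hxx-1", "172.18.0.2", 30662, 30960):
    print(command)
```

In an operator the NodePort values would be looked up from the Service objects rather than passed in by hand; that lookup is outside the scope of this sketch.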

2. slave-2 announce & meet

Announce ip/port/bus-port:


    slc@slcmac redis % kubectl exec -it redisc-shard-hxx-1 -c redis-cluster -- redis-cli -a O3605v7HsS config set cluster-announce-ip 172.18.0.2
    slc@slcmac redis % kubectl exec -it redisc-shard-hxx-1 -c redis-cluster -- redis-cli -a O3605v7HsS config set cluster-announce-port 30662
    slc@slcmac redis % kubectl exec -it redisc-shard-hxx-1 -c redis-cluster -- redis-cli -a O3605v7HsS config set cluster-announce-bus-port 30960
    slc@slcmac redis % kubectl exec -it redisc-shard-hxx-1 -c redis-cluster -- /bin/bash


Add-node slave-2 (this step includes the meet operation):


    redis-cli -a O3605v7HsS --cluster add-node 172.18.0.2:30662 172.18.0.2:30182 --cluster-slave --cluster-master-id a54e8fa9474c620154f4c1abc9628116deb3dc7e


Cluster topology as seen on slave-2:


    127.0.0.1:6379> cluster nodes
    3a136cd50eb3f2c0dcc3844a0de63d5e44b462d7 172.18.0.2:31309@31153 slave ff935854b7626a7e4374598857d5fbe998297799 0 1713335442641 0 connected
    a54e8fa9474c620154f4c1abc9628116deb3dc7e 172.18.0.2:30182@31879 master - 0 1713335442328 1 connected 5462-10922
    e4d9b914e7ee7c4fd399bdf3dd1c98f7a0a1791b 172.18.0.2:31993@30105 master - 0 1713335442328 2 connected 10923-16383
    4d497f9b4ff459b8c65f50afa6621e122e1d8470 172.18.0.2:30662@30960 myself,slave a54e8fa9474c620154f4c1abc9628116deb3dc7e 0 1713335442000 1 connected
    ff935854b7626a7e4374598857d5fbe998297799 172.18.0.2:30039@32461 master - 0 1713335442641 0 connected 0-5461


Cluster topology as seen on master-2:


    127.0.0.1:6379> cluster nodes
    e4d9b914e7ee7c4fd399bdf3dd1c98f7a0a1791b 172.18.0.2:31993@30105 master - 0 1713335448690 2 connected 10923-16383
    ff935854b7626a7e4374598857d5fbe998297799 172.18.0.2:30039@32461 master - 0 1713335448892 0 connected 0-5461
    a54e8fa9474c620154f4c1abc9628116deb3dc7e 172.18.0.2:30182@31879 myself,master - 0 1713335448000 1 connected 5462-10922
    4d497f9b4ff459b8c65f50afa6621e122e1d8470 172.18.0.2:30662@30960 slave a54e8fa9474c620154f4c1abc9628116deb3dc7e 0 1713335448998 1 connected
    3a136cd50eb3f2c0dcc3844a0de63d5e44b462d7 172.18.0.2:31309@31153 slave ff935854b7626a7e4374598857d5fbe998297799 0 1713335448794 0 connected

3. slave-3 announce & meet

Announce first, then add-node:


    slc@slcmac redis % kubectl exec -it redisc-shard-xwz-1 -c redis-cluster -- redis-cli -a O3605v7HsS config set cluster-announce-ip 172.18.0.2
    slc@slcmac redis % kubectl exec -it redisc-shard-xwz-1 -c redis-cluster -- redis-cli -a O3605v7HsS config set cluster-announce-port 30110
    slc@slcmac redis % kubectl exec -it redisc-shard-xwz-1 -c redis-cluster -- redis-cli -a O3605v7HsS config set cluster-announce-bus-port 30971
    slc@slcmac redis % kubectl exec -it redisc-shard-xwz-1 -c redis-cluster -- /bin/bash
    root@redisc-shard-xwz-1:/# redis-cli -a O3605v7HsS --cluster add-node 172.18.0.2:30110 172.18.0.2:31993 --cluster-slave --cluster-master-id e4d9b914e7ee7c4fd399bdf3dd1c98f7a0a1791b
    >>> Adding node 172.18.0.2:30110 to cluster 172.18.0.2:31993
    >>> Performing Cluster Check (using node 172.18.0.2:31993)
    M: e4d9b914e7ee7c4fd399bdf3dd1c98f7a0a1791b 172.18.0.2:31993
       slots:[10923-16383] (5461 slots) master
    M: ff935854b7626a7e4374598857d5fbe998297799 172.18.0.2:30039
       slots:[0-5461] (5462 slots) master
       1 additional replica(s)
    S: 3a136cd50eb3f2c0dcc3844a0de63d5e44b462d7 172.18.0.2:31309
       slots: (0 slots) slave
       replicates ff935854b7626a7e4374598857d5fbe998297799
    M: a54e8fa9474c620154f4c1abc9628116deb3dc7e 172.18.0.2:30182
       slots:[5462-10922] (5461 slots) master
       1 additional replica(s)
    S: 4d497f9b4ff459b8c65f50afa6621e122e1d8470 172.18.0.2:30662
       slots: (0 slots) slave
       replicates a54e8fa9474c620154f4c1abc9628116deb3dc7e
    [OK] All nodes agree about slots configuration.
    >>> Check for open slots...
    >>> Check slots coverage...
    [OK] All 16384 slots covered.
    >>> Send CLUSTER MEET to node 172.18.0.2:30110 to make it join the cluster.
    Waiting for the cluster to join

    >>> Configure node as replica of 172.18.0.2:31993.
    [OK] New node added correctly.


Cluster topology as seen on any master:


    127.0.0.1:6379> cluster nodes
    e4d9b914e7ee7c4fd399bdf3dd1c98f7a0a1791b 172.18.0.2:31993@30105 master - 0 1713335724101 2 connected 10923-16383
    ff935854b7626a7e4374598857d5fbe998297799 172.18.0.2:30039@32461 master - 0 1713335724101 0 connected 0-5461
    a54e8fa9474c620154f4c1abc9628116deb3dc7e 172.18.0.2:30182@31879 myself,master - 0 1713335724000 1 connected 5462-10922
    4d497f9b4ff459b8c65f50afa6621e122e1d8470 172.18.0.2:30662@30960 slave a54e8fa9474c620154f4c1abc9628116deb3dc7e 0 1713335724404 1 connected
    3a136cd50eb3f2c0dcc3844a0de63d5e44b462d7 172.18.0.2:31309@31153 slave ff935854b7626a7e4374598857d5fbe998297799 0 1713335724510 0 connected
    161ff6ea42047be45d986ed8ba4505afd07096d9 172.18.0.2:30110@30971 slave e4d9b914e7ee7c4fd399bdf3dd1c98f7a0a1791b 0 1713335724101 2 connected


The cluster is now in its complete form of 3 masters and 3 replicas.

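About CNI

1. k3s + Flannel + NodePort/Pod

k3s/k3d uses flannel as its default CNI. As analyzed above, flannel exhibits the NAT mapping problem.

2. k3s + Calico + NodePort

We also tested k3s + Calico. Calico uses vxlan to build the pod network. Testing showed that the NAT problem still exists on Calico when NodePort is used. Suppose the NodePort we use is 10.128.0.52:32135. In the inbound direction, for traffic arriving at the local port 16379, the NodePort address (10.128.0.52) is still translated into the address of the vxlan.calico network device on that Node's host (192.168.238.0).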


These are the network connections of one of the slaves:


    root@redisc-shard-ffv-1:/# netstat -anop | grep redis
    tcp        0      0 0.0.0.0:16379           0.0.0.0:*               LISTEN      1/redis-server *:63  off (0.00/0/0)
    tcp        0      0 0.0.0.0:6379            0.0.0.0:*               LISTEN      1/redis-server *:63  off (0.00/0/0)
    tcp        0      0 192.168.32.136:41800    10.128.0.52:32135       ESTABLISHED 1/redis-server *:63  off (0.00/0/0)
    tcp        0      0 192.168.32.136:45578    10.128.0.52:31952       ESTABLISHED 1/redis-server *:63  keepalive (277.76/0/0) // to the remote NodePort
    tcp        0      0 127.0.0.1:6379          127.0.0.1:45998         ESTABLISHED 1/redis-server *:63  keepalive (185.62/0/0)
    tcp        0      0 192.168.32.136:53280    10.128.0.52:32675       ESTABLISHED 1/redis-server *:63  off (0.00/0/0)
    tcp        0      0 192.168.32.136:16379    192.168.238.0:8740      ESTABLISHED 1/redis-server *:63  keepalive (8.79/0/0) // from the remote side via NodePort, after NAT
    tcp        0      0 192.168.32.136:16379    192.168.238.0:9617      ESTABLISHED 1/redis-server *:63  keepalive (1.70/0/0)
    tcp        0      0 192.168.32.136:34040    10.128.0.52:31454       ESTABLISHED 1/redis-server *:63  off (0.00/0/0)
    tcp        0      0 192.168.32.136:16379    192.168.238.0:18110     ESTABLISHED 1/redis-server *:63  keepalive (1.82/0/0)
    tcp        0      0 192.168.32.136:39006    10.128.0.52:30390       ESTABLISHED 1/redis-server *:63  off (0.00/0/0)
    tcp        0      0 192.168.32.136:16379    192.168.238.0:32651     ESTABLISHED 1/redis-server *:63  keepalive (1.57/0/0)
    tcp        0      0 192.168.32.136:54986    10.128.0.52:30459       ESTABLISHED 1/redis-server *:63  off (0.00/0/0)
    tcp        0      0 192.168.32.136:16379    192.168.238.0:43310     ESTABLISHED 1/redis-server *:63  keepalive (1.83/0/0)
    tcp6       0      0 :::16379                :::*                    LISTEN      1/redis-server *:63  off (0.00/0/0)
    tcp6       0      0 :::6379                 :::*                    LISTEN      1/redis-server *:63  off (0.00/0/0)


Two devices can be seen on Node 10.128.0.52:


    ens4: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1460
            inet 10.128.0.52  netmask 255.255.255.255  broadcast 0.0.0.0
            inet6 fe80::4001:aff:fe80:34  prefixlen 64  scopeid 0x20<link>
            ether 42:01:0a:80:00:34  txqueuelen 1000  (Ethernet)
            RX packets 3228477  bytes 3975395572 (3.9 GB)
            RX errors 0  dropped 0  overruns 0  frame 0
            TX packets 3025699  bytes 2382110168 (2.3 GB)
            TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

    vxlan.calico: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1410
            inet 192.168.238.0  netmask 255.255.255.255  broadcast 0.0.0.0
            inet6 fe80::64b2:cdff:fe99:7f96  prefixlen 64  scopeid 0x20<link>
            ether 66:b2:cd:99:7f:96  txqueuelen 1000  (Ethernet)
            RX packets 587707  bytes 714235654 (714.2 MB)
            RX errors 0  dropped 0  overruns 0  frame 0
            TX packets 810205  bytes 682665081 (682.6 MB)
            TX errors 0  dropped 31 overruns 0  carrier 0  collisions 0


If the Node used for the NodePort is the host the Pod runs on, Calico does not translate the address to the vxlan device address.


    slc@cluster-1:~$ kubectl exec -it redisc-shard-ffv-1 -c redis-cluster -- redis-cli -a O3605v7HsS config set cluster-announce-ip 10.128.0.54 // set the announced ip to the ip of the Node the Pod runs on
    OK
    slc@cluster-1:~$ kubectl exec -it redisc-shard-ffv-1 -c redis-cluster -- /bin/bash
    root@redisc-shard-ffv-1:/# netstat -anop | grep redis
    tcp        0      0 0.0.0.0:16379           0.0.0.0:*               LISTEN      1/redis-server *:63  off (0.00/0/0)
    tcp        0      0 0.0.0.0:6379            0.0.0.0:*               LISTEN      1/redis-server *:63  off (0.00/0/0)
    tcp        0      0 192.168.32.136:16379    10.128.0.54:44757       ESTABLISHED 1/redis-server *:63  keepalive (6.92/0/0)
    tcp        0      0 192.168.32.136:41800    10.128.0.52:32135       ESTABLISHED 1/redis-server *:63  off (0.00/0/0)
    tcp        0      0 192.168.32.136:16379    10.128.0.54:16772       ESTABLISHED 1/redis-server *:63  keepalive (0.64/0/0)
    tcp        0      0 192.168.32.136:45578    10.128.0.52:31952       ESTABLISHED 1/redis-server *:63  keepalive (70.79/0/0)
    tcp        0      0 127.0.0.1:6379          127.0.0.1:45998         ESTABLISHED 1/redis-server *:63  keepalive (0.00/0/0)
    tcp        0      0 192.168.32.136:53280    10.128.0.52:32675       ESTABLISHED 1/redis-server *:63  off (0.00/0/0)
    tcp        0      0 192.168.32.136:16379    10.128.0.54:16440       ESTABLISHED 1/redis-server *:63  keepalive (8.62/0/0)
    tcp        0      0 192.168.32.136:34040    10.128.0.52:31454       ESTABLISHED 1/redis-server *:63  off (0.00/0/0)
    tcp        0      0 192.168.32.136:16379    10.128.0.54:28655       ESTABLISHED 1/redis-server *:63  keepalive (0.14/0/0)
    tcp        0      0 192.168.32.136:39006    10.128.0.52:30390       ESTABLISHED 1/redis-server *:63  off (0.00/0/0)
    tcp        0      0 192.168.32.136:54986    10.128.0.52:30459       ESTABLISHED 1/redis-server *:63  off (0.00/0/0)
    tcp        0      0 192.168.32.136:16379    10.128.0.54:29959       ESTABLISHED 1/redis-server *:63  keepalive (8.62/0/0)
    tcp6       0      0 :::16379                :::*                    LISTEN      1/redis-server *:63  off (0.00/0/0)
    tcp6       0      0 :::6379                 :::*                    LISTEN      1/redis-server *:63  off (0.00/0/0)


So in the Calico vxlan setup, whether NodePort traffic is SNAT'ed depends on the source Node address: traffic through the local Node is not SNAT'ed, while traffic through a remote Node is. Because we announce explicitly, however, redis cluster meet still works without problems.

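3. k3s + Calico + Pod

If only pod ips are used, redis cluster meet works normally and the cluster topology is correct.

Summary

  1. In some k8s environments, depending on the CNI implementation, pod and NodePort traffic may be NAT'ed. Other members of the cluster cannot call back the ip and port seen after NAT, so the meet fails.

  2. Because of this, creating a Redis Cluster on k8s requires either host network, or NodePort with explicitly announced ip/port/bus-port. For a pure pod network without explicit announce, NAT must be ruled out entirely, and that depends on the CNI implementation.

  3. Redis Cluster's internal and external communication share one set of ip addresses. Once an announce ip is set, it overrides the pod ip for all subsequent communication, so the internal gossip exchange also travels over the announce network. This is unnecessary overhead, so our suggestion for the future is to separate the internal protocol link from the external application data link.

  4. However, even if the pod ip and the announce ip were used separately, with internal communication on the pod network and the data link to external clients on the announce network, the CNI NAT problem would remain: because of Redis Cluster's callback mechanism, an address seen after NAT cannot be called back directly. This requires an extension to the Redis Cluster communication protocol. Ideally: 1) internal communication uses the pod network, needs callbacks, and carries the original pod ip as the source ip so that it can be recovered even after NAT; 2) external communication uses the announce network, which can be NodePort or LoadBalancer, needs no callback, and does not care whether NAT is applied. Internal communication can of course also go through NodePort or LoadBalancer, provided the original source ip is carried along (an announce ip is in effect a kind of source ip). This is KubeBlocks' current solution.

  5. Using NodePort introduces another problem: when a Node goes down, the announce ip of the affected cluster nodes has to be updated. This is not easy to implement and requires close cooperation between the operator and the HA nodes.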

Source: InfoQ 写作社区 (InfoQ writing community), Database category.