Building a More Stable TiDB High-Availability Architecture with Chaos Mesh (Part 1)

2023-03-10
Beijing

Author: lqbyz. Original source: https://tidb.net/blog/eff7080c

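I. Introduction

This article introduces Chaos Mesh: what chaos engineering with Chaos Mesh is, its core features, an overview of its architecture, and the experiments it supports. It lays the groundwork for building a containerized TiDB database later in this series.

1. About Chaos Mesh

Chaos Mesh is an open-source, cloud-native chaos engineering platform. It offers a rich set of fault-simulation types and strong orchestration of fault scenarios, so that users can simulate real-world anomalies during development, testing, and production, and uncover latent problems in their systems. Chaos Mesh also provides complete visual operations to lower the barrier to chaos engineering: users can design their own chaos scenarios in the Web UI and monitor the status of running experiments there.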

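2. Core features of Chaos Mesh

As an industry-leading chaos testing platform, Chaos Mesh has the following core strengths:
Solid core capabilities: Chaos Mesh originated as the core testing platform of TiDB and inherited a large body of TiDB testing experience from its first release.
Thoroughly validated: Chaos Mesh is used by many companies and organizations, such as Tencent and Meituan, and is part of the test suites of well-known distributed systems such as Apache APISIX and RabbitMQ.
Easy to use: graphical operation and a Kubernetes-based workflow take full advantage of automation.
Cloud native: Chaos Mesh natively supports Kubernetes environments and provides strong automation.
Rich fault-simulation scenarios: Chaos Mesh covers the great majority of the basic fault-simulation scenarios in distributed testing.
Flexible experiment orchestration: users can design their own chaos scenarios on the platform; a scenario can orchestrate several chaos experiments together with application status checks.
High security: Chaos Mesh is designed with multi-layered security controls.
Active community: Chaos Mesh is a globally known open-source chaos testing platform and a CNCF incubating project.
Strong extensibility: Chaos Mesh is built so that both new fault types and new features can be added.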

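3. Architecture overview

Chaos Mesh is built on Kubernetes CRDs (Custom Resource Definitions). It defines a separate CRD type for each kind of fault and implements a separate Controller for each CRD object to manage the corresponding chaos experiments. Chaos Mesh consists of three main components:
Chaos Dashboard: the visualization component of Chaos Mesh. It provides a user-friendly web interface through which users operate and observe chaos experiments, and it also provides RBAC permission management.
Chaos Controller Manager: the core logic component of Chaos Mesh, responsible for scheduling and managing chaos experiments. It contains multiple CRD Controllers, such as the Workflow Controller, the Scheduler Controller, and the Controllers for the individual fault types.
Chaos Daemon: the main execution component of Chaos Mesh. Chaos Daemon runs as a DaemonSet and has Privileged permission by default (this can be turned off). It interferes with specific network devices, file systems, the kernel, and so on by entering the namespaces of the target Pod.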

II. Installation and deployment

1. Environment preparation

1. Before installing, make sure Helm is installed in the environment.

       [root@k8s-master chaos-mesh]# helm version
       version.BuildInfo{Version:"v3.4.1", GitCommit:"c4e74854886b2efe3321e185578e6db9be0a6e29", GitTreeState:"clean", GoVersion:"go1.14.11"}

2. Add the Chaos Mesh repository.

       helm repo add chaos-mesh https://charts.chaos-mesh.org

3. List the Chaos Mesh versions that can be installed.

       helm search repo chaos-mesh
       # or, to list every chart version:
       helm search repo chaos-mesh -l

4. Create the namespace.

       kubectl create ns chaos-testing

5. Install Chaos Mesh (this is the command for a Docker environment).

       helm install chaos-mesh chaos-mesh/chaos-mesh -n=chaos-testing

6. Verify the installation.

       [root@k8s-master chaos-mesh]# kubectl get po -n chaos-testing
       NAME                                        READY   STATUS    RESTARTS   AGE
       chaos-controller-manager-856bc96c68-6mppc   1/1     Running   0          6h49m
       chaos-controller-manager-856bc96c68-hk6nl   1/1     Running   0          6h50m
       chaos-controller-manager-856bc96c68-q99vm   1/1     Running   0          6h50m
       chaos-daemon-ng4vx                          1/1     Running   0          6h49m
       chaos-daemon-w2w7h                          1/1     Running   0          6h50m
       chaos-dashboard-5fdf8b8bb-nnnhz             1/1     Running   0          6h50m

   Note: to ensure high availability, Chaos Mesh enables leader election by default. If you do not need it, turn it off manually with --set controllerManager.leaderElection.enabled=false.

7. Upgrade Chaos Mesh.

       helm upgrade chaos-mesh chaos-mesh/chaos-mesh -n chaos-testing

8. Uninstall Chaos Mesh.

       helm uninstall chaos-mesh -n chaos-testing
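The installation check (all pods `Running` and fully ready) is easy to script. The following is a minimal sketch that is not part of the original post: the function name `pods_not_ready` is hypothetical, and it only parses the plain-text table that `kubectl get po` prints; the `Pending` row in the sample is invented to show a failing case.

```python
from typing import List


def pods_not_ready(kubectl_table: str) -> List[str]:
    """Return the names of pods that are not Running or not fully ready.

    `kubectl_table` is the text printed by `kubectl get po -n <namespace>`.
    """
    lines = [line for line in kubectl_table.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split()
    ready_col = header.index("READY")
    status_col = header.index("STATUS")
    problems = []
    for line in lines[1:]:
        cols = line.split()
        ready, total = cols[ready_col].split("/")
        if cols[status_col] != "Running" or ready != total:
            problems.append(cols[0])
    return problems


# Sample input: two rows from the article plus one hypothetical failing row.
sample = """\
NAME                                        READY   STATUS    RESTARTS   AGE
chaos-controller-manager-856bc96c68-6mppc   1/1     Running   0          6h49m
chaos-daemon-ng4vx                          1/1     Running   0          6h49m
chaos-dashboard-5fdf8b8bb-nnnhz             0/1     Pending   0          6h50m
"""
print(pods_not_ready(sample))
```

An empty list means every pod in the namespace is up; anything else names the pods to inspect with `kubectl describe`.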


2. Managing user permissions

2.1. Logging in with a token

1. Create a user and bind permissions. Open the Dashboard and click the "generate" link in the login dialog.

2. Use the token helper:
   2.1 Select the scope of the permissions.
   2.2 Select the role.
   2.3 Generate the RBAC configuration.
   2.4 Click Copy.

3. Create the user and bind the permissions.

       [root@k8s-master chaos-mesh]# cat /chaosMesh/rbac.yml
       kind: ServiceAccount
       apiVersion: v1
       metadata:
         namespace: tidb
         name: account-tidb-manager-aypth
       ---
       kind: Role
       apiVersion: rbac.authorization.k8s.io/v1
       metadata:
         namespace: tidb
         name: role-tidb-manager-aypth
       rules:
       - apiGroups: [""]
         resources: ["pods", "namespaces"]
         verbs: ["get", "watch", "list"]
       - apiGroups:
         - chaos-mesh.org
         resources: [ "*" ]
         verbs: ["get", "list", "watch", "create", "delete", "patch", "update"]
       ---
       apiVersion: rbac.authorization.k8s.io/v1
       kind: RoleBinding
       metadata:
         name: bind-tidb-manager-aypth
         namespace: tidb
       subjects:
       - kind: ServiceAccount
         name: account-tidb-manager-aypth
         namespace: tidb
       roleRef:
         kind: Role
         name: role-tidb-manager-aypth
         apiGroup: rbac.authorization.k8s.io

       kubectl apply -f rbac.yml

4. Generate the token and view it.

       kubectl describe -n tidb secrets account-tidb-manager-aypth
       Name:         account-tidb-manager-aypth-token-z4kvc
       Namespace:    tidb
       Labels:       <none>
       Annotations:  kubernetes.io/service-account.name: account-tidb-manager-aypth
                     kubernetes.io/service-account.uid: 98910f01-64b1-489c-be76-ab9241c6514a

       Type:  kubernetes.io/service-account-token

       Data
       ====
       token:      eyJhbGciOiJSUzI1NiIsImtpZCI6IlYxc2pxT1hRQkdZNGFaLUtPOWpEYVZLM1FIeFJPVzFvOXA2aGp6RS0xSjQifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJ0aWRiIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZWNyZXQubmFtZSI6ImFjY291bnQtdGlkYi1tYW5hZ2VyLWF5cHRoLXRva2VuLXo0a3ZjIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZXJ2aWNlLWFjY291bnQubmFtZSI6ImFjY291bnQtdGlkYi1tYW5hZ2VyLWF5cHRoIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZXJ2aWNlLWFjY291bnQudWlkIjoiOTg5MTBmMDEtNjRiMS00ODljLWJlNzYtYWI5MjQxYzY1MTRhIiwic3ViIjoic3lzdGVtOnNlcnZpY2VhY2NvdW50OnRpZGI6YWNjb3VudC10aWRiLW1hbmFnZXItYXlwdGgifQ.qZoomZT5ncAxCuRZ6R5hspa5tmqWMUHaNjjnM_Psa3HeShSYlcM-0ruVjtVj1-g-I2vCyLKYAUCuu4MHCEaULBBonwDwUHM1kqGH6EhrfBBKeLJ1H8EedsDA65RDoiBoYlJqnUi0NGrSbWHYVOEuPcoHTpRAS0gLvwtT77qkc4favMkwB0cX-wxgeBlgLqCq-i98PlOTs4-jQel6gO0j6kE38_sB1o8Bqk4my4NNv95SNZCIuiiwzipYTz7b9bmK3lF4A2s9BK6R6_7kBT5SPZ_YnIIb-C2rHZy0zUvZUsLBjPG32Wi0TDD1LF9A1lQz5lXwTZlyzrWeq082NmnMzw
       ca.crt:     1066 bytes
       namespace:  4 bytes

5. Log in to the Dashboard with the token.
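A service-account token like the one above is a JWT, so its middle segment can be decoded to confirm which account and namespace it belongs to before pasting it into the Dashboard. The sketch below is an illustration only and not part of the original post: `decode_jwt_payload` is a hypothetical helper, it does not verify the signature, and the demo token is a placeholder built in the code rather than a real credential.

```python
import base64
import json


def decode_jwt_payload(token: str) -> dict:
    """Decode the (unverified) payload segment of a JWT."""
    parts = token.strip().split(".")
    if len(parts) != 3:
        raise ValueError("expected header.payload.signature")
    payload = parts[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload))


def _segment(obj: dict) -> str:
    """Encode a dict the way JWT segments are encoded (base64url, no padding)."""
    raw = json.dumps(obj).encode()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


# Placeholder token for illustration; the signature part is a dummy string.
demo_token = ".".join([
    _segment({"alg": "RS256"}),
    _segment({"sub": "system:serviceaccount:tidb:account-tidb-manager-aypth"}),
    "signature",
])
print(decode_jwt_payload(demo_token)["sub"])
```

Because the payload is only base64url-encoded, anyone holding the token can read these claims; treat the token itself as a secret.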


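2.2. Disabling token login (insecure)

When Chaos Mesh is installed with Helm, permission authentication is enabled by default. Keep it enabled in production and in any other environment with high security requirements. If you only want to try out Chaos Mesh and create chaos experiments quickly, you can turn authentication off by passing --set dashboard.securityMode=false to Helm:

    helm upgrade chaos-mesh chaos-mesh/chaos-mesh --namespace=chaos-testing --version 2.1.4 --set dashboard.securityMode=false

Note: to re-enable permission authentication, run the command again with --set dashboard.securityMode=true.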


III. Types of chaos experiments

(1) Preparing the experiment environment

Create a Deployment for the test pod.
> 1. Create the pod through a Deployment.
>
>        # cat web-show.yml
>        apiVersion: apps/v1
>        kind: Deployment
>        metadata:
>          name: webshow-deployment
>          labels:
>            app: webshow-deployment
>        spec:
>          replicas: 1
>          selector:
>            matchLabels:
>              app: webshow-deployment
>          template:
>            metadata:
>              labels:
>                app: webshow-deployment
>            spec:
>              containers:
>                - name: webshow-deployment
>                  image: pingcap/web-show
>                  imagePullPolicy: Always
>                  command:
>                    - /usr/local/bin/web-show
>                    - --target-ip=${TARGET_IP}
>                  ports:
>                    - name: web-port
>                      containerPort: 8081
>                      hostPort: 8081
>
> 2. Create the service. The later steps expect it in the chaosmesh-test namespace, so add `-n chaosmesh-test` if that is not your current namespace.
>
>        # kubectl apply -f web-show.yml
>
> 3. On the master node, forward the service port.
>
>        # nohup kubectl port-forward --address 0.0.0.0 deployment.apps/webshow-deployment 8081:8081 -n chaosmesh-test &
>
> 4. If the port is already in use, kill the process holding it and repeat step 3.
>
>        kill $(lsof -t -i:8081) > /dev/null 2>&1 || true
>
> 5. If everything is working, the web-show page can be opened in a browser on port 8081.

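### (2) Experiments

#### 3.2.1 Creating a POD FAILURE experiment

##### 1. Click Experiments, then New experiment

##### 2. Select the experiment type: KUBERNETES, then Pod Fault

##### 3. Fill in the experiment information tab

Note: the `mode` field accepts the following values:

> It specifies how the experiment selects its targets: `one` (a randomly chosen Pod that matches the conditions), `all` (every Pod that matches), `fixed` (a specified number of matching Pods), `fixed-percent` (a specified percentage of the matching Pods), and `random-max-percent` (at most a specified percentage of the matching Pods).

##### 4. Submit the information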
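The Dashboard form is a front end for a `PodChaos` object, so the same experiment can also be kept as a manifest. The sketch below builds one as a Python dict and prints it as JSON, which `kubectl apply -f` accepts as well as YAML. It is an illustration under assumptions: the helper name `pod_failure_manifest` and the `30s` duration are invented for this example, while the experiment name, namespace, and label come from this article.

```python
import json


def pod_failure_manifest(name: str, namespace: str, app_label: str,
                         mode: str = "one", duration: str = "30s") -> dict:
    """Build a PodChaos manifest that makes the selected pods unavailable."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "PodChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "pod-failure",
            "mode": mode,
            "duration": duration,  # assumed value, adjust to your test
            "selector": {
                "namespaces": [namespace],
                "labelSelectors": {"app": app_label},
            },
        },
    }


manifest = pod_failure_manifest(
    "pod-failure-01", "chaosmesh-test", "webshow-deployment")
print(json.dumps(manifest, indent=2))
```

Saving the printed JSON to a file and applying it with `kubectl apply -f` should create the same kind of object that appears as `podchaos.chaos-mesh.org/pod-failure-01` in the listings in this article.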

##### 5. Monitor the pod from the Kubernetes master node

    # watch kubectl get pod,PodChaos,StressChaos,NetworkChaos -n chaosmesh-test
    NAME                                      READY   STATUS    RESTARTS   AGE
    pod/webshow-deployment-6cbdcc4cd4-ljbtk   1/1     Running   7          6h43m

    NAME                                          AGE
    podchaos.chaos-mesh.org/pod-containers-kill   7h13m
    podchaos.chaos-mesh.org/pod-failure-01        20m
    podchaos.chaos-mesh.org/pod-kill              8h
    podchaos.chaos-mesh.org/pod-kill-all          6h43m
    podchaos.chaos-mesh.org/pod-kill03            8h

    NAME                                 DURATION
    stresschaos.chaos-mesh.org/pod-cpu   5m

    NAME                                              ACTION   DURATION
    networkchaos.chaos-mesh.org/network-delay         loss     5m
    networkchaos.chaos-mesh.org/network-delay-02      delay    5m
    networkchaos.chaos-mesh.org/pod-network-delay     delay    70s
    networkchaos.chaos-mesh.org/pod-network-loss      loss     120s
    networkchaos.chaos-mesh.org/pod-network-loss-01   loss     2m

##### 6. Problems that occur while the experiment runs (see the screenshot in the original post)

##### 7. Check the experiment result with kubectl

> Use `kubectl describe` to view the `Status` and `Events` of the chaos experiment object and determine the result of the experiment.

    kubectl describe networkchaos.chaos-mesh.org/network-delay -nchaosmesh-test
    Name:         network-delay
    Namespace:    chaosmesh-test
    Labels:       <none>
    Annotations:  experiment.chaos-mesh.org/pause: false
    API Version:  chaos-mesh.org/v1alpha1
    Kind:         NetworkChaos
    Metadata:
      Creation Timestamp:  2022-04-01T08:06:54Z
      Finalizers:
        chaos-mesh/records
      Generation:  24
      Managed Fields:
        API Version:  chaos-mesh.org/v1alpha1
        Fields Type:  FieldsV1
        fieldsV1:
          f:metadata:
            f:finalizers:
              .:
              v:"chaos-mesh/records":
          f:status:
            f:conditions:
            f:experiment:
              f:containerRecords:
              f:desiredPhase:
            f:instances:
              .:
              f:chaosmesh-test/webshow-deployment-6cbdcc4cd4-ljbtk:
        Manager:      chaos-controller-manager
        Operation:    Update
        Time:         2022-04-01T08:06:54Z
        API Version:  chaos-mesh.org/v1alpha1
        Fields Type:  FieldsV1
        fieldsV1:
          f:metadata:
            f:annotations:
              .:
              f:experiment.chaos-mesh.org/pause:
          f:spec:
            .:
            f:action:
            f:direction:
            f:duration:
            f:loss:
              .:
              f:loss:
            f:mode:
            f:selector:
              .:
              f:labelSelectors:
                .:
                f:app:
              f:namespaces:
          f:status:
            .:
            f:experiment:
        Manager:         chaos-dashboard
        Operation:       Update
        Time:            2022-04-01T08:11:11Z
      Resource Version:  1305926
      UID:               ee609703-aa48-4b55-9ff2-88b4aab967b5
    Spec:
      Action:     loss
      Direction:  to
      Duration:   5m
      Loss:
        Correlation:  0
        Loss:         80
      Mode:  all
      Selector:
        Label Selectors:
          App:  webshow-deployment
        Namespaces:
          chaosmesh-test
    Status:
      Conditions:
        Reason:
        Status:  False
        Type:    AllInjected
        Reason:
        Status:  True
        Type:    AllRecovered
        Reason:
        Status:  False
        Type:    Paused
        Reason:
        Status:  True
        Type:    Selected
      Experiment:
        Container Records:
          Id:            chaosmesh-test/webshow-deployment-6cbdcc4cd4-ljbtk
          Phase:         Not Injected
          Selector Key:  .
        Desired Phase:   Stop
      Instances:
        chaosmesh-test/webshow-deployment-6cbdcc4cd4-ljbtk:  11
    Events:  <none>

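The output above contains two main parts:

- `Status`

  Following the execution flow of a chaos experiment, `Status` provides four kinds of condition records:

  - `Paused`: the chaos experiment is currently paused.
  - `Selected`: the chaos experiment has correctly selected its targets.
  - `AllInjected`: faults have been successfully injected into all targets.
  - `AllRecovered`: the faults on all targets have been successfully recovered.

  These four records let you infer the actual state of the experiment. For example:

  - When `Paused`, `Selected`, and `AllRecovered` are `True` and `AllInjected` is `False`, the experiment is paused.
  - When `Paused` is `True`, the experiment is paused; if `Selected` is `False` at the same time, the experiment was also unable to select any target.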


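- `Events`

  The event list records the operations performed over the whole life cycle of the chaos experiment. It helps you determine the state of the experiment and troubleshoot problems.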
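The way the four conditions combine can be written down as a small decision function. This is a sketch of the reasoning described above, not Chaos Mesh code: `describe_experiment` is a hypothetical helper, and the branches beyond the two examples given in the text (the "faults injected", "faults recovered", and "in transition" results) are this example's own reading of the condition names.

```python
from typing import Dict


def describe_experiment(conditions: Dict[str, bool]) -> str:
    """Summarize an experiment from its four Status conditions."""
    paused = conditions.get("Paused", False)
    selected = conditions.get("Selected", False)
    injected = conditions.get("AllInjected", False)
    recovered = conditions.get("AllRecovered", False)

    if not selected:
        # From the text: Paused=True with Selected=False means the
        # experiment could not pick any target.
        return ("paused, " if paused else "") + "no target selected"
    if paused:
        return "paused"
    if injected:
        return "faults injected"       # assumption of this sketch
    if recovered:
        return "faults recovered"      # assumption of this sketch
    return "in transition"             # assumption of this sketch


# Conditions taken from the `kubectl describe` output of network-delay.
network_delay = {
    "AllInjected": False,
    "AllRecovered": True,
    "Paused": False,
    "Selected": True,
}
print(describe_experiment(network_delay))
```

For the `network-delay` object shown earlier this prints `faults recovered`, which matches its `Desired Phase: Stop`.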

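##### 8. Check the Dashboard
##### 9. When the experiment ends, check that the pod's service is healthy

##### 10. Archive the experiment

To delete a chaos experiment in the Dashboard and move it into the archive history, click the **Archive** button of that experiment.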

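#### 3.2.2 Simulating network faults

> - During network fault injection, make sure the connection between Controller Manager and Chaos Daemon stays open; otherwise the fault cannot be recovered.
> - To use the Net Emulation feature, make sure the Linux kernel has the NET\_SCH\_NETEM module. On CentOS it can be installed through the kernel-modules-extra package; most other distributions ship the module by default.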

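#### (1) Simulating loss

##### 1. Select New experiment, Network Attack, then LOSS

> loss: the probability that a packet is lost. Range: \[0, 100]
>
> correlation: how strongly the current packet-loss probability correlates with the previous one. Range: \[0, 100]
>
> direction: one of `from`, `to`, or `both`. It selects "packets coming from the target", "packets going to the target", or both.
>
> externalTargets: network targets outside Kubernetes, given as IPv4 addresses or domain names, for example 8.8.8.8 or baidu.com. It only works together with `direction: to`.

##### 2. Fill in the experiment information and submit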
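For reference, the form fields map onto a `NetworkChaos` object. The following sketch builds such a manifest with the values used in this article (80% loss, correlation 0, direction `to`, duration `5m`, mode `all`) and prints it as JSON for `kubectl apply -f`. The helper name `network_loss_manifest` and its validation rules are assumptions made for illustration; they restate the parameter descriptions above and are not the Dashboard's own code.

```python
import json
from typing import List, Optional


def network_loss_manifest(name: str, namespace: str, app_label: str,
                          loss_percent: int, correlation: int = 0,
                          direction: str = "to", duration: str = "5m",
                          external_targets: Optional[List[str]] = None) -> dict:
    """Build a NetworkChaos manifest for the packet-loss action."""
    for label, value in (("loss", loss_percent), ("correlation", correlation)):
        if not 0 <= value <= 100:
            raise ValueError(f"{label} must be within [0, 100]")
    if direction not in ("from", "to", "both"):
        raise ValueError("direction must be from, to, or both")
    spec = {
        "action": "loss",
        "mode": "all",
        "direction": direction,
        "duration": duration,
        "selector": {
            "namespaces": [namespace],
            "labelSelectors": {"app": app_label},
        },
        # Both values are written as strings in the manifest.
        "loss": {"loss": str(loss_percent), "correlation": str(correlation)},
    }
    if external_targets:
        if direction != "to":
            raise ValueError("externalTargets only works with direction 'to'")
        spec["externalTargets"] = list(external_targets)
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "NetworkChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": spec,
    }


loss_manifest = network_loss_manifest(
    "network-delay", "chaosmesh-test", "webshow-deployment",
    loss_percent=80, external_targets=["8.8.8.8"])
print(json.dumps(loss_manifest, indent=2))
```

The object is named `network-delay` only because that is the name the author gave the loss experiment inspected with `kubectl describe` in this article.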

##### 3. Enter the container and run ping; packets are lost

    [root@k8s-master ~]# kubectl exec -it pod/webshow-deployment-6cbdcc4cd4-ljbtk -nchaosmesh-test /bin/sh
    kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
    sh-4.2# ping 8.8.8.8
    PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
    64 bytes from 8.8.8.8: icmp_seq=54 ttl=108 time=53.5 ms
    ^C
    --- 8.8.8.8 ping statistics ---
    67 packets transmitted, 1 received, 98% packet loss, time 67604ms
    rtt min/avg/max/mdev = 53.598/53.598/53.598/0.000 ms
    sh-4.2# ping 8.8.8.8
    PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.

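#### (2) Simulating delay

##### 1. Create the configuration

##### 2. Review the experiment information and click Start

##### 3. Verify the result by pinging from inside the pod, from the master node, and from the worker node that hosts the pod

1. Enter the container and ping an external address:

       kubectl exec -it pod/webshow-deployment-6cbdcc4cd4-ljbtk -nchaosmesh-test -- /bin/sh
       # ping 8.8.8.8

2. On the master node, ping the pod's IP address.
3. On the worker node that hosts the pod, ping the pod's address.

Summary: in every case the pings show added latency or packet loss.

Notes: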

**Field reference**

| Parameter       | Type      | Description | Default | Required | Example |
| --------------- | --------- | ----------- | ------- | -------- | ------- |
| action          | string    | The fault type. netem, delay, loss, duplicate, and corrupt are net emulation types; partition means network partition; bandwidth means bandwidth limiting | none | yes | partition |
| target          | Selector  | Used together with direction so that the chaos only affects some packets | none | no | |
| direction       | enum      | One of `from`, `to`, or `both`. Selects "packets coming from the target", "packets going to the target", or both | to | no | both |
| mode            | string    | How the experiment selects targets: `one` (a randomly chosen matching Pod), `all` (every matching Pod), `fixed` (a specified number of matching Pods), `fixed-percent` (a specified percentage of the matching Pods), `random-max-percent` (at most a specified percentage of the matching Pods) | none | yes | `one` |
| value           | string    | Depends on `mode` and supplies its parameter. For example, with `mode` set to `fixed-percent`, `value` is the percentage of Pods | none | no | 1 |
| containerNames  | \[]string | Names of the containers to inject into | none | no | \["nginx"] |
| selector        | struct    | The target Pods of the fault; see [Define the scope of chaos experiments](https://chaos-mesh.org/zh/docs/define-chaos-experiment-scope/) | none | yes | |
| externalTargets | \[]string | Network targets outside Kubernetes, as IPv4 addresses or domain names. Only works together with `direction: to` | none | no | 1.1.1.1, [www.google.com](http://www.google.com/) |
| device          | string    | The network device to affect | none | no | "eth0" |

Parameters of the delay action:

| Parameter   | Type    | Description | Default | Required | Example |
| ----------- | ------- | ----------- | ------- | -------- | ------- |
| latency     | string  | The length of the delay | 0 | no | 2ms |
| correlation | string  | How strongly the current delay correlates with the previous delay. Range: \[0, 100] | 0 | no | 50 |
| jitter      | string  | The range within which the delay varies | 0 | no | 1ms |
| reorder     | Reorder | The packet reordering settings | | no | |

For details, see <https://chaos-mesh.org/zh/docs/simulate-network-chaos-on-kubernetes/#Loss>
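As a worked example of the delay parameters, the sketch below builds a `NetworkChaos` manifest with the example values from the table (`2ms` latency, `1ms` jitter, correlation 50) and prints it as JSON. The helper `network_delay_manifest` is hypothetical, and the duration pattern it checks is a deliberate simplification (digits followed by one unit), not the full syntax Chaos Mesh accepts.

```python
import json
import re

# Simplified duration check for this sketch: e.g. "2ms", "70s", "5m".
_DURATION = re.compile(r"^\d+(ns|us|ms|s|m|h)$")


def network_delay_manifest(name: str, namespace: str, app_label: str,
                           latency: str, jitter: str = "0ms",
                           correlation: int = 0, duration: str = "5m") -> dict:
    """Build a NetworkChaos manifest for the delay action."""
    for label, value in (("latency", latency), ("jitter", jitter),
                         ("duration", duration)):
        if not _DURATION.match(value):
            raise ValueError(f"{label} is not a simple duration: {value!r}")
    if not 0 <= correlation <= 100:
        raise ValueError("correlation must be within [0, 100]")
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "NetworkChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "delay",
            "mode": "all",
            "direction": "to",
            "duration": duration,
            "selector": {
                "namespaces": [namespace],
                "labelSelectors": {"app": app_label},
            },
            "delay": {
                "latency": latency,
                "jitter": jitter,
                "correlation": str(correlation),
            },
        },
    }


delay_manifest = network_delay_manifest(
    "network-delay-02", "chaosmesh-test", "webshow-deployment",
    latency="2ms", jitter="1ms", correlation=50)
print(json.dumps(delay_manifest, indent=2))
```

With a latency this small the effect is hard to see in `ping`; a larger value makes the added round-trip time obvious.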

#### 3.2.3 Simulating stress

##### 1. In the Dashboard, select Experiments, New experiment, then Stress Test

##### 2. Check the CPU load from inside the pod and on the worker node that hosts it

1. Check the load inside the container.

       [root@k8s-master ~]# kubectl exec -it pod/webshow-deployment-6cbdcc4cd4-ljbtk -nchaosmesh-test /bin/sh
       kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
       sh-4.2# top
       top - 03:17:58 up 7 days, 23:04,  0 users,  load average: 6.33, 1.99, 0.75
       Tasks:  16 total,   5 running,  11 sleeping,   0 stopped,   0 zombie
       %Cpu(s): 93.8 us,  4.7 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  1.6 si,  0.0 st
       KiB Mem :  8154912 total,   240500 free,  2111328 used,  5803084 buff/cache
       KiB Swap:        0 total,        0 free,        0 used.  5718908 avail Mem

         PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
          46 root      20   0  291064 253740   1344 R 100.0  3.1   1:02.55 stress-ng-vm
          44 root      20   0  291064 253740   1344 R 100.0  3.1   1:03.47 stress-ng-vm
          43 root      20   0  291064 253740   1344 R  93.3  3.1   1:04.10 stress-ng-vm
          45 root      20   0  291064 253740   1344 R  60.0  3.1   1:02.40 stress-ng-vm
           1 root      20   0  112976  15316   6820 S   0.0  0.2   0:25.18 web-show
          34 root      20   0   41060   7840   5452 S   0.0  0.1   0:00.00 stress-ng
          35 root      20   0   41064   2448     52 S   0.0  0.0   0:00.00 stress-ng-vm
          36 root      20   0   41704   9252   3540 S   0.0  0.1   0:01.97 stress-ng-cpu
          37 root      20   0   41064   2448     52 S   0.0  0.0   0:00.00 stress-ng-vm
          38 root      20   0   41704   9192   3476 S   0.0  0.1   0:01.62 stress-ng-cpu
          39 root      20   0   41064   2448     52 S   0.0  0.0   0:00.00 stress-ng-vm
          40 root      20   0   41704   9252   3540 S   0.0  0.1   0:01.64 stress-ng-cpu
          41 root      20   0   41064   2448     52 S   0.0  0.0   0:00.00 stress-ng-vm
          42 root      20   0   41704   9192   3476 S   0.0  0.1   0:01.66 stress-ng-cpu
          47 root      20   0   11832   2684   2456 S   0.0  0.0   0:00.01 sh
       sh-4.2# top
       top - 03:19:13 up 7 days, 23:05,  0 users,  load average: 8.91, 3.79, 1.48
       Tasks:  16 total,   5 running,  11 sleeping,   0 stopped,   0 zombie
       %Cpu(s): 98.6 us,  1.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.1 si,  0.0 st
       KiB Mem :  8154912 total,   239884 free,  2111728 used,  5803300 buff/cache
       KiB Swap:        0 total,        0 free,        0 used.  5718512 avail Mem

         PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
          46 root      20   0  291064 253740   1344 R  94.3  3.1   2:11.74 stress-ng-vm
          45 root      20   0  291064 253740   1344 R  93.7  3.1   2:13.74 stress-ng-vm
          44 root      20   0  291064 253740   1344 R  93.3  3.1   2:13.47 stress-ng-vm
          43 root      20   0  291064 253740   1344 R  91.3  3.1   2:13.76 stress-ng-vm
          38 root      20   0   41704   9192   3476 S  11.7  0.1   0:03.40 stress-ng-cpu
           1 root      20   0  112976  15316   6820 S   0.0  0.2   0:25.21 web-show
          34 root      20   0   41060   7840   5452 S   0.0  0.1   0:00.00 stress-ng
          35 root      20   0   41064   2448     52 S   0.0  0.0   0:00.00 stress-ng-vm
          36 root      20   0   41704   9252   3540 S   0.0  0.1   0:03.71 stress-ng-cpu
          37 root      20   0   41064   2448     52 S   0.0  0.0   0:00.00 stress-ng-vm
          39 root      20   0   41064   2448     52 S   0.0  0.0   0:00.00 stress-ng-vm
          40 root      20   0   41704   9252   3540 S   0.0  0.1   0:02.83 stress-ng-cpu
          41 root      20   0   41064   2448     52 S   0.0  0.0   0:00.00 stress-ng-vm
          42 root      20   0   41704   9192   3476 S   0.0  0.1   0:02.85 stress-ng-cpu
          47 root      20   0   11832   2804   2440 S   0.0  0.0   0:00.01 sh

       [1]+  Stopped(SIGSTOP)        top
       sh-4.2# uptime
        03:19:22 up 7 days, 23:05,  0 users,  load average: 8.95, 3.97, 1.56

2. Check the load on the worker node.

       [root@k8s-node1 ~]# top
       top - 11:19:55 up 7 days, 23:06,  1 user,  load average: 8.36, 4.30, 1.75
       Tasks: 189 total,   6 running, 115 sleeping,   1 stopped,   0 zombie
       %Cpu(s): 98.5 us,  1.4 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.1 si,  0.0 st
       KiB Mem :  8154912 total,   144968 free,  2029056 used,  5980888 buff/cache
       KiB Swap:        0 total,        0 free,        0 used.  5623656 avail Mem

         PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
        5484 root      20   0  291064 253740   1344 R  97.3  3.1   2:52.08 stress-ng-vm
        5485 root      20   0  322328 284904   1344 R  93.7  3.5   2:52.45 stress-ng-vm
        5486 root      20   0  322328 284860   1344 R  91.0  3.5   2:50.38 stress-ng-vm
        5483 root      20   0  322328 284792   1344 R  89.4  3.5   2:52.82 stress-ng-vm
        5476 root      20   0   41704   9252   3540 S   9.6  0.1   0:05.00 stress-ng-cpu
       13676 tidb      20   0   10.5g 209764  58128 S   9.3  2.6 432:52.32 pd-server
       22361 root      20   0 1986632 125576  70520 S   3.0  1.5 307:20.30 kubelet
       30096 root      20   0  752296  57752  35756 S   1.3  0.7  22:41.88 kube-scheduler
       31181 root      20   0  753256  62176  35336 S   1.0  0.8   5:01.58 chaos-controlle
        2340 root      20   0 1695896 108680  53292 S   0.7  1.3 107:31.90 dockerd
         873 root      20   0   21544   2704   2456 S   0.3  0.0   0:20.71 irqbalance
       25407 root      20   0  711016  14412   6096 S   0.3  0.2   0:54.73 containerd-shim
           1 root      20   0  191568   5648   3700 S   0.0  0.1   0:46.20 systemd
           2 root      20   0       0      0      0 S   0.0  0.0   0:00.38 kthreadd
           3 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 rcu_gp
       [root@k8s-node1 ~]# uptime
        11:19:58 up 7 days, 23:06,  1 user,  load average: 8.49, 4.40, 1.79

##### 3. Pausing the experiment midway makes the CPU load drop; resuming it brings the load back up, until the experiment ends.
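The stress test created in the Dashboard corresponds to a `StressChaos` object with CPU and memory stressors. The sketch below is an illustration under assumptions: `stress_manifest` is a hypothetical helper, and the worker counts and the `256MB` size are example inputs rather than the exact values the author entered in the form.

```python
import json
from typing import Optional


def stress_manifest(name: str, namespace: str, app_label: str,
                    cpu_workers: int = 0, memory_workers: int = 0,
                    memory_size: Optional[str] = None,
                    duration: str = "5m") -> dict:
    """Build a StressChaos manifest with optional CPU and memory stressors."""
    stressors = {}
    if cpu_workers > 0:
        stressors["cpu"] = {"workers": cpu_workers}
    if memory_workers > 0:
        memory = {"workers": memory_workers}
        if memory_size:
            memory["size"] = memory_size
        stressors["memory"] = memory
    if not stressors:
        raise ValueError("at least one stressor (cpu or memory) is required")
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "StressChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "mode": "all",
            "duration": duration,
            "selector": {
                "namespaces": [namespace],
                "labelSelectors": {"app": app_label},
            },
            "stressors": stressors,
        },
    }


# Example inputs (assumed): four workers of each kind, 256MB of memory.
stress = stress_manifest("pod-cpu", "chaosmesh-test", "webshow-deployment",
                         cpu_workers=4, memory_workers=4, memory_size="256MB")
print(json.dumps(stress, indent=2))
```

Each worker shows up as a `stress-ng-cpu` or `stress-ng-vm` process in `top`, which is a quick way to confirm that the injection reached the container.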

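### (3) Workflow

> To run a series of experiments as a single scenario, Chaos Mesh provides Chaos Mesh Workflow, a built-in workflow engine. With it you can run several different chaos experiments serially or in parallel to simulate production-level failures.

Chaos Mesh Workflow currently supports the following features:

- Serial orchestration
- Parallel orchestration
- Custom tasks
- Conditional branches

Example use cases:

- Use parallel orchestration to inject several NetworkChaos faults at once and simulate a complex network environment.
- Run a health check inside a serial orchestration and use a conditional branch to decide whether to run the remaining steps.

The design of Chaos Mesh Workflow borrows to some extent from Argo Workflow. If you are familiar with Argo Workflow, you will pick up Chaos Mesh Workflow quickly.

For details, see <https://chaos-mesh.org/zh/docs/create-chaos-mesh-workflow/>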
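To make serial orchestration concrete, the sketch below assembles a `Workflow` manifest whose entry template runs a packet-loss step and then a CPU-stress step one after the other. It is a hedged illustration: the helper `serial_workflow`, the template names, and the deadlines are invented for this example, and the structure (`entry`, `templates`, `templateType`, `children`) follows the Chaos Mesh Workflow documentation linked above as best understood here.

```python
import json


def serial_workflow(name: str, namespace: str, app_label: str) -> dict:
    """Build a Workflow that runs network loss, then CPU stress, in order."""
    selector = {
        "namespaces": [namespace],
        "labelSelectors": {"app": app_label},
    }
    templates = [
        {   # Entry point: runs its children one after another.
            "name": "entry",
            "templateType": "Serial",
            "deadline": "10m",
            "children": ["network-loss", "cpu-stress"],
        },
        {
            "name": "network-loss",
            "templateType": "NetworkChaos",
            "deadline": "5m",
            "networkChaos": {
                "action": "loss",
                "mode": "all",
                "direction": "to",
                "selector": selector,
                "loss": {"loss": "80", "correlation": "0"},
            },
        },
        {
            "name": "cpu-stress",
            "templateType": "StressChaos",
            "deadline": "5m",
            "stressChaos": {
                "mode": "all",
                "selector": selector,
                "stressors": {"cpu": {"workers": 3}},
            },
        },
    ]
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "Workflow",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"entry": "entry", "templates": templates},
    }


workflow = serial_workflow("loss-then-stress", "chaosmesh-test",
                           "webshow-deployment")
print(json.dumps(workflow, indent=2))
```

Changing the entry's `templateType` to `Parallel` would start both children at the same time instead.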

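### (4) Schedule

In Kubernetes, Chaos Mesh uses the `Schedule` object to describe scheduled tasks.

> The name of a `Schedule` object should not exceed 57 characters, because the chaos experiments it creates append 6 extra random characters to the name. The name of a `Schedule` object that contains a `Workflow` should not exceed 51 characters, because the Workflow also appends 6 extra random characters to the names it creates.

The `schedule` field

- The `schedule` field specifies when the experiment runs.

      ┌───────────── minute (0 - 59)
      │ ┌───────────── hour (0 - 23)
      │ │ ┌───────────── day of the month (1 - 31)
      │ │ │ ┌───────────── month (1 - 12)
      │ │ │ │ ┌───────────── day of the week (0 - 6) (Sunday to Saturday; on some systems 7 is also Sunday)
      │ │ │ │ │
      │ │ │ │ │
      │ │ │ │ │
      * * * * *

| Input                  | Description                                  | Equivalent    |
| ---------------------- | -------------------------------------------- | ------------- |
| @yearly (or @annually) | Run once a year at midnight on January 1     | 0 0 1 1 \*    |
| @monthly               | Run once a month at midnight on the first day | 0 0 1 \* \*   |
| @weekly                | Run once a week at midnight on Sunday        | 0 0 \* \* 0   |
| @daily (or @midnight)  | Run once a day at midnight                   | 0 0 \* \* \*  |
| @hourly                | Run once at the beginning of every hour      | 0 \* \* \* \* |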

##### 1. Create the schedule

##### 2. Fill in the schedule period, concurrency policy, and other information

##### 3. Submit the experiment
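The form corresponds to a `Schedule` object that wraps another experiment. The sketch below builds one with the settings used in this article (every two minutes, `Forbid` concurrency, history limit 1, starting deadline 600 seconds, a `StressChaos` with 3 CPU and 3 memory workers) and enforces the name-length limits quoted above. The helper `schedule_manifest`, its small type-to-key table, and its cron check are assumptions of this example, not Chaos Mesh code.

```python
import json

# Name limits quoted in the text: 57 characters, or 51 with a Workflow.
MAX_NAME = 57
MAX_NAME_WITH_WORKFLOW = 51

# Spec key for the embedded object; only the kinds used in this article.
_SPEC_KEYS = {
    "PodChaos": "podChaos",
    "NetworkChaos": "networkChaos",
    "StressChaos": "stressChaos",
    "Workflow": "workflow",
}


def schedule_manifest(name: str, namespace: str, cron: str,
                      chaos_type: str, chaos_spec: dict,
                      concurrency_policy: str = "Forbid",
                      history_limit: int = 1,
                      starting_deadline_seconds: int = 600) -> dict:
    """Build a Schedule manifest that periodically creates `chaos_type`."""
    if chaos_type not in _SPEC_KEYS:
        raise ValueError(f"unsupported type in this sketch: {chaos_type}")
    limit = MAX_NAME_WITH_WORKFLOW if chaos_type == "Workflow" else MAX_NAME
    if len(name) > limit:
        raise ValueError(f"name is longer than {limit} characters")
    if not cron.startswith("@") and len(cron.split()) != 5:
        raise ValueError("expected five cron fields or an @shortcut")
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "Schedule",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "schedule": cron,
            "concurrencyPolicy": concurrency_policy,
            "historyLimit": history_limit,
            "startingDeadlineSeconds": starting_deadline_seconds,
            "type": chaos_type,
            _SPEC_KEYS[chaos_type]: chaos_spec,
        },
    }


stress_spec = {
    "mode": "all",
    "duration": "10m",
    "selector": {"namespaces": ["chaosmesh-test"]},
    "stressors": {
        "cpu": {"workers": 3},
        "memory": {"workers": 3, "size": "1024m"},
    },
}
schedule = schedule_manifest("schedule-01", "chaosmesh-test", "*/2 * * * *",
                             "StressChaos", stress_spec)
print(json.dumps(schedule, indent=2))
```

Note the interaction between these settings: with a 10-minute experiment, a 2-minute period, and the `Forbid` policy, most ticks are skipped because the previous experiment is still running.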

##### 4. Observe the result (the schedule fires every two minutes)

> Check the CPU load of the pod and of the worker node that hosts it, and view the schedule information on the master node.

1. View the information on the master node.

       kubectl get pod,PodChaos,StressChaos,schedule -n chaosmesh-test -owide          Sat Apr  2 12:03:35 2022

       NAME                                      READY   STATUS    RESTARTS   AGE   IP            NODE        NOMINATED NODE   READINESS GATES
       pod/webshow-deployment-6cbdcc4cd4-ljbtk   1/1     Running   7          21h   10.244.3.39   k8s-node1   <none>           <none>

       NAME                                          AGE
       podchaos.chaos-mesh.org/pod-containers-kill   21h
       podchaos.chaos-mesh.org/pod-failure-01        15h
       podchaos.chaos-mesh.org/pod-kill              23h
       podchaos.chaos-mesh.org/pod-kill-all          21h
       podchaos.chaos-mesh.org/pod-kill03            22h

       NAME                                           DURATION
       stresschaos.chaos-mesh.org/cpu-test-01         10m
       stresschaos.chaos-mesh.org/pod-cpu             5m
       stresschaos.chaos-mesh.org/schedule-01-j9n5f   10m

       NAME                                  AGE
       schedule.chaos-mesh.org/schedule-01   33m

   View the details of the schedule:

       [root@k8s-master ~]# kubectl describe schedule.chaos-mesh.org/schedule-01 -nchaosmesh-test
       Name:         schedule-01
       Namespace:    chaosmesh-test
       Labels:       <none>
       Annotations:  experiment.chaos-mesh.org/pause: false
       API Version:  chaos-mesh.org/v1alpha1
       Kind:         Schedule
       Metadata:
         Creation Timestamp:  2022-04-02T03:30:07Z
         Generation:          23
         Managed Fields:
           API Version:  chaos-mesh.org/v1alpha1
           Fields Type:  FieldsV1
           fieldsV1:
             f:status:
               f:active:
               f:time:
           Manager:      chaos-controller-manager
           Operation:    Update
           Time:         2022-04-02T03:32:00Z
           API Version:  chaos-mesh.org/v1alpha1
           Fields Type:  FieldsV1
           fieldsV1:
             f:metadata:
               f:annotations:
                 .:
                 f:experiment.chaos-mesh.org/pause:
             f:spec:
               .:
               f:concurrencyPolicy:
               f:historyLimit:
               f:schedule:
               f:startingDeadlineSeconds:
               f:stressChaos:
                 .:
                 f:duration:
                 f:mode:
                 f:selector:
                   .:
                   f:namespaces:
                 f:stressors:
                   .:
                   f:cpu:
                     .:
                     f:workers:
                   f:memory:
                     .:
                     f:size:
                     f:workers:
               f:type:
             f:status:
           Manager:         chaos-dashboard
           Operation:       Update
           Time:            2022-04-02T03:36:49Z
         Resource Version:  1513442
         UID:               7a198cb5-feb6-4403-ab37-b3ceab1e954e
       Spec:
         Concurrency Policy:         Forbid
         History Limit:              1
         Schedule:                   */2 * * * *
         Starting Deadline Seconds:  600
         Stress Chaos:
           Duration:  10m
           Mode:      all
           Selector:
             Namespaces:
               chaosmesh-test
           Stressors:
             Cpu:
               Workers:  3
             Memory:
               Size:     1024m
               Workers:  3
         Type:  StressChaos
       Status:
         Active:
           API Version:       chaos-mesh.org/v1alpha1
           Kind:              StressChaos
           Name:              schedule-01-98lvp
           Namespace:         chaosmesh-test
           Resource Version:  1513440
           UID:               abcedc4b-1cb4-48ef-923e-f3c2c9cb6934
         Time:                2022-04-02T04:04:28Z
       Events:
         Type     Reason   Age   From           Message
         ----     ------   ----  ----           -------
         Normal   Spawned  35m   schedule-cron  Create new object: schedule-01-j9n5f
         Normal   Updated  35m   schedule-cron  Successfully update lastScheduleTime of resource
         Warning  Forbid   33m   schedule-cron  Forbid spawning new job because: schedule-01-j9n5f is still running
         Normal   Spawned  3m5s  schedule-cron  Create new object: schedule-01-98lvp
         Normal   Updated  3m5s  schedule-cron  Successfully update lastScheduleTime of resource
         Warning  Forbid   93s   schedule-cron  Forbid spawning new job because: schedule-01-98lvp is still running

2. Check the load of the pod.

       top - 04:06:14 up 7 days, 23:52,  0 users,  load average: 7.65, 2.91, 1.67
       Tasks:  14 total,   7 running,   6 sleeping,   1 stopped,   0 zombie
       %Cpu(s): 98.4 us,  1.4 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.2 si,  0.0 st
       KiB Mem :  8154912 total,   276480 free,  2087400 used,  5791032 buff/cache
       KiB Swap:        0 total,        0 free,        0 used.  5742844 avail Mem

         PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
          76 root      20   0  374392 335024   1336 R  73.4  4.1   1:04.01 stress-ng-vm
          74 root      20   0   41704   8012   3544 R  68.1  0.1   1:07.17 stress-ng-cpu
          71 root      20   0   41704   8012   3544 R  65.8  0.1   1:07.89 stress-ng-cpu
          75 root      20   0  374392 335024   1336 R  63.5  4.1   1:08.54 stress-ng-vm
          72 root      20   0  374392 335024   1336 R  57.5  4.1   1:08.74 stress-ng-vm
          69 root      20   0   41704   8012   3544 R  51.5  0.1   1:05.62 stress-ng-cpu
           1 root      20   0  112976  15316   6820 S   0.0  0.2   0:26.62 web-show
          47 root      20   0   11832   2804   2440 S   0.0  0.0   0:00.01 sh
          54 root      20   0   56192   3776   3264 T   0.0  0.0   0:00.00 top
          56 root      20   0   56196   3720   3208 R   0.0  0.0   0:00.63 top
          67 root      20   0   41056   5628   5280 S   0.0  0.1   0:00.00 stress-ng
          68 root      20   0   41060    420     64 S   0.0  0.0   0:00.00 stress-ng-vm
          70 root      20   0   41060    420     64 S   0.0  0.0   0:00.00 stress-ng-vm
          73 root      20   0   41060    420     64 S   0.0  0.0   0:00.00 stress-ng-vm

3. Check the load of the worker node.

       [root@k8s-node1 ~]# uptime
        12:06:51 up 7 days, 23:53,  1 user,  load average: 7.82, 3.46, 1.90

##### 5. Pause the schedule

> Unlike a `CronJob`, pausing a `Schedule` not only stops it from creating new experiments but also pauses the experiments it has already created.

1. If you temporarily do not want the schedule to create chaos experiments, add the annotation experiment.chaos-mesh.org/pause=true to the Schedule object. You can add it with kubectl:

       kubectl annotate -n $NAMESPACE schedule $NAME experiment.chaos-mesh.org/pause=true

   Output:

       schedule/$NAME annotated

2. To resume, remove the annotation with the following command:

       kubectl annotate -n $NAMESPACE schedule $NAME experiment.chaos-mesh.org/pause-

   Output:

       schedule/$NAME annotated

#### Note: where the mode types are defined

https://github.com/chaos-mesh/chaos-mesh/blob/master/api/v1alpha1/selector.go

    const (
        // OneMode represents that the system will do the chaos action on one object selected randomly.
        OneMode SelectorMode = "one"
        // AllMode represents that the system will do the chaos action on all objects
        // regardless of status (not ready or not running pods includes).
        // Use this label carefully.
        AllMode SelectorMode = "all"
        // FixedMode represents that the system will do the chaos action on a specific number of running objects.
        FixedMode SelectorMode = "fixed"
        // FixedPercentMode to specify a fixed % that can be inject chaos action.
        FixedPercentMode SelectorMode = "fixed-percent"
        // RandomMaxPercentMode to specify a maximum % that can be inject chaos action.
        RandomMaxPercentMode SelectorMode = "random-max-percent"
    )
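To illustrate what the five modes mean for a list of candidate pods, here is a small simulation. It is a sketch of the semantics as described in this article, not a port of the Go implementation: `select_targets` is a hypothetical function, and the rounding (floor) and the lower bound of zero for `random-max-percent` are assumptions of this example.

```python
import math
import random
from typing import List, Optional


def select_targets(pods: List[str], mode: str, value: Optional[int] = None,
                   rng: Optional[random.Random] = None) -> List[str]:
    """Pick target pods according to a Chaos Mesh style selector mode."""
    rng = rng or random.Random()
    if not pods:
        return []
    if mode == "one":
        return [rng.choice(pods)]
    if mode == "all":
        return list(pods)
    if value is None:
        raise ValueError(f"mode {mode!r} needs a value")
    if mode == "fixed":
        return rng.sample(pods, min(value, len(pods)))
    # Percentage modes: floor rounding is an assumption of this sketch.
    upper = min(math.floor(len(pods) * value / 100), len(pods))
    if mode == "fixed-percent":
        return rng.sample(pods, upper)
    if mode == "random-max-percent":
        return rng.sample(pods, rng.randint(0, upper))
    raise ValueError(f"unknown mode: {mode!r}")


pods = [f"pod-{i}" for i in range(5)]
for mode, value in [("one", None), ("all", None), ("fixed", 2),
                    ("fixed-percent", 40), ("random-max-percent", 60)]:
    print(mode, select_targets(pods, mode, value, random.Random(0)))
```

With five candidate pods, `fixed` with value 2 and `fixed-percent` with value 40 both pick two pods, while `random-max-percent` with value 60 picks at most three.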


Original link: http://xie.infoq.cn/article/3792d693b7c4fb442550fc480. Please contact the author before reposting this article.
