A High-Availability Scheme for Monitoring Components in a Same-City, Dual-Data-Center TiDB Deployment

2023-09-01
Beijing

Author: Prest13. Original source: https://tidb.net/blog/44b9b8b1

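Background

When a TiDB DR Auto-Sync cluster is deployed across two data centers, the monitoring stack should be highly available as well. To that end, an independent Prometheus + Alertmanager + Grafana stack is deployed in each of the two physically separate data centers, so that monitoring stays accessible through either one.
This architecture has to address two issues: keeping the data collected by the two monitoring stacks consistent, and avoiding duplicate alert notifications.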

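Approach

The two Prometheus instances each scrape and store the cluster's monitoring data independently;

Each Grafana instance uses its own Prometheus as its data source;

The Alertmanager instances are configured as a cluster. Through the gossip mechanism, when several Alertmanager instances receive the same alert event, only one of them sends the notification.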

Simulated Implementation

Test environment

TiDB v7.1.0 LTS

Deploying two monitoring stacks for a single cluster

# # Server configs are used to specify the configuration of Prometheus Server.
monitoring_servers:
  - host: 30.0.100.40
    port: 9091
    deploy_dir: "/tidb/tidb-deploy/prometheus-8249"
    data_dir: "/data/tidb-data/prometheus-8249"
    log_dir: "/data/tidb-deploy/prometheus-8249/log"
  - host: 30.0.100.42
    port: 9091
    deploy_dir: "/tidb/tidb-deploy/prometheus-8249"
    data_dir: "/data/tidb-data/prometheus-8249"
    log_dir: "/data/tidb-deploy/prometheus-8249/log"

# # Server configs are used to specify the configuration of Grafana Servers.
grafana_servers:
  - host: 30.0.100.40
    deploy_dir: /data/tidb-deploy/grafana-3000
  - host: 30.0.100.42
    deploy_dir: /data/tidb-deploy/grafana-3000

# # Server configs are used to specify the configuration of Alertmanager Servers.
alertmanager_servers:
  - host: 30.0.100.40
    deploy_dir: "/data/tidb-deploy/alertmanager-9093"
    data_dir: "/data/tidb-data/alertmanager-9093"
    log_dir: "/data/tidb-deploy/alertmanager-9093/log"
  - host: 30.0.100.42
    deploy_dir: "/data/tidb-deploy/alertmanager-9093"
    data_dir: "/data/tidb-data/alertmanager-9093"
    log_dir: "/data/tidb-deploy/alertmanager-9093/log"


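Adjusting the monitoring data path

Adjust the Grafana datasource.

Check the Prometheus configuration and set the Alertmanager information.

Reuse the existing HAProxy + Keepalived setup as a reverse proxy in front of the Prometheus instances, and change the dashboard's Prometheus data source accordingly, so that the failure of a single Prometheus instance does not affect the dashboard.

The HAProxy configuration is omitted here.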
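
The failover behavior expected from the proxy can be sketched in a few lines of Go. This is only an illustration of the idea, not a replacement for HAProxy and not the author's configuration. It assumes the two Prometheus addresses from the topology above and uses Prometheus's /-/healthy management endpoint as the health check; the function names (httpProbe, firstHealthy) are hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// prometheusTargets are the two Prometheus instances from the topology.
var prometheusTargets = []string{
	"http://30.0.100.40:9091",
	"http://30.0.100.42:9091",
}

// httpProbe reports whether a Prometheus instance answers its /-/healthy
// endpoint with HTTP 200 within two seconds.
func httpProbe(base string) bool {
	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get(base + "/-/healthy")
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

// firstHealthy returns the first target accepted by probe, or "" if none is.
// The probe is passed in so the selection logic can be tested offline.
func firstHealthy(targets []string, probe func(string) bool) string {
	for _, t := range targets {
		if probe(t) {
			return t
		}
	}
	return ""
}

func main() {
	if t := firstHealthy(prometheusTargets, httpProbe); t != "" {
		fmt.Println("route queries to", t)
	} else {
		fmt.Println("no healthy Prometheus instance")
	}
}
```

Because the two Prometheus instances scrape independently, their samples can differ slightly, so a proxy that sticks to one backend until it fails is likely to give more stable graphs than one that alternates between the two.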

The dashboard configuration is as follows:

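Webhook Implementation

Write a Go program that converts Alertmanager webhook requests into Feishu API calls.

(Omitted.)

Test: using an HTTP API testing tool, confirm that the Feishu webhook program receives and parses the alert events.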
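
For readers who want a starting point, here is a minimal sketch of what such a converter might look like. It is not the author's program. It assumes the standard Alertmanager webhook JSON (only the status and alerts fields are read) and the text message format of a Feishu custom bot. The listen address :8080, the /alert path, and the FEISHU_WEBHOOK_URL environment variable are hypothetical names chosen for this example.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
	"strings"
)

// webhookPayload holds the subset of Alertmanager's webhook JSON used here.
type webhookPayload struct {
	Status string  `json:"status"`
	Alerts []alert `json:"alerts"`
}

type alert struct {
	Status      string            `json:"status"`
	Labels      map[string]string `json:"labels"`
	Annotations map[string]string `json:"annotations"`
}

// feishuText is the text message body accepted by a Feishu custom bot.
type feishuText struct {
	MsgType string `json:"msg_type"`
	Content struct {
		Text string `json:"text"`
	} `json:"content"`
}

// toFeishu converts a raw Alertmanager webhook body into a Feishu text message.
func toFeishu(body []byte) (feishuText, error) {
	var p webhookPayload
	var msg feishuText
	if err := json.Unmarshal(body, &p); err != nil {
		return msg, err
	}
	var sb strings.Builder
	fmt.Fprintf(&sb, "[%s] %d alert(s)", p.Status, len(p.Alerts))
	for _, a := range p.Alerts {
		fmt.Fprintf(&sb, "\n- %s (%s): %s",
			a.Labels["alertname"], a.Status, a.Annotations["summary"])
	}
	msg.MsgType = "text"
	msg.Content.Text = sb.String()
	return msg, nil
}

// handler receives Alertmanager webhooks and forwards them to feishuURL.
func handler(feishuURL string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		body, err := io.ReadAll(r.Body)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		msg, err := toFeishu(body)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		out, _ := json.Marshal(msg) // cannot fail for this struct
		resp, err := http.Post(feishuURL, "application/json", bytes.NewReader(out))
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadGateway)
			return
		}
		resp.Body.Close()
		w.WriteHeader(http.StatusOK)
	}
}

func main() {
	// FEISHU_WEBHOOK_URL is a hypothetical variable holding the bot's webhook URL.
	http.HandleFunc("/alert", handler(os.Getenv("FEISHU_WEBHOOK_URL")))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

Keeping the conversion in a pure function (toFeishu) makes it easy to check the parsing with a sample payload before wiring the program to Alertmanager.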

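Configuring the Alertmanager webhook

Write an Alertmanager configuration file template, add the receiver and webhook definitions to it, and store it at a path on the TiUP control machine.
Use tiup cluster edit-config to add config_file under alertmanager_servers, with the path pointing to the Alertmanager configuration file written in the previous step.
Try triggering an alert and confirm that duplicate notifications are not produced.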