Prometheus Architecture Overview, Deployment, and Usage

Author: 忙着长大#

2023-04-02
Hebei

1. Functions of the Prometheus Components

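Prometheus Server: The core component of the Prometheus system. It collects and stores metric data and provides querying and alert rule evaluation. It pulls (scrapes) metrics over HTTP from targets such as Exporters, the Pushgateway, and instrumented applications, and stores the data in its local time series database.
Exporters: Standalone programs that collect metrics from a data source, such as a host, a database, or middleware, and convert them into the Prometheus format. Each Exporter exposes an HTTP endpoint that Prometheus Server scrapes periodically.
Pushgateway: Lets applications push metrics to an intermediate gateway, which Prometheus Server then scrapes. This is useful for short-lived tasks and batch jobs, which may not run continuously and therefore cannot serve metrics all the time the way an Exporter does.
Alertmanager: Handles the alerts sent by Prometheus Server and delivers the notifications. Prometheus Server periodically evaluates the alerting rules and sends an alert to Alertmanager when a rule's condition is met. Alertmanager then groups the alerts and routes them to receivers, for example by alert severity.
Grafana: A popular open-source monitoring and analytics platform that integrates with Prometheus to visualize and analyze metric data. Its charts and panels help users understand and analyze the monitoring data.
Webhook: Usually a self-written program that connects alerts to email, DingTalk, Feishu, phone calls, and similar notification channels.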
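To make the Exporter role above more concrete, here is a minimal sketch in Python of the text format that Prometheus Server scrapes over HTTP. The metric names (`demo_load1`, `demo_up`), the sample values, the port 9200, and the helpers `render_metrics` and `serve` are all hypothetical; a real exporter would normally use an official client library rather than hand-built strings.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(samples):
    """Render {(name, label_pairs): value} as Prometheus text exposition lines."""
    lines = []
    for (name, labels), value in sorted(samples.items()):
        label_text = ",".join(f'{k}="{v}"' for k, v in labels)
        selector = f"{name}{{{label_text}}}" if label_text else name
        lines.append(f"{selector} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical sample data standing in for values read from a real system.
SAMPLES = {
    ("demo_load1", (("host", "node-a"),)): 0.42,
    ("demo_up", ()): 1,
}


class MetricsHandler(BaseHTTPRequestHandler):
    """Serve the rendered samples on /metrics, the path Prometheus scrapes by default."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(SAMPLES).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=9200):
    """Run the sketch exporter until interrupted (port 9200 is an arbitrary choice)."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()


print(render_metrics(SAMPLES), end="")
```

Calling `serve()` would start the endpoint; Prometheus Server could then scrape it like any other target.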

2. Deploying Prometheus Server with Docker

A Prometheus monitoring environment can be installed in several ways. In production, pick the one method that fits your needs; whichever method is used to deploy Prometheus Server, it is used the same way afterwards:

# Install with apt or yum

~# apt install prometheus

# Install from the official binary release

https://prometheus.io/download

# Start directly from the Docker image, or orchestrate with docker-compose

https://prometheus.io/docs/prometheus/latest/installation

# Deploy on Kubernetes with the operator

https://github.com/prometheus-operator/kube-prometheus

docker run -d --name prometheus --network=host \
  -v /app/prome_config:/etc/prometheus \
  -v /data/prometheus:/data \
  -v /etc/hosts:/etc/hosts \
  -v /etc/localtime:/etc/localtime \
  --restart=always docker.io/prom/prometheus:v2.24.0 \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.retention.time=30d \
  --storage.tsdb.path=/data \
  --web.read-timeout=2m \
  --query.timeout=2m \
  --web.enable-lifecycle \
  --web.enable-admin-api \
  --web.listen-address=0.0.0.0:9090


# Enter the container with docker exec, or look at the mounted host directory.
# The default configuration scrapes only localhost:9090; the address and port of each exporter must be added.
/prometheus $ cd /etc/prometheus/
/etc/prometheus $ ls
console_libraries  consoles           prometheus.yml
/etc/prometheus $ cat prometheus.yml
# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['localhost:9090']
/etc/prometheus $


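Main Prometheus startup flags:

--config.file="prometheus.yml" # path to the configuration file
--web.listen-address="0.0.0.0:9090" # address and port to listen on
--storage.tsdb.path="data/" # data storage directory
--storage.tsdb.retention.size= # maximum total size of stored blocks to retain; units: B, KB, MB, GB, TB, PB, EB (e.g. 512MB); disabled by default
--storage.tsdb.retention.time= # how long to keep data, 15 days by default
--query.timeout=2m # maximum time a query may run before it is aborted
--query.max-concurrency=20 # maximum number of queries executed concurrently
--web.read-timeout=5m # maximum time before a request read times out and idle connections are closed
--web.max-connections=512 # maximum number of simultaneous connections
--web.enable-lifecycle # enable reloading the configuration (and shutting down) through the HTTP API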
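As an illustration of what `--web.enable-lifecycle` makes possible, the following sketch builds the HTTP request that asks a running Prometheus to reload its configuration through the `/-/reload` endpoint. The helper names `build_reload_request` and `reload_config` and the address `http://localhost:9090` are assumptions for this example; only the request is built here, and nothing is sent until `reload_config` is called.

```python
import urllib.request


def build_reload_request(base_url):
    """Build (but do not send) the POST request for the /-/reload lifecycle endpoint."""
    url = base_url.rstrip("/") + "/-/reload"
    return urllib.request.Request(url, data=b"", method="POST")


def reload_config(base_url, timeout=5):
    """Send the reload request and return the HTTP status code.

    This only works if Prometheus was started with --web.enable-lifecycle.
    """
    with urllib.request.urlopen(build_reload_request(base_url), timeout=timeout) as resp:
        return resp.status


# Assumed address: the Prometheus Server started above, reached from the same host.
req = build_reload_request("http://localhost:9090")
print(req.get_method(), req.full_url)
```

After editing prometheus.yml, calling `reload_config("http://localhost:9090")` would apply the change without restarting the container.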

3. Deploying node-exporter with Docker and Collecting Its Metrics with Prometheus

docker run -d --name node_exporter \
  --restart=always --net="host" --pid="host" \
  -v "/proc:/host/proc:ro" \
  -v /etc/localtime:/etc/localtime \
  -v "/sys:/host/sys:ro" \
  -v "/:/rootfs:ro" \
  prom/node-exporter \
  --path.procfs=/host/proc \
  --path.rootfs=/rootfs \
  --path.sysfs=/host/sys \
  --collector.filesystem.ignored-mount-points='^/(sys|proc|dev|host|etc)($$|/)'
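Once the container is running, node-exporter serves its metrics as plain text on port 9100 under /metrics. The sketch below parses a small, made-up excerpt of such output to show what Prometheus reads from it. The sample text and the helper `parse_samples` are hypothetical, and the parser is deliberately simplified: it ignores timestamps and escaped characters.

```python
def parse_samples(text):
    """Parse simple exposition lines into (series, value) pairs, skipping comments and blanks."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated field on the line.
        series, _, value = line.rpartition(" ")
        samples.append((series, float(value)))
    return samples


# Hypothetical excerpt of what http://<host>:9100/metrics might return.
SAMPLE_OUTPUT = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.21
node_memory_MemTotal_bytes 8.201752576e+09
node_filesystem_avail_bytes{device="/dev/sda1",mountpoint="/"} 1.5e+10
"""

metrics = dict(parse_samples(SAMPLE_OUTPUT))
print(metrics["node_load1"])
```

Fetching the real page with `urllib.request.urlopen` and passing the decoded body to `parse_samples` is a quick way to confirm that the exporter is up before adding it as a scrape target.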


4. Installing Grafana, Adding the Prometheus Data Source, and Importing a Dashboard Template to Graph the Metrics

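# With --network=host the container shares the host network, so Grafana listens on host port 3000 directly;
# -p is not needed (Docker discards published ports in host network mode).
docker run -d --name=grafana \
  --network=host \
  -v /etc/localtime:/etc/localtime \
  -v /data/grafana:/var/lib/grafana \
  grafana/grafana:6.7.4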


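The default username and password are admin/admin.

Configure the data source: add a data source of type Prometheus and point it at the Prometheus Server address.

When Grafana reports that the data source is working, it has been added successfully.

Dashboard templates can be found at https://grafana.com/grafana/dashboards

Copy the ID of a template into Grafana's dashboard import page. Recommended: 10242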
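The data source can also be added without the web UI, through Grafana's HTTP API (`POST /api/datasources`). The sketch below only builds that request; the helper name `build_datasource_request`, the data source name `Prometheus`, and both addresses are assumptions for this example, and the admin/admin credentials are the defaults mentioned above.

```python
import base64
import json
import urllib.request


def build_datasource_request(grafana_url, prometheus_url, user="admin", password="admin"):
    """Build (but do not send) the request that registers Prometheus as a Grafana data source."""
    payload = {
        "name": "Prometheus",      # display name, chosen for this sketch
        "type": "prometheus",
        "url": prometheus_url,
        "access": "proxy",         # Grafana's backend queries Prometheus on the user's behalf
        "isDefault": True,
    }
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        grafana_url.rstrip("/") + "/api/datasources",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": f"Basic {token}"},
        method="POST",
    )


# Assumed addresses: Grafana and Prometheus both running on the local host.
req = build_datasource_request("http://localhost:3000", "http://localhost:9090")
print(req.get_method(), req.full_url)
print(json.loads(req.data)["type"])
```

Passing `req` to `urllib.request.urlopen` would send it; scripting this step is mainly useful when Grafana is deployed repeatedly.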

5. Basic Use of PromQL Queries

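Example: node_load1{instance="192.168.131.134:9100"}
Matching metric data by label:
= : selects labels that are exactly equal to the provided string (exact match).
!= : selects labels that are not equal to the provided string (negation).
=~ : selects labels that match the provided regular expression. The regex is fully anchored, so it must match the whole label value.
!~ : selects labels that do not match the provided regular expression.
# Query format: <metric name>{<label name>=<label value>, ...}
node_load1{instance="192.168.131.134:9100"}
node_load1{job="prometheus"}
node_load1{job="prometheus",instance="192.168.131.134:9100"} # exact match
node_load1{job="prometheus",instance!="192.168.131.134:9100"} # negation
node_load1{instance=~"192.168.131.13.*:9100$"} # regex match
node_load1{instance!~"192.168.131.134:9100"} # regex negation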
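The four matchers above can be sketched in a few lines of Python. This is only an approximation for illustration: the function `matches` and the third series are made up, and Python's `re` module stands in for the RE2 syntax that Prometheus actually uses. `re.fullmatch` mirrors the fact that PromQL regex matchers are fully anchored.

```python
import re


def matches(labels, name, op, pattern):
    """Apply one PromQL-style label matcher to a dict of labels."""
    value = labels.get(name, "")  # a missing label behaves like an empty string
    if op == "=":
        return value == pattern
    if op == "!=":
        return value != pattern
    if op == "=~":
        return re.fullmatch(pattern, value) is not None
    if op == "!~":
        return re.fullmatch(pattern, value) is None
    raise ValueError(f"unknown matcher operator: {op}")


# Hypothetical series; the first two instances follow the article's addresses.
series = [
    {"job": "prometheus", "instance": "192.168.131.134:9100"},
    {"job": "prometheus", "instance": "192.168.131.139:9100"},
    {"job": "prometheus", "instance": "192.168.131.20:9100"},
]

selected = [
    s["instance"]
    for s in series
    if matches(s, "instance", "=~", r"192.168.131.13.*:9100$")
]
print(selected)
```

Because of the anchoring, a pattern such as `192.168` on its own selects nothing; it would need a trailing `.*` to match whole label values.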
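Specifying a time range for metric data:
  s - seconds
  m - minutes
  h - hours
  d - days
  w - weeks
  y - years
  # Instant vector selector: returns the most recent sample of each series
  node_memory_MemTotal_bytes{}
  # Range vector selector: relative to the current time, returns the samples of node_memory_MemTotal_bytes from the last 5 minutes for all nodes
  node_memory_MemTotal_bytes{}[5m]
  # Range vector selector: relative to the current time, returns the samples of node_memory_MemTotal_bytes from the last 5 minutes for the specified node
  node_memory_MemTotal_bytes{instance="172.31.1.181:9100"}[5m]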
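The difference between an instant selector and a range selector can be pictured with a small sketch over made-up samples. The helpers `parse_duration`, `instant_value`, and `range_values`, the 60-second scrape interval, and the sample values are all assumptions; the exact handling of the window boundaries and of stale data also differs between Prometheus versions, so this shows the idea rather than the real implementation.

```python
# Seconds per PromQL duration unit (a year is counted as 365 days).
DURATION_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}


def parse_duration(text):
    """Convert a simple duration such as '5m' into seconds."""
    return int(text[:-1]) * DURATION_SECONDS[text[-1]]


def instant_value(samples, now, lookback=300):
    """Most recent sample value at or before `now`, ignoring samples older than `lookback` seconds."""
    candidates = [(ts, v) for ts, v in samples if now - lookback <= ts <= now]
    return max(candidates)[1] if candidates else None


def range_values(samples, now, window):
    """All samples inside the window that ends at `now` (assumed left-open here)."""
    return [(ts, v) for ts, v in samples if now - window < ts <= now]


# Hypothetical series scraped every 60 seconds: (timestamp, value) pairs.
samples = [(t, 100.0 + t / 60) for t in range(0, 601, 60)]

print(instant_value(samples, now=600))
print(len(range_values(samples, now=600, window=parse_duration("5m"))))
```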
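Arithmetic on metric data:
  + addition
  - subtraction
  * multiplication
  / division
  % modulo
  ^ power (exponentiation)
  node_memory_MemFree_bytes/1024/1024 # convert free memory from bytes to MiB
  node_disk_read_bytes_total{device="sda"} + node_disk_written_bytes_total{device="sda"} # total bytes read from and written to the disk
  (node_disk_read_bytes_total{device="sda"} + node_disk_written_bytes_total{device="sda"}) / 1024 / 1024 # the same, converted to MiB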
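Aggregation operators and common functions:
  max() # maximum value
  min() # minimum value
  avg() # average value
  Largest received-bytes counter value for each node:
  max(node_network_receive_bytes_total) by (instance)
  Highest receive rate over the last five minutes for each device, across all nodes:
  max(rate(node_network_receive_bytes_total[5m])) by (device)
  sum() # adds up the sample values (total)
  sum(prometheus_http_requests_total)
    {} 2495 # 2495 requests in total so far; useful for totalling returned values (e.g. HTTP request counts)
  count() # counts the number of returned series
  count(node_os_version)
    {} 2 # two series are returned; useful for counting nodes, pods, and so on
  count_values() # counts how many series share each value and stores the value in a custom label, which becomes a new label
  count_values("node_version",node_os_version) # number of nodes per operating system version
    {node_version="20.04"} 2
  abs() # returns the absolute value of the sample values
  abs(sum(prometheus_http_requests_total{handler="/metrics"}))
  absent() # returns nothing if the metric has data and 1 if it has none; useful for alerting on a missing metric (fire the alert when the result equals 1)
  absent(sum(prometheus_http_requests_total{handler="/metrics"}))
  stddev() # standard deviation
  stddev(prometheus_http_requests_total) # e.g. 5+5=10 and 1+9=10 have the same sum, but 1 and 9 differ more; in a system this means the data fluctuates more and is less stable
  stdvar() # variance
  stdvar(prometheus_http_requests_total)
  topk() # the N series with the largest sample values
  # the top 6, from largest to smallest
  topk(6, prometheus_http_requests_total)
  bottomk() # the N series with the smallest sample values
  # the bottom 6, from smallest to largest
  bottomk(6, prometheus_http_requests_total)
  rate() # designed for the counter type; calculates the per-second average rate of increase over the given time range, which smooths out short spikes, so it suits relatively steady data
  rate(prometheus_http_requests_total[5m])
  rate(apiserver_request_total{code=~"^(?:2..)$"}[5m])
  rate(node_network_receive_bytes_total[5m])
  irate() # also designed for the counter type; calculates the rate from only the last two samples in the given time range, so it reacts quickly to sharp changes. As the official documentation puts it, irate suits fast-moving counters, while rate suits slow-moving counters.
  irate(prometheus_http_requests_total[5m])
  irate(node_network_receive_bytes_total[5m])
  irate(apiserver_request_total{code=~"^(?:2..)$"}[5m])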
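The contrast between rate() and irate() is easier to see on numbers. The sketch below is a simplified model, not the Prometheus implementation: it handles counter resets but leaves out the extrapolation to the window boundaries that the real rate() performs. The function names and the sample values are made up for this example.

```python
def increase_with_resets(samples):
    """Total counter increase across the samples; a drop in value is treated as a reset to zero."""
    total = 0.0
    prev = samples[0][1]
    for _, value in samples[1:]:
        total += value if value < prev else value - prev
        prev = value
    return total


def simple_rate(samples):
    """Average per-second increase over the whole window (no extrapolation)."""
    if len(samples) < 2:
        return None
    elapsed = samples[-1][0] - samples[0][0]
    return increase_with_resets(samples) / elapsed


def simple_irate(samples):
    """Per-second increase between the last two samples only."""
    if len(samples) < 2:
        return None
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    delta = v1 if v1 < v0 else v1 - v0
    return delta / (t1 - t0)


# Hypothetical counter: steady at 1 request/s, then a burst in the last minute.
window = [(0, 0.0), (60, 60.0), (120, 120.0), (180, 180.0), (240, 840.0)]

print(simple_rate(window))   # averaged over the whole window
print(simple_irate(window))  # reflects only the final burst
```

On this data the averaged rate is 3.5 per second while the instantaneous rate is 11 per second, which is why irate follows sudden spikes closely and rate gives a smoother line.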
  # by: keeps only the labels listed in by in the result and removes all others
  sum(rate(node_network_receive_packets_total{instance=~".*"}[10m])) by (instance)
  sum(node_memory_MemFree_bytes) by (instance)
  # without: removes the listed instance and job labels from the result and keeps the others
  sum(prometheus_http_requests_total) without (instance,job)
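How by and without decide which series end up in the same group can be sketched as follows. The function `aggregate_sum` and the three sample series are hypothetical; the sketch only models sum(), and it drops the metric name from the result the way PromQL aggregations do.

```python
from collections import defaultdict


def aggregate_sum(series, by=None, without=None):
    """Sum values per group, where the group key is the set of labels kept by `by` or `without`."""
    groups = defaultdict(float)
    for labels, value in series:
        if by is not None:
            kept = [(k, v) for k, v in labels.items() if k in by]
        else:
            drop = set(without or ()) | {"__name__"}
            kept = [(k, v) for k, v in labels.items() if k not in drop]
        groups[tuple(sorted(kept))] += value
    return dict(groups)


# Hypothetical per-device receive rates on two nodes.
series = [
    ({"instance": "a:9100", "device": "eth0", "job": "node"}, 10.0),
    ({"instance": "a:9100", "device": "lo", "job": "node"}, 1.0),
    ({"instance": "b:9100", "device": "eth0", "job": "node"}, 20.0),
]

per_instance = aggregate_sum(series, by=["instance"])
per_device = aggregate_sum(series, without=["instance", "job"])
print(per_instance)
print(per_device)
```

With `by=["instance"]` the two devices of node `a` collapse into one value per node, while `without=["instance", "job"]` leaves `device` as the only grouping label.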


6. Deploying a Prometheus Federation and Collecting Metrics Through It

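Deploy Prometheus and node_exporter on a second machine in the same way. In its configuration file, prefer IP addresses over localhost for the targets. Then, on the first Prometheus server, add a scrape job that collects the metrics of the second Prometheus server. Together the two servers form a Prometheus federation.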

  - job_name: 'prometheus-federate'
    scrape_interval: 10s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="prometheus"}'
        - '{__name__=~"job:.*"}'
        - '{__name__=~"node.*"}'
    static_configs:
      - targets:
        - '192.168.131.139:9090'


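The data of the second Prometheus server can now be queried on the first one, so the metrics of several Prometheus deployments are collected, displayed, and managed in one place.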
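Under the hood, the federation job is an ordinary HTTP scrape of the `/federate` path with one `match[]` parameter per selector. The sketch below builds that URL so it can be opened by hand to check what the second server would hand over. The helper name `federate_url` is hypothetical; the address and the selectors are the ones from the job above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def federate_url(base_url, matchers):
    """Build the /federate URL with one match[] query parameter per series selector."""
    query = urlencode({"match[]": matchers}, doseq=True)
    return f"{base_url.rstrip('/')}/federate?{query}"


matchers = ['{job="prometheus"}', '{__name__=~"job:.*"}', '{__name__=~"node.*"}']
url = federate_url("http://192.168.131.139:9090", matchers)
print(url)

# Decoding the query string again shows the selectors survive URL encoding unchanged.
print(parse_qs(urlsplit(url).query)["match[]"])
```

Fetching this URL with curl or a browser returns the matching series in the text exposition format, which is a quick way to check the selectors before relying on the federation job.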
