
Cloud-Native Architecture: Observability with Prometheus Service Auto-Discovery

Published: December 16, 2020

1. Dynamic Service Auto-Discovery Based on kubernetes-sd-configs



1.1 Define the exporter application to be auto-discovered via kubernetes-sd-configs, adding the Prometheus annotations



apiVersion: v1
kind: Service
metadata:
  name: kafka-message-handler-svc
  annotations:
    prometheus.io/scrape: "true"    # annotation used for auto-discovery
    prometheus.io/port: "9080"
    prometheus.io/path: "/prometheus"
spec:
  type: NodePort
  selector:
    app: kafka-message-handler
  ports:
  - name: kafka-message-handler-svc-port
    protocol: TCP
    port: 9080
    targetPort: 9080
    # nodePort: 30287
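
Once the Service is applied, a quick sanity check that the annotations are in place (assuming the Service was created in the default namespace):

kubectl get svc kafka-message-handler-svc -o jsonpath='{.metadata.annotations}'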



1.2 Configure scrape jobs that pull data via kubernetes-sd-configs auto-discovery



Add prometheus-additional.yaml containing the Kubernetes auto-discovery scrape jobs for Prometheus:

- job_name: 'kubernetes-service'
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
    action: replace
    target_label: __scheme__
    regex: (https?)
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
    action: replace
    target_label: __address__
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
  - action: labelmap
    regex: __meta_kubernetes_service_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_service_name]
    action: replace
    target_label: kubernetes_name
- job_name: 'kubernetes-pod'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    action: replace
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    target_label: __address__
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: kubernetes_pod_name
- job_name: 'external-metrics'
  static_configs:
  - targets: ['10.157.227.195:9080']
  metrics_path: '/prometheus'
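
Since prometheus-additional.yaml is only a scrape_configs fragment, promtool cannot validate it directly; a small sketch that wraps the fragment in a throwaway full config first (assumes promtool, which ships with Prometheus, is on the PATH):

# Nest the fragment under a scrape_configs: key, then validate.
printf 'scrape_configs:\n' > /tmp/check.yaml
sed 's/^/  /' prometheus-additional.yaml >> /tmp/check.yaml
promtool check config /tmp/check.yaml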



Create a new Secret named additional-configs from the file prometheus-additional.yaml:

# If re-creating: kubectl delete secret additional-configs -n monitoring
kubectl create secret generic additional-configs --from-file=prometheus-additional.yaml -n monitoring
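
To make the step re-runnable without the manual delete, the usual client-side dry-run idiom also works:

kubectl create secret generic additional-configs \
  --from-file=prometheus-additional.yaml -n monitoring \
  --dry-run=client -o yaml | kubectl apply -f -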



Update the Prometheus custom resource, adding additionalScrapeConfigs with name: additional-configs and key: prometheus-additional.yaml:

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  labels:
    prometheus: k8s
  name: k8s
  namespace: monitoring
spec:
  retention: 30d
  alerting:
    alertmanagers:
    - name: alertmanager-main
      namespace: monitoring
      port: web
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: prometheus-data-db
        resources:
          requests:
            storage: 30Gi
  baseImage: quay.mirrors.ustc.edu.cn/prometheus/prometheus
  nodeSelector:
    kubernetes.io/os: linux
  podMonitorNamespaceSelector: {}
  podMonitorSelector: {}
  replicas: 1
  resources:
    requests:
      memory: 400Mi
  ruleSelector:
    matchLabels:
      prometheus: k8s
      role: alert-rules
  securityContext:
    fsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  additionalScrapeConfigs:
    name: additional-configs
    key: prometheus-additional.yaml
  serviceAccountName: prometheus-k8s
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector: {}
  version: v2.11.0
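
After applying, it is worth confirming that the reference was accepted:

# Should echo back the Secret name and key configured above.
kubectl get prometheus k8s -n monitoring -o yaml | grep -A 2 additionalScrapeConfigs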



1.3 Test auto-discovery and the corresponding metrics scraping



Apply the updated Prometheus manifest, restarting the pod if necessary:

kubectl apply -f prometheus-prometheus.yaml

# Force a restart if needed: kubectl replace --force -f prometheus-prometheus.yaml, or kubectl delete pod -n monitoring prometheus-k8s-0
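
The scrape jobs can also be listed through the HTTP API instead of the UI (a sketch assuming the kube-prometheus default Service name prometheus-k8s and that jq is installed):

kubectl -n monitoring port-forward svc/prometheus-k8s 9090:9090 &
sleep 2   # give the port-forward a moment to establish
curl -s http://localhost:9090/api/v1/targets | jq -r '.data.activeTargets[].labels.job' | sort -u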



After the restart, the Status > Targets and Service Discovery pages in the Prometheus console show neither the kubernetes-service nor the kubernetes-pod job; only external-metrics appears. Yet under Status > Configuration, all three jobs from additional-configs are present.






1.4 Troubleshooting auto-discovery that does not take effect



Inspecting the Prometheus logs shows the cause: the ClusterRole does not grant enough permissions:

level=error ts=2020-12-16T08:58:22.212Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="/app/discovery/kubernetes/kubernetes.go:263: Failed to list *v1.Endpoints: endpoints is forbidden: User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list resource \"endpoints\" in API group \"\" at the cluster scope"
level=error ts=2020-12-16T08:58:22.221Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="/app/discovery/kubernetes/kubernetes.go:283: Failed to list *v1.Pod: pods is forbidden: User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list resource \"pods\" in API group \"\" at the cluster scope"
level=error ts=2020-12-16T08:58:23.193Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="/app/discovery/kubernetes/kubernetes.go:264: Failed to list *v1.Service: services is forbidden: User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list resource \"services\" in API group \"\" at the cluster scope"
...(the same endpoints/pods/services errors repeat every second)
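
A quick way to confirm the RBAC gap is kubectl auth can-i with impersonation (a sketch; your own account needs impersonation rights):

# Each prints "no" before the fix and "yes" after.
kubectl auth can-i list endpoints --as=system:serviceaccount:monitoring:prometheus-k8s --all-namespaces
kubectl auth can-i list pods --as=system:serviceaccount:monitoring:prometheus-k8s --all-namespaces
kubectl auth can-i list services --as=system:serviceaccount:monitoring:prometheus-k8s --all-namespaces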

The existing prometheus-k8s ClusterRole grants only the following, which is not enough to list or watch endpoints, services, and pods:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus-k8s
rules:
- apiGroups:
  - ""
  resources:
  - nodes/metrics
  verbs:
  - get
- nonResourceURLs:
  - /metrics
  verbs:
  - get
Extend the role to allow get, list, and watch on nodes, services, endpoints, and pods:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus-k8s
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  - services
  - endpoints
  - pods
  - nodes/proxy
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - configmaps
  - nodes/metrics
  verbs:
  - get
- nonResourceURLs:
  - /metrics
  verbs:
  - get
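
Apply the updated role; the file name below follows the kube-prometheus manifest convention and is an assumption, so substitute your own path:

kubectl apply -f prometheus-clusterRole.yaml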


After the role is updated, the permission errors stop, the kubernetes-service and kubernetes-pod jobs show up under Status > Targets, and the logs return to normal TSDB activity:

level=info ts=2020-12-16T09:00:03.638Z caller=compact.go:495 component=tsdb msg="write block" mint=1608098400000 maxt=1608105600000 ulid=01ESNCE6T4BJKR81BX12QZFS6S duration=1.393342928s
level=info ts=2020-12-16T09:00:03.884Z caller=head.go:586 component=tsdb msg="head GC completed" duration=129.317903ms
level=info ts=2020-12-16T09:00:05.687Z caller=head.go:656 component=tsdb msg="WAL checkpoint complete" first=26 last=29 duration=1.80239668s
level=info ts=2020-12-16T09:00:07.255Z caller=compact.go:440 component=tsdb msg="compact blocks" count=3 mint=1608076800000 maxt=1608098400000 ulid=01ESNCEA9B4H06MPDCBDWJQJS5 sources="[01ESMQV1240989FW9H7XRD4JG2 01ESMYPRA4Q94FDJM4SWY5E07K 01ESN6CZGGPRFGWHJC49BYKB4F]" duration=1.451881021s



2. Using the ServiceMonitor CRD to Generate kubernetes-sd-configs Configuration



An example:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app: cloudnativelabcom
    release: prometheus-operator
  name: cloudnativelabcom
  namespace: monitoring
spec:
  endpoints:
  - interval: 15s
    port: http-metrics
  namespaceSelector:
    matchNames:
    - default
  selector:
    matchLabels:
      app: cloudnativelabcomapp
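
For this ServiceMonitor to select anything, a Service in the default namespace must carry the app: cloudnativelabcomapp label and expose a port named http-metrics. A hypothetical companion Service (the name and port number are assumptions, not taken from the original deployment):

apiVersion: v1
kind: Service
metadata:
  name: cloudnativelabcomapp-svc    # hypothetical name
  namespace: default                # must fall within spec.namespaceSelector
  labels:
    app: cloudnativelabcomapp       # matched by spec.selector.matchLabels
spec:
  selector:
    app: cloudnativelabcomapp
  ports:
  - name: http-metrics              # matched by spec.endpoints[].port
    port: 9080                      # assumed
    targetPort: 9080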



In practice, many services are exposed to Prometheus for scraping via ServiceMonitor resources:

# kubectl get servicemonitor -A
NAMESPACE    NAME                      AGE
monitoring   alertmanager              27h
monitoring   coredns                   27h
monitoring   grafana                   27h
monitoring   kube-apiserver            27h
monitoring   kube-controller-manager   27h
monitoring   kube-scheduler            27h
monitoring   kube-state-metrics        27h
monitoring   kubelet                   27h
monitoring   nginx                     4h34m
monitoring   node-exporter             27h
monitoring   prometheus                27h
monitoring   prometheus-operator       27h
monitoring   prometheus-pushgateway    27h

The Operator renders each ServiceMonitor into a scrape job named <namespace>/<servicemonitor-name>/<endpoint-index>; the one defined above becomes:
- job_name: monitoring/cloudnativelabcom/0
  honor_timestamps: true
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - default
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_label_app]
    separator: ;
    regex: cloudnativelabcomapp
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_endpoint_port_name]
    separator: ;
    regex: http-metrics
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
    separator: ;
    regex: Node;(.*)
    target_label: node
    replacement: ${1}
    action: replace
  - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
    separator: ;
    regex: Pod;(.*)
    target_label: pod
    replacement: ${1}
    action: replace
  - source_labels: [__meta_kubernetes_namespace]
    separator: ;
    regex: (.*)
    target_label: namespace
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: service
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_name]
    separator: ;
    regex: (.*)
    target_label: pod
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: job
    replacement: ${1}
    action: replace
  - separator: ;
    regex: (.*)
    target_label: endpoint
    replacement: http-metrics
    action: replace



3. Manual Updates



View the current Prometheus configuration (the same content as Status > Configuration in the Prometheus console):

kubectl get secret -n monitoring prometheus-k8s -o json | jq -r '.data."prometheus.yaml.gz"' | base64 -d | gzip -d



The plan: base64-decode and gunzip the Secret to export prometheus.yaml, edit it, then gzip and base64-encode it again and save it back with kubectl edit secret. An example of the workflow:

kubectl get secret -n monitoring prometheus-k8s -o json | jq -r '.data."prometheus.yaml.gz"' | base64 -d | gzip -d > prometheus.yaml
## Edit prometheus.yaml, e.g. change the default scrape interval or an existing kubernetes_sd_configs entry
gzip prometheus.yaml
base64 prometheus.yaml.gz -w 0   ## -w 0 disables line wrapping
kubectl edit secret -n monitoring prometheus-k8s   ## paste the base64 output above as the new value of prometheus.yaml.gz
## Verify afterwards: kubectl get secret -n monitoring prometheus-k8s -o json | jq -r '.data."prometheus.yaml.gz"' | base64 -d | gzip -d

# Restart to take effect if needed: kubectl replace --force -f prometheus-prometheus.yaml, or kubectl delete pod -n monitoring prometheus-k8s-0
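
The same round trip can be scripted with kubectl patch instead of the interactive kubectl edit (a sketch; note that the Prometheus Operator manages this Secret, so a later reconciliation may regenerate it and revert manual edits):

# Re-encode the edited prometheus.yaml and patch it back into the Secret.
NEW=$(gzip -c prometheus.yaml | base64 -w 0)
kubectl patch secret prometheus-k8s -n monitoring \
  --type merge -p "{\"data\":{\"prometheus.yaml.gz\":\"$NEW\"}}"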


