Cloud-Native Architecture: Observability with Prometheus Service Auto-Discovery
1. Dynamic service auto-discovery based on kubernetes_sd_configs
1.1 Define the exporter application to be auto-discovered via kubernetes_sd_configs and add the Prometheus annotations
apiVersion: v1
kind: Service
metadata:
  name: kafka-message-handler-svc
  annotations:
    prometheus.io/scrape: "true"    # annotation used for auto-discovery
    prometheus.io/port: "9080"
    prometheus.io/path: "/prometheus"
spec:
  type: NodePort
  selector:
    app: kafka-message-handler
  ports:
  - name: kafka-message-handler-svc-port
    protocol: TCP
    port: 9080
    targetPort: 9080
#   nodePort: 30287
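A quick sanity check that the annotations are actually set on the Service (the service name is the one from the manifest above):

# Print the annotations on the Service; prometheus.io/scrape, prometheus.io/port
# and prometheus.io/path should all appear in the output.
kubectl get svc kafka-message-handler-svc -o jsonpath='{.metadata.annotations}{"\n"}'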
1.2 Configure the scrape (data pull) settings for kubernetes_sd_configs auto-discovery
Add the Prometheus auto-discovery configuration for Kubernetes in prometheus-additional.yaml:
- job_name: 'kubernetes-service'
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
    action: replace
    target_label: __scheme__
    regex: (https?)
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
    action: replace
    target_label: __address__
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
  - action: labelmap
    regex: __meta_kubernetes_service_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_service_name]
    action: replace
    target_label: kubernetes_name
- job_name: 'kubernetes-pod'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    action: replace
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    target_label: __address__
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: kubernetes_pod_name
- job_name: 'external-metrics'
  static_configs:
  - targets: ['10.157.227.195:9080']
  metrics_path: '/prometheus'
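Optionally, the fragment can be syntax-checked locally before loading it into the cluster. A minimal sketch, assuming promtool from a Prometheus release is on the PATH; it wraps the fragment in a throwaway top-level config so promtool can validate it:

# Build a minimal config around the scrape-config fragment and validate the syntax.
cat > /tmp/check-additional.yaml <<'EOF'
global:
  scrape_interval: 15s
scrape_configs:
EOF
cat prometheus-additional.yaml >> /tmp/check-additional.yaml
promtool check config /tmp/check-additional.yaml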
Create a new secret named additional-configs from the file prometheus-additional.yaml:
# kubectl delete secret additional-configs -n monitoring   # delete the old secret first if it already exists
kubectl create secret generic additional-configs --from-file=prometheus-additional.yaml -n monitoring
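To confirm the secret was created as expected, the key can be read back and decoded:

# The data key should be prometheus-additional.yaml and decode to the file contents.
kubectl get secret additional-configs -n monitoring \
  -o jsonpath='{.data.prometheus-additional\.yaml}' | base64 -d | head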
Update the Prometheus custom resource (prometheus-prometheus.yaml), adding additionalScrapeConfigs with name: additional-configs and key: prometheus-additional.yaml:
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  labels:
    prometheus: k8s
  name: k8s
  namespace: monitoring
spec:
  retention: 30d
  alerting:
    alertmanagers:
    - name: alertmanager-main
      namespace: monitoring
      port: web
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: prometheus-data-db
        resources:
          requests:
            storage: 30Gi
  baseImage: quay.mirrors.ustc.edu.cn/prometheus/prometheus
  nodeSelector:
    kubernetes.io/os: linux
  podMonitorNamespaceSelector: {}
  podMonitorSelector: {}
  replicas: 1
  resources:
    requests:
      memory: 400Mi
  ruleSelector:
    matchLabels:
      prometheus: k8s
      role: alert-rules
  securityContext:
    fsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  additionalScrapeConfigs:
    name: additional-configs
    key: prometheus-additional.yaml
  serviceAccountName: prometheus-k8s
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector: {}
  version: v2.11.0
1.3 Test auto-discovery and scraping of the corresponding metrics
Apply the updated Prometheus manifest, restarting the pod if necessary:
kubectl apply -f prometheus-prometheus.yaml
# Force a restart if needed:
kubectl replace --force -f prometheus-prometheus.yaml
# or
kubectl delete pod -n monitoring prometheus-k8s-0
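Besides the web console, the discovered targets can be checked through the Prometheus HTTP API. A minimal sketch, assuming the kube-prometheus default Service name prometheus-k8s and that jq is installed:

# Port-forward the Prometheus web port and count active targets per job.
kubectl -n monitoring port-forward svc/prometheus-k8s 9090:9090 &
PF_PID=$!
sleep 3
curl -s http://localhost:9090/api/v1/targets \
  | jq -r '.data.activeTargets[].labels.job' | sort | uniq -c
kill $PF_PID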
After the restart, the kubernetes-service and kubernetes-pod jobs did not show up under Status > Targets or Status > Service Discovery in the Prometheus console; only the external-metrics job appeared. Under Status > Configuration, however, all three jobs from additional-configs were present.
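The presence of the three additional jobs in the loaded configuration can also be confirmed from the command line by dumping the generated configuration secret (the same trick used in section 3 below):

# The job_name entries kubernetes-service, kubernetes-pod and external-metrics
# from prometheus-additional.yaml should appear in the output.
kubectl -n monitoring get secret prometheus-k8s -o json \
  | jq -r '.data."prometheus.yaml.gz"' | base64 -d | gzip -d | grep 'job_name'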
1.4 Troubleshooting why auto-discovery did not take effect
Inspecting the Prometheus logs showed that the ClusterRole lacked the required permissions:
level=error ts=2020-12-16T08:58:22.212Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="/app/discovery/kubernetes/kubernetes.go:263: Failed to list *v1.Endpoints: endpoints is forbidden: User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list resource \"endpoints\" in API group \"\" at the cluster scope"
level=error ts=2020-12-16T08:58:22.221Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="/app/discovery/kubernetes/kubernetes.go:283: Failed to list *v1.Pod: pods is forbidden: User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list resource \"pods\" in API group \"\" at the cluster scope"
level=error ts=2020-12-16T08:58:23.193Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="/app/discovery/kubernetes/kubernetes.go:265: Failed to list *v1.Pod: pods is forbidden: User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list resource \"pods\" in API group \"\" at the cluster scope"
level=error ts=2020-12-16T08:58:23.193Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="/app/discovery/kubernetes/kubernetes.go:264: Failed to list *v1.Service: services is forbidden: User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list resource \"services\" in API group \"\" at the cluster scope"
(The same endpoints, pods, and services "forbidden" errors repeat roughly every second.)
The existing prometheus-k8s ClusterRole only grants the following permissions, with no list/watch access to services, endpoints, or pods:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus-k8s
rules:
- apiGroups:
  - ""
  resources:
  - nodes/metrics
  verbs:
  - get
- nonResourceURLs:
  - /metrics
  verbs:
  - get
Extend the ClusterRole with get/list/watch permissions on nodes, services, endpoints, and pods:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus-k8s
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  - services
  - endpoints
  - pods
  - nodes/proxy
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - configmaps
  - nodes/metrics
  verbs:
  - get
- nonResourceURLs:
  - /metrics
  verbs:
  - get
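Reapply the updated ClusterRole and verify that the service account now has the required access; a sketch, assuming the kube-prometheus manifest file name prometheus-clusterRole.yaml (adjust to your setup):

kubectl apply -f prometheus-clusterRole.yaml
# Impersonate the Prometheus service account and check cluster-wide list access.
kubectl auth can-i list endpoints --all-namespaces --as=system:serviceaccount:monitoring:prometheus-k8s
kubectl auth can-i list pods --all-namespaces --as=system:serviceaccount:monitoring:prometheus-k8s
kubectl auth can-i list services --all-namespaces --as=system:serviceaccount:monitoring:prometheus-k8s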
After the updated ClusterRole is applied, the Prometheus logs return to normal:

level=info ts=2020-12-16T09:00:03.638Z caller=compact.go:495 component=tsdb msg="write block" mint=1608098400000 maxt=1608105600000 ulid=01ESNCE6T4BJKR81BX12QZFS6S duration=1.393342928s
level=info ts=2020-12-16T09:00:03.884Z caller=head.go:586 component=tsdb msg="head GC completed" duration=129.317903ms
level=info ts=2020-12-16T09:00:05.687Z caller=head.go:656 component=tsdb msg="WAL checkpoint complete" first=26 last=29 duration=1.80239668s
level=info ts=2020-12-16T09:00:07.255Z caller=compact.go:440 component=tsdb msg="compact blocks" count=3 mint=1608076800000 maxt=1608098400000 ulid=01ESNCEA9B4H06MPDCBDWJQJS5 sources="[01ESMQV1240989FW9H7XRD4JG2 01ESMYPRA4Q94FDJM4SWY5E07K 01ESN6CZGGPRFGWHJC49BYKB4F]" duration=1.451881021s
2. Using the ServiceMonitor CRD to generate kubernetes_sd_configs configuration
An example:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app: cloudnativelabcom
    release: prometheus-operator
  name: cloudnativelabcom
  namespace: monitoring
spec:
  endpoints:
  - interval: 15s
    port: http-metrics
  namespaceSelector:
    matchNames:
    - default
  selector:
    matchLabels:
      app: cloudnativelabcomapp
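The selector above matches Services in the default namespace labeled app: cloudnativelabcomapp that expose a port named http-metrics. A quick sanity check that such a Service and its Endpoints actually exist:

kubectl -n default get svc,endpoints -l app=cloudnativelabcomapp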
In practice, many services are exposed to Prometheus for scraping through ServiceMonitor resources, for example:
# kubectl get servicemonitor -A
NAMESPACE    NAME                      AGE
monitoring   alertmanager              27h
monitoring   coredns                   27h
monitoring   grafana                   27h
monitoring   kube-apiserver            27h
monitoring   kube-controller-manager   27h
monitoring   kube-scheduler            27h
monitoring   kube-state-metrics        27h
monitoring   kubelet                   27h
monitoring   nginx                     4h34m
monitoring   node-exporter             27h
monitoring   prometheus                27h
monitoring   prometheus-operator       27h
monitoring   prometheus-pushgateway    27h
The Prometheus Operator renders the ServiceMonitor shown earlier into the following scrape configuration:

- job_name: monitoring/cloudnativelabcom/0
  honor_timestamps: true
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - default
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_label_app]
    separator: ;
    regex: cloudnativelabcomapp
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_endpoint_port_name]
    separator: ;
    regex: http-metrics
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
    separator: ;
    regex: Node;(.*)
    target_label: node
    replacement: ${1}
    action: replace
  - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
    separator: ;
    regex: Pod;(.*)
    target_label: pod
    replacement: ${1}
    action: replace
  - source_labels: [__meta_kubernetes_namespace]
    separator: ;
    regex: (.*)
    target_label: namespace
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: service
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_name]
    separator: ;
    regex: (.*)
    target_label: pod
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: job
    replacement: ${1}
    action: replace
  - separator: ;
    regex: (.*)
    target_label: endpoint
    replacement: http-metrics
    action: replace
3. Manual updates
View the current Prometheus configuration (the same content as Status > Configuration in the Prometheus console):
kubectl get secret -n monitoring prometheus-k8s -o json | jq -r '.data."prometheus.yaml.gz"' | base64 -d | gzip -d
The idea is to base64-decode and gunzip the secret to export prometheus.yaml, edit it, then gzip and base64-encode it again and save it back with kubectl edit secret. An example of the procedure:
kubectl get secret -n monitoring prometheus-k8s -o json | jq -r '.data."prometheus.yaml.gz"' | base64 -d | gzip -d > prometheus.yaml
## Edit prometheus.yaml, e.g. change an existing setting such as the default scrape interval
## or an existing kubernetes_sd_configs entry.
gzip prometheus.yaml
base64 -w 0 prometheus.yaml.gz    ## -w 0 disables line wrapping
kubectl edit secret -n monitoring prometheus-k8s
## Paste the base64 output above as the new value of prometheus.yaml.gz, then save.
## After saving, verify with:
kubectl get secret -n monitoring prometheus-k8s -o json | jq -r '.data."prometheus.yaml.gz"' | base64 -d | gzip -d
# Restart if necessary for the change to take effect:
kubectl replace --force -f prometheus-prometheus.yaml
# or
kubectl delete pod -n monitoring prometheus-k8s-0
Copyright notice: this article is an original work by InfoQ author 云原生实验室.
Original link: http://xie.infoq.cn/article/7d78573bc2df58004347511a4. Please contact the author before reposting.