Best Practices for Monitoring and Observing NVIDIA GPUs
- 2024-07-18 Shanghai

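1. Introduction to DCGM
DCGM (Data Center GPU Manager) is a suite of tools for managing and monitoring Tesla™ GPUs in cluster environments. It includes active health monitoring, comprehensive diagnostics, system alerts, and governance policies such as power and clock management. System administrators can use it standalone, and it integrates easily into the cluster management, resource scheduling, and monitoring products of NVIDIA partners. DCGM simplifies GPU administration in the data center, improves resource reliability and uptime, automates administrative tasks, and helps improve overall infrastructure efficiency. DCGM provides a rich set of GPU monitoring metrics and offers the following features: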
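GPU behavior monitoring
GPU configuration management
GPU policy management
GPU health and diagnostics
GPU accounting and process-level statistics
NVSwitch configuration and monitoring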
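2. Limitations
The NVIDIA GPU driver on the node must be version 418.87.01 or later. To collect GPU profiling metrics, the driver must be version 450.80.02 or later. For more information about GPU profiling, see Feature Overview.
The NVIDIA GPU driver on the node must not be from the 5XX series (versions that begin with 5, for example 510.47.03).
To check the installed driver version, log in to the GPU node over SSH and run the nvidia-smi command.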
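The driver constraints above can be checked mechanically. The following is a minimal sketch, not an official tool: the function names are hypothetical, the thresholds are copied from the text, and the `nvidia-smi --query-gpu=driver_version --format=csv,noheader` call is only used to read the version on a real GPU node.

```python
import subprocess

# Thresholds taken from the limitations listed above.
MIN_DRIVER = (418, 87, 1)      # minimum for basic monitoring
MIN_PROFILING = (450, 80, 2)   # minimum when GPU profiling is needed


def parse_version(text):
    """Turn a string such as '450.80.02' into (450, 80, 2)."""
    return tuple(int(part) for part in text.strip().split("."))


def check_driver(version, need_profiling=False):
    """Return (ok, reason) for one driver version string."""
    parsed = parse_version(version)
    # The text excludes the 5XX series (versions starting with 5).
    if 500 <= parsed[0] <= 599:
        return False, "5XX-series drivers are not supported"
    required = MIN_PROFILING if need_profiling else MIN_DRIVER
    if parsed < required:
        wanted = ".".join(str(n) for n in required)
        return False, "driver is older than " + wanted
    return True, "ok"


def installed_driver_versions():
    """Read driver versions from nvidia-smi (one line per GPU)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        check=True, capture_output=True, text=True,
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]
```

On a GPU node, `[check_driver(v, need_profiling=True) for v in installed_driver_versions()]` would report whether each GPU's driver satisfies the profiling requirement.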
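3. Installing DCGM and dcgm-exporter
3.1 Docker
3.1.1 Install DCGM
Tip: dcgm-exporter can connect to an existing DCGM agent. In this walkthrough we start a new, standalone DCGM container and connect dcgm-exporter to it. Reference: see the linked documentation.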
docker run -d --gpus all --cap-add SYS_ADMIN -p 5556:5555 --name dcgm nvidia/dcgm:2.2.9-ubuntu20.04
3.1.2 Install dcgm-exporter
docker run -d --gpus all --net host --cap-add SYS_ADMIN -v /root/dcgm-exporter/dcp-metrics-included.csv:/etc/dcgmExporter/dcp-metrics-included.csv --name dcgm-exporter nvcr.io/nvidia/k8s/dcgm-exporter:2.1.8-2.4.0-rc.2-ubuntu20.04 -r localhost:5556 -f /etc/dcgmExporter/dcp-metrics-included.csv
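Before starting the container, copy the dcp-metrics-included.csv file (see Appendix A) to /root/dcgm-exporter/ on the host, because the command above bind-mounts it from that path.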
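Because dcgm-exporter is started with `-r localhost:5556`, it helps to confirm that the published port of the DCGM container accepts connections before troubleshooting the exporter itself. The sketch below is only a TCP reachability probe with a hypothetical function name; a successful connect shows that something is listening on the port, not that it is the DCGM host engine.

```python
import socket


def engine_reachable(host="localhost", port=5556, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection raises OSError on refusal or timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `engine_reachable()` on the Docker host uses the host and port from the commands above.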
3.2 Kubernetes
Label the nodes that have GPUs.
# Label the node
kubectl label nodes <gpu-node-name> nvidia-gpu=monitoring
# Verify that the label was applied
kubectl get nodes --show-labels
dcgm-metrics.yaml is listed in Appendix B.
dcgm-exporter.yaml is listed in Appendix C.
kubectl apply -f dcgm-metrics.yaml
kubectl apply -f dcgm-exporter.yaml
# Check the status of the resources in the monitoring namespace
kubectl -n monitoring get svc,pod -l app.kubernetes.io/name=dcgm-exporter
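4. Verifying the Exposed Metrics
Call the dcgm-exporter endpoint to verify that GPU metrics are being collected. Assume 172.16.0.114 is the IP of the pod/container. The number of records returned depends on the number of GPUs; the example environment has 8 GPUs, so each metric returns 8 records. dcgm-exporter only exports fields that are enabled in the metrics CSV, and DCGM_FI_DEV_GPU_UTIL is commented out in Appendix A, so uncomment it there if this query returns nothing.
curl 172.16.0.114:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL
You can verify that profiling metrics are being collected in the same way: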
curl 172.16.0.114:9400/metrics | grep DCGM_FI_PROF_SM_ACTIVE
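The same check can be scripted by counting how many distinct GPUs report each metric. This is a minimal sketch that parses the Prometheus text format with regular expressions; the function names and the `SAMPLE` payload (including its UUID values) are hypothetical, and it assumes each sample carries a `gpu` label, as dcgm-exporter output normally does.

```python
import re
from collections import defaultdict

# name{labels} value   (the label block is optional)
SAMPLE_LINE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')

# Hypothetical payload in the shape returned by the /metrics endpoint.
SAMPLE = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-hypothetical-0"} 35
DCGM_FI_DEV_GPU_UTIL{gpu="1",UUID="GPU-hypothetical-1"} 0
DCGM_FI_PROF_SM_ACTIVE{gpu="0",UUID="GPU-hypothetical-0"} 0.12
"""


def parse_samples(text):
    """Yield (metric name, labels dict, value) for every sample line."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue
        labels = dict(LABEL.findall(match.group("labels") or ""))
        yield match.group("name"), labels, float(match.group("value"))


def gpus_per_metric(text):
    """Map each metric name to the number of distinct 'gpu' label values."""
    seen = defaultdict(set)
    for name, labels, _value in parse_samples(text):
        seen[name].add(labels.get("gpu", ""))
    return {name: len(gpus) for name, gpus in seen.items()}
```

Feeding the body of `curl 172.16.0.114:9400/metrics` into `gpus_per_metric` should give the GPU count of the node for every enabled metric.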
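5. Installing DataKit
# Replace the token with the actual token of your Guance workspace (available in the Guance console under Integrations > DataKit)
DK_DATAWAY="https://openway.guance.com?token=tkn_xxxxxx" bash -c "$(curl -L https://static.guance.com/datakit/install.sh)"
# After DataKit is installed, enable the prom collector
cd /usr/local/datakit/conf.d/prom
cp prom.conf.sample prom.conf
# Tip: adjust the urls array as needed. With a 127.x loopback address, the host shown in Guance is the hostname of the target machine; with the real IP address, the host is shown as that IP.
# To make sure all GPU metrics are reported to the same measurement, set measurement_name = "gpu_dcgm"
# This walkthrough uses the prom collector; the ServiceMonitor auto-discovery feature can be used instead.
# Restart DataKit
datakit service -R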
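The two settings called out in the tips, `urls` and `measurement_name`, can be rendered as a fragment to compare against your edited prom.conf. This is a sketch under assumptions: the `[[inputs.prom]]` header reflects how prom.conf.sample is usually laid out and should be checked against your own sample file, the function name is hypothetical, and every other option in the sample is left out on purpose.

```python
def render_prom_conf(urls, measurement_name="gpu_dcgm"):
    """Render only the two settings discussed above as a TOML fragment."""
    quoted = ", ".join('"' + url + '"' for url in urls)
    return (
        "[[inputs.prom]]\n"                    # assumed section header
        "  urls = [" + quoted + "]\n"          # dcgm-exporter endpoint(s)
        '  measurement_name = "' + measurement_name + '"\n'
    )


# Port 9400 and the /metrics path come from the verification step above.
print(render_prom_conf(["http://127.0.0.1:9400/metrics"]))
```

Using the loopback address here means the host tag in Guance shows the machine's hostname, as described in the tip.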
6. Results
Appendix
Appendix A
# Format
# If line starts with a '#' it is considered a comment
# DCGM FIELD, Prometheus metric type, help message

# Clocks
DCGM_FI_DEV_SM_CLOCK, gauge, SM clock frequency (in MHz).
DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).

# Temperature
DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C).

# Power
DCGM_FI_DEV_POWER_USAGE, gauge, Power draw (in W).
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).

# PCIE
# DCGM_FI_DEV_PCIE_TX_THROUGHPUT, counter, Total number of bytes transmitted through PCIe TX (in KB) via NVML.
# DCGM_FI_DEV_PCIE_RX_THROUGHPUT, counter, Total number of bytes received through PCIe RX (in KB) via NVML.
DCGM_FI_DEV_PCIE_REPLAY_COUNTER, counter, Total number of PCIe retries.

# Utilization (the sample period varies depending on the product)
# DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %).
DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
DCGM_FI_DEV_ENC_UTIL, gauge, Encoder utilization (in %).
DCGM_FI_DEV_DEC_UTIL, gauge, Decoder utilization (in %).

# Errors and violations
DCGM_FI_DEV_XID_ERRORS, gauge, Value of the last XID error encountered.
# DCGM_FI_DEV_POWER_VIOLATION, counter, Throttling duration due to power constraints (in us).
# DCGM_FI_DEV_THERMAL_VIOLATION, counter, Throttling duration due to thermal constraints (in us).
# DCGM_FI_DEV_SYNC_BOOST_VIOLATION, counter, Throttling duration due to sync-boost constraints (in us).
# DCGM_FI_DEV_BOARD_LIMIT_VIOLATION, counter, Throttling duration due to board limit constraints (in us).
# DCGM_FI_DEV_LOW_UTIL_VIOLATION, counter, Throttling duration due to low utilization (in us).
# DCGM_FI_DEV_RELIABILITY_VIOLATION, counter, Throttling duration due to reliability constraints (in us).

# Memory usage
DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).

# ECC
# DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, counter, Total number of single-bit volatile ECC errors.
# DCGM_FI_DEV_ECC_DBE_VOL_TOTAL, counter, Total number of double-bit volatile ECC errors.
# DCGM_FI_DEV_ECC_SBE_AGG_TOTAL, counter, Total number of single-bit persistent ECC errors.
# DCGM_FI_DEV_ECC_DBE_AGG_TOTAL, counter, Total number of double-bit persistent ECC errors.

# Retired pages
# DCGM_FI_DEV_RETIRED_SBE, counter, Total number of retired pages due to single-bit errors.
# DCGM_FI_DEV_RETIRED_DBE, counter, Total number of retired pages due to double-bit errors.
# DCGM_FI_DEV_RETIRED_PENDING, counter, Total number of pages pending retirement.

# NVLink
# DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL, counter, Total number of NVLink flow-control CRC errors.
# DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL, counter, Total number of NVLink data CRC errors.
# DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL, counter, Total number of NVLink retries.
# DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL, counter, Total number of NVLink recovery errors.
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL, counter, Total number of NVLink bandwidth counters for all lanes.
# DCGM_FI_DEV_NVLINK_BANDWIDTH_L0, counter, The number of bytes of active NVLink rx or tx data including both header and payload.

# VGPU License status
DCGM_FI_DEV_VGPU_LICENSE_STATUS, gauge, vGPU License status

# Remapped rows
DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for uncorrectable errors
DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for correctable errors
DCGM_FI_DEV_ROW_REMAP_FAILURE, gauge, Whether remapping of rows has failed

# Static configuration information. These appear as labels on the other metrics
DCGM_FI_DRIVER_VERSION, label, Driver Version
# DCGM_FI_NVML_VERSION, label, NVML Version
# DCGM_FI_DEV_BRAND, label, Device Brand
# DCGM_FI_DEV_SERIAL, label, Device Serial Number
# DCGM_FI_DEV_OEM_INFOROM_VER, label, OEM inforom version
# DCGM_FI_DEV_ECC_INFOROM_VER, label, ECC inforom version
# DCGM_FI_DEV_POWER_INFOROM_VER, label, Power management object inforom version
# DCGM_FI_DEV_INFOROM_IMAGE_VER, label, Inforom image version
# DCGM_FI_DEV_VBIOS_VERSION, label, VBIOS version of the device

# DCP metrics
DCGM_FI_PROF_GR_ENGINE_ACTIVE, gauge, Ratio of time the graphics engine is active (in %).
DCGM_FI_PROF_SM_ACTIVE, gauge, The ratio of cycles an SM has at least 1 warp assigned (in %).
DCGM_FI_PROF_SM_OCCUPANCY, gauge, The ratio of number of warps resident on an SM (in %).
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active (in %).
DCGM_FI_PROF_DRAM_ACTIVE, gauge, Ratio of cycles the device memory interface is active sending or receiving data (in %).
# DCGM_FI_PROF_PIPE_FP64_ACTIVE, gauge, Ratio of cycles the fp64 pipes are active (in %).
# DCGM_FI_PROF_PIPE_FP32_ACTIVE, gauge, Ratio of cycles the fp32 pipes are active (in %).
# DCGM_FI_PROF_PIPE_FP16_ACTIVE, gauge, Ratio of cycles the fp16 pipes are active (in %).
DCGM_FI_PROF_PCIE_TX_BYTES, gauge, The rate of data transmitted over the PCIe bus - including both protocol headers and data payloads - in bytes per second.
DCGM_FI_PROF_PCIE_RX_BYTES, gauge, The rate of data received over the PCIe bus - including both protocol headers and data payloads - in bytes per second.
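The file format described in its own header (lines starting with '#' are comments; every other line is "DCGM field, Prometheus metric type, help message") is easy to parse, which gives a quick way to list the fields that are actually enabled. The sketch below uses a hypothetical function name and a short excerpt of the file as sample input.

```python
# Short excerpt of the metrics file, used here as sample input.
SAMPLE_CSV = """\
# Utilization (the sample period varies depending on the product)
# DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %).
DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
DCGM_FI_DEV_DEC_UTIL , gauge, Decoder utilization (in %).
"""


def enabled_fields(text):
    """Return (field, metric type, help) for every non-comment line."""
    rows = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or commented-out field
        # Split on the first two commas only; help text may contain commas.
        parts = [part.strip() for part in line.split(",", 2)]
        if len(parts) != 3:
            raise ValueError("malformed line: " + raw)
        rows.append(tuple(parts))
    return rows


for field, metric_type, _help in enabled_fields(SAMPLE_CSV):
    print(field, metric_type)
```

Running this against the full file shows at a glance which fields the exporter will publish, since commented-out fields are skipped.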
Appendix B
# dcgm-metrics.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: dcgm-metrics
  namespace: monitoring
data:
  default-counter.csv: |
    # Format
    # If line starts with a '#' it is considered a comment
    # DCGM FIELD, Prometheus metric type, help message
    # Clocks
    DCGM_FI_DEV_SM_CLOCK, gauge, SM clock frequency (in MHz).
    DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).
    # Temperature
    DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
    DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C).
    # Power
    DCGM_FI_DEV_POWER_USAGE, gauge, Power draw (in W).
    DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).
    # PCIE
    # DCGM_FI_DEV_PCIE_TX_THROUGHPUT, counter, Total number of bytes transmitted through PCIe TX (in KB) via NVML.
    # DCGM_FI_DEV_PCIE_RX_THROUGHPUT, counter, Total number of bytes received through PCIe RX (in KB) via NVML.
    DCGM_FI_DEV_PCIE_REPLAY_COUNTER, counter, Total number of PCIe retries.
    # Utilization (the sample period varies depending on the product)
    # DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %).
    DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
    DCGM_FI_DEV_ENC_UTIL, gauge, Encoder utilization (in %).
    DCGM_FI_DEV_DEC_UTIL, gauge, Decoder utilization (in %).
    # Errors and violations
    DCGM_FI_DEV_XID_ERRORS, gauge, Value of the last XID error encountered.
    # DCGM_FI_DEV_POWER_VIOLATION, counter, Throttling duration due to power constraints (in us).
    # DCGM_FI_DEV_THERMAL_VIOLATION, counter, Throttling duration due to thermal constraints (in us).
    # DCGM_FI_DEV_SYNC_BOOST_VIOLATION, counter, Throttling duration due to sync-boost constraints (in us).
    # DCGM_FI_DEV_BOARD_LIMIT_VIOLATION, counter, Throttling duration due to board limit constraints (in us).
    # DCGM_FI_DEV_LOW_UTIL_VIOLATION, counter, Throttling duration due to low utilization (in us).
    # DCGM_FI_DEV_RELIABILITY_VIOLATION, counter, Throttling duration due to reliability constraints (in us).
    # Memory usage
    DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
    DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).
    # ECC
    # DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, counter, Total number of single-bit volatile ECC errors.
    # DCGM_FI_DEV_ECC_DBE_VOL_TOTAL, counter, Total number of double-bit volatile ECC errors.
    # DCGM_FI_DEV_ECC_SBE_AGG_TOTAL, counter, Total number of single-bit persistent ECC errors.
    # DCGM_FI_DEV_ECC_DBE_AGG_TOTAL, counter, Total number of double-bit persistent ECC errors.
    # Retired pages
    # DCGM_FI_DEV_RETIRED_SBE, counter, Total number of retired pages due to single-bit errors.
    # DCGM_FI_DEV_RETIRED_DBE, counter, Total number of retired pages due to double-bit errors.
    # DCGM_FI_DEV_RETIRED_PENDING, counter, Total number of pages pending retirement.
    # NVLink
    # DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL, counter, Total number of NVLink flow-control CRC errors.
    # DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL, counter, Total number of NVLink data CRC errors.
    # DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL, counter, Total number of NVLink retries.
    # DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL, counter, Total number of NVLink recovery errors.
    DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL, counter, Total number of NVLink bandwidth counters for all lanes.
    # DCGM_FI_DEV_NVLINK_BANDWIDTH_L0, counter, The number of bytes of active NVLink rx or tx data including both header and payload.
    # VGPU License status
    DCGM_FI_DEV_VGPU_LICENSE_STATUS, gauge, vGPU License status
    # Remapped rows
    DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for uncorrectable errors
    DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for correctable errors
    DCGM_FI_DEV_ROW_REMAP_FAILURE, gauge, Whether remapping of rows has failed
    # Static configuration information. These appear as labels on the other metrics
    DCGM_FI_DRIVER_VERSION, label, Driver Version
    # DCGM_FI_NVML_VERSION, label, NVML Version
    # DCGM_FI_DEV_BRAND, label, Device Brand
    # DCGM_FI_DEV_SERIAL, label, Device Serial Number
    # DCGM_FI_DEV_OEM_INFOROM_VER, label, OEM inforom version
    # DCGM_FI_DEV_ECC_INFOROM_VER, label, ECC inforom version
    # DCGM_FI_DEV_POWER_INFOROM_VER, label, Power management object inforom version
    # DCGM_FI_DEV_INFOROM_IMAGE_VER, label, Inforom image version
    # DCGM_FI_DEV_VBIOS_VERSION, label, VBIOS version of the device
    # DCP metrics
    DCGM_FI_PROF_GR_ENGINE_ACTIVE, gauge, Ratio of time the graphics engine is active (in %).
    DCGM_FI_PROF_SM_ACTIVE, gauge, The ratio of cycles an SM has at least 1 warp assigned (in %).
    DCGM_FI_PROF_SM_OCCUPANCY, gauge, The ratio of number of warps resident on an SM (in %).
    DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active (in %).
    DCGM_FI_PROF_DRAM_ACTIVE, gauge, Ratio of cycles the device memory interface is active sending or receiving data (in %).
    # DCGM_FI_PROF_PIPE_FP64_ACTIVE, gauge, Ratio of cycles the fp64 pipes are active (in %).
    # DCGM_FI_PROF_PIPE_FP32_ACTIVE, gauge, Ratio of cycles the fp32 pipes are active (in %).
    # DCGM_FI_PROF_PIPE_FP16_ACTIVE, gauge, Ratio of cycles the fp16 pipes are active (in %).
    DCGM_FI_PROF_PCIE_TX_BYTES, gauge, The rate of data transmitted over the PCIe bus - including both protocol headers and data payloads - in bytes per second.
    DCGM_FI_PROF_PCIE_RX_BYTES, gauge, The rate of data received over the PCIe bus - including both protocol headers and data payloads - in bytes per second.
Appendix C
# dcgm-exporter.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: "dcgm-exporter"
  namespace: "monitoring"  # choose the namespace that fits your environment
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.4.0"
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app.kubernetes.io/name: "dcgm-exporter"
      app.kubernetes.io/version: "2.4.0"
  template:
    metadata:
      labels:
        app.kubernetes.io/name: "dcgm-exporter"
        app.kubernetes.io/version: "2.4.0"
      name: "dcgm-exporter"
    spec:
      containers:
      - image: "nvcr.io/nvidia/k8s/dcgm-exporter:2.1.8-2.4.0-rc.2-ubuntu20.04"
        env:
        - name: "DCGM_EXPORTER_LISTEN"  # service listen port
          value: ":9400"
        - name: "DCGM_EXPORTER_KUBERNETES"  # map metrics to Kubernetes Pods
          value: "true"
        name: "dcgm-exporter"
        ports:
        - name: "metrics"
          containerPort: 9400
        resources:  # adjust requests and limits to your environment
          limits:
            cpu: '200m'
            memory: '256Mi'
          requests:
            cpu: 100m
            memory: 128Mi
        securityContext:  # the dcgm-exporter container must run in privileged mode
          privileged: true
          runAsNonRoot: false
          runAsUser: 0
        volumeMounts:
        - name: "pod-gpu-resources"
          readOnly: true
          mountPath: "/var/lib/kubelet/pod-resources"
        - name: "gpu-metrics"
          readOnly: true
          mountPath: "/etc/dcgm-exporter"
      volumes:
      - name: "pod-gpu-resources"
        hostPath:
          path: "/var/lib/kubelet/pod-resources"
      - name: "gpu-metrics"
        configMap:
          name: "dcgm-metrics"
      nodeSelector:
        nvidia-gpu: "monitoring"
---
kind: Service
apiVersion: v1
metadata:
  name: "dcgm-exporter"
  namespace: "monitoring"
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.4.0"
spec:
  selector:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.4.0"
  ports:
  - name: "metrics"
    port: 9400
观测云