NVIDIA GPU Monitoring and Observability Best Practices

2024-07-18
Shanghai

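1. DCGM Overview

DCGM (Data Center GPU Manager) is a suite of tools for managing and monitoring Tesla™ GPUs in cluster environments. It includes active health monitoring, comprehensive diagnostics, system alerts, and governance policies covering power and clock management. System administrators can use it standalone, and it integrates easily into cluster management, resource scheduling, and monitoring products from NVIDIA partners. DCGM simplifies GPU administration in the data center, improves resource reliability and uptime, automates administrative tasks, and helps raise overall infrastructure efficiency. DCGM exposes a rich set of GPU monitoring metrics and offers the following features:

GPU behavior monitoring
GPU configuration management
GPU policy management
GPU health and diagnostics
GPU accounting and process-level statistics
NVSwitch configuration and monitoring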

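2. Limitations

The NVIDIA GPU driver on the node must be version 418.87.01 or later. To collect GPU profiling metrics, the driver must be version 450.80.02 or later. For more information about GPU profiling, see Feature Overview.
The node's NVIDIA GPU driver must not belong to the 5XX series (driver versions that start with 5, for example 510.47.03).

To check the installed GPU driver version, log in to the GPU node over SSH and run nvidia-smi.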
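
The version rules above can be checked mechanically. The following is a minimal sketch, not an official tool: the helper names `version_ge` and `check_driver` are hypothetical, and the comparison assumes GNU `sort -V` is available on the node. It only encodes the three constraints stated in this section.

```shell
#!/usr/bin/env bash
# version_ge A B: succeeds when dotted version A >= B.
# Assumption: GNU coreutils `sort -V` (version sort) is installed.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# check_driver VERSION: prints which of the requirements in this section
# the given driver version satisfies (hypothetical helper).
check_driver() {
  case "$1" in
    5[0-9][0-9].*) echo "unsupported: 5XX series"; return 0 ;;
  esac
  if version_ge "$1" 450.80.02; then
    echo "ok: monitoring and profiling"
  elif version_ge "$1" 418.87.01; then
    echo "ok: monitoring only"
  else
    echo "unsupported: older than 418.87.01"
  fi
}

# On a GPU node, query the installed driver version(s) and check each one.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=driver_version --format=csv,noheader | sort -u |
    while read -r v; do check_driver "$v"; done
fi
```

On a machine without nvidia-smi the script prints nothing; `check_driver 470.82.01` can still be called by hand to try the logic.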

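3. Installing DCGM and dcgm-exporter

3.1. Docker method

3.1.1. Installing DCGM

Tip: dcgm-exporter can connect to an existing DCGM agent. In this guide we start DCGM as a new standalone container and connect the exporter to it.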

docker run -d --gpus all --cap-add SYS_ADMIN -p 5556:5555 --name dcgm nvidia/dcgm:2.2.9-ubuntu20.04


3.1.2. Installing dcgm-exporter

docker run -d --gpus all --net host --cap-add SYS_ADMIN \
  -v /root/dcgm-exporter/dcp-metrics-included.csv:/etc/dcgmExporter/dcp-metrics-included.csv \
  --name dcgm-exporter \
  nvcr.io/nvidia/k8s/dcgm-exporter:2.1.8-2.4.0-rc.2-ubuntu20.04 \
  -r localhost:5556 -f /etc/dcgmExporter/dcp-metrics-included.csv


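Before starting the container, copy the dcp-metrics-included.csv file (see Appendix A) to /root/dcgm-exporter/ on the host, so that it exists at /root/dcgm-exporter/dcp-metrics-included.csv.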
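
Because many entries in the metrics file are commented out, it can help to list which fields are actually enabled before mounting it. The sketch below is illustrative: `enabled_fields` is a hypothetical helper, and it assumes the three-column `FIELD, type, help` layout described in the file's own header.

```shell
#!/usr/bin/env bash
# enabled_fields FILE: prints the DCGM field name of every line that is
# not a comment and has at least three comma-separated columns.
enabled_fields() {
  awk -F',' '!/^[[:space:]]*#/ && NF >= 3 {
    gsub(/[[:space:]]/, "", $1)   # tolerate padding such as "DCGM_FI_DEV_DEC_UTIL ,"
    print $1
  }' "$1"
}

# Path taken from the docker command above; skipped if the file is absent.
csv=/root/dcgm-exporter/dcp-metrics-included.csv
if [ -f "$csv" ]; then
  enabled_fields "$csv"
fi
```

Piping the result through `grep -c .` gives a quick count of the metrics the exporter will be asked to collect.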

3.2. Kubernetes method

Label the node that hosts the GPUs.

# Label the GPU node
kubectl label nodes <gpu-node-name> nvidia-gpu=monitoring
# Verify that the label was applied
kubectl get nodes --show-labels


See Appendix B for dcgm-metrics.yaml.

See Appendix C for dcgm-exporter.yaml.

kubectl apply -f dcgm-metrics.yaml
kubectl apply -f dcgm-exporter.yaml
# Check the status of the resources in the monitoring namespace
kubectl -n monitoring get svc,pod -l app.kubernetes.io/name=dcgm-exporter


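4. Verifying That Metrics Are Exposed

Call the dcgm-exporter endpoint to verify that GPU metrics are being collected. Assuming 172.16.0.114 is the IP of the pod or container, the command below returns one record per GPU, so the number of records depends on how many GPUs the node has; the example environment has 8 cards. Note that DCGM_FI_DEV_GPU_UTIL is commented out in the Appendix A metrics file, so uncomment it there if this query returns nothing.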

curl 172.16.0.114:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL

You can verify the profiling metrics in the same way:

curl 172.16.0.114:9400/metrics | grep DCGM_FI_PROF_SM_ACTIVE
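
Since each GPU should contribute one record per metric, counting the sample lines is a quick way to confirm that every card is reporting. The following is a small sketch under that assumption; `count_series` is a hypothetical helper, and the sample text is made up for illustration (real label sets are longer).

```shell
#!/usr/bin/env bash
# count_series METRIC: reads Prometheus text-format metrics on stdin and
# prints how many sample lines belong to METRIC. "# HELP" and "# TYPE"
# lines are skipped because they do not start with "METRIC{".
count_series() {
  awk -v m="$1" 'index($0, m "{") == 1 { n++ } END { print n + 0 }'
}

# Illustrative sample only; real label sets and values will differ.
sample='# HELP DCGM_FI_PROF_SM_ACTIVE The ratio of cycles an SM has at least 1 warp assigned (in %).
# TYPE DCGM_FI_PROF_SM_ACTIVE gauge
DCGM_FI_PROF_SM_ACTIVE{gpu="0"} 0.25
DCGM_FI_PROF_SM_ACTIVE{gpu="1"} 0.00'

printf '%s\n' "$sample" | count_series DCGM_FI_PROF_SM_ACTIVE   # prints 2

# Against a live exporter (address taken from the example above):
#   curl -s 172.16.0.114:9400/metrics | count_series DCGM_FI_PROF_SM_ACTIVE
```

On the 8-card node described above, the live command would be expected to print 8 for each enabled per-GPU metric.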

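5. Installing DataKit

# Replace the token with the actual token of your Guance workspace (available in the Guance console under "Integrations" > "Datakit")
DK_DATAWAY="https://openway.guance.com?token=tkn_xxxxxx" bash -c "$(curl -L https://static.guance.com/datakit/install.sh)"

# After DataKit is installed, enable the prom collector
cd /usr/local/datakit/conf.d/prom
cp prom.conf.sample prom.conf

# Tip: adjust the urls array to match your environment. With the 127 loopback address, the host reported to Guance is the target machine's hostname; with a real IP address, host shows that IP address.
# To make sure all GPU metrics land in the same measurement, set measurement_name = "gpu_dcgm"

# This guide uses the prom collector; the ServiceMonitor auto-discovery feature could be used instead.

# Restart DataKit
datakit service -R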
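
The two settings mentioned in the comments above, urls and measurement_name, might end up looking like the fragment printed by the sketch below. This is an assumption-laden illustration: `render_prom_conf` is a hypothetical helper, the `[[inputs.prom]]` table header and the `http://127.0.0.1:9400/metrics` URL are assumptions based on the exporter port used earlier, and prom.conf.sample contains further options that are not shown here. Treat the sample file as authoritative.

```shell
#!/usr/bin/env bash
# render_prom_conf URL...: prints a minimal prom collector fragment with
# the given exporter URLs and the measurement name used in this guide.
render_prom_conf() {
  local urls="" u
  for u in "$@"; do
    urls="${urls:+$urls, }\"$u\""   # build a comma-separated, quoted list
  done
  printf '[[inputs.prom]]\n  urls = [%s]\n  measurement_name = "gpu_dcgm"\n' "$urls"
}

# Example: a single exporter reachable on the loopback address (assumed URL).
render_prom_conf "http://127.0.0.1:9400/metrics"
```

The output is meant for comparison with the edited prom.conf, not as a drop-in replacement for it.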


6. Results

Appendix

Appendix A

# Format
# If line starts with a '#' it is considered a comment
# DCGM FIELD, Prometheus metric type, help message

# Clocks
DCGM_FI_DEV_SM_CLOCK,  gauge, SM clock frequency (in MHz).
DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).

# Temperature
DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
DCGM_FI_DEV_GPU_TEMP,    gauge, GPU temperature (in C).

# Power
DCGM_FI_DEV_POWER_USAGE,              gauge, Power draw (in W).
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).

# PCIE
# DCGM_FI_DEV_PCIE_TX_THROUGHPUT,  counter, Total number of bytes transmitted through PCIe TX (in KB) via NVML.
# DCGM_FI_DEV_PCIE_RX_THROUGHPUT,  counter, Total number of bytes received through PCIe RX (in KB) via NVML.
DCGM_FI_DEV_PCIE_REPLAY_COUNTER, counter, Total number of PCIe retries.

# Utilization (the sample period varies depending on the product)
# DCGM_FI_DEV_GPU_UTIL,      gauge, GPU utilization (in %).
DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
DCGM_FI_DEV_ENC_UTIL,      gauge, Encoder utilization (in %).
DCGM_FI_DEV_DEC_UTIL ,     gauge, Decoder utilization (in %).

# Errors and violations
DCGM_FI_DEV_XID_ERRORS,            gauge,   Value of the last XID error encountered.
# DCGM_FI_DEV_POWER_VIOLATION,       counter, Throttling duration due to power constraints (in us).
# DCGM_FI_DEV_THERMAL_VIOLATION,     counter, Throttling duration due to thermal constraints (in us).
# DCGM_FI_DEV_SYNC_BOOST_VIOLATION,  counter, Throttling duration due to sync-boost constraints (in us).
# DCGM_FI_DEV_BOARD_LIMIT_VIOLATION, counter, Throttling duration due to board limit constraints (in us).
# DCGM_FI_DEV_LOW_UTIL_VIOLATION,    counter, Throttling duration due to low utilization (in us).
# DCGM_FI_DEV_RELIABILITY_VIOLATION, counter, Throttling duration due to reliability constraints (in us).

# Memory usage
DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).

# ECC
# DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, counter, Total number of single-bit volatile ECC errors.
# DCGM_FI_DEV_ECC_DBE_VOL_TOTAL, counter, Total number of double-bit volatile ECC errors.
# DCGM_FI_DEV_ECC_SBE_AGG_TOTAL, counter, Total number of single-bit persistent ECC errors.
# DCGM_FI_DEV_ECC_DBE_AGG_TOTAL, counter, Total number of double-bit persistent ECC errors.

# Retired pages
# DCGM_FI_DEV_RETIRED_SBE,     counter, Total number of retired pages due to single-bit errors.
# DCGM_FI_DEV_RETIRED_DBE,     counter, Total number of retired pages due to double-bit errors.
# DCGM_FI_DEV_RETIRED_PENDING, counter, Total number of pages pending retirement.

# NVLink
# DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL, counter, Total number of NVLink flow-control CRC errors.
# DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL, counter, Total number of NVLink data CRC errors.
# DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL,   counter, Total number of NVLink retries.
# DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL, counter, Total number of NVLink recovery errors.
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL,            counter, Total number of NVLink bandwidth counters for all lanes.
# DCGM_FI_DEV_NVLINK_BANDWIDTH_L0,               counter, The number of bytes of active NVLink rx or tx data including both header and payload.

# VGPU License status
DCGM_FI_DEV_VGPU_LICENSE_STATUS, gauge, vGPU License status

# Remapped rows
DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for uncorrectable errors
DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS,   counter, Number of remapped rows for correctable errors
DCGM_FI_DEV_ROW_REMAP_FAILURE,           gauge,   Whether remapping of rows has failed

# Static configuration information. These appear as labels on the other metrics
DCGM_FI_DRIVER_VERSION,        label, Driver Version
# DCGM_FI_NVML_VERSION,          label, NVML Version
# DCGM_FI_DEV_BRAND,             label, Device Brand
# DCGM_FI_DEV_SERIAL,            label, Device Serial Number
# DCGM_FI_DEV_OEM_INFOROM_VER,   label, OEM inforom version
# DCGM_FI_DEV_ECC_INFOROM_VER,   label, ECC inforom version
# DCGM_FI_DEV_POWER_INFOROM_VER, label, Power management object inforom version
# DCGM_FI_DEV_INFOROM_IMAGE_VER, label, Inforom image version
# DCGM_FI_DEV_VBIOS_VERSION,     label, VBIOS version of the device

# DCP metrics
DCGM_FI_PROF_GR_ENGINE_ACTIVE,   gauge, Ratio of time the graphics engine is active (in %).
DCGM_FI_PROF_SM_ACTIVE,          gauge, The ratio of cycles an SM has at least 1 warp assigned (in %).
DCGM_FI_PROF_SM_OCCUPANCY,       gauge, The ratio of number of warps resident on an SM (in %).
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active (in %).
DCGM_FI_PROF_DRAM_ACTIVE,        gauge, Ratio of cycles the device memory interface is active sending or receiving data (in %).
# DCGM_FI_PROF_PIPE_FP64_ACTIVE,   gauge, Ratio of cycles the fp64 pipes are active (in %).
# DCGM_FI_PROF_PIPE_FP32_ACTIVE,   gauge, Ratio of cycles the fp32 pipes are active (in %).
# DCGM_FI_PROF_PIPE_FP16_ACTIVE,   gauge, Ratio of cycles the fp16 pipes are active (in %).
DCGM_FI_PROF_PCIE_TX_BYTES,      gauge, The rate of data transmitted over the PCIe bus - including both protocol headers and data payloads - in bytes per second.
DCGM_FI_PROF_PCIE_RX_BYTES,      gauge, The rate of data received over the PCIe bus - including both protocol headers and data payloads - in bytes per second.


Appendix B

# dcgm-metrics.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: dcgm-metrics
  namespace: monitoring
data:
  default-counter.csv: |
    # Format
    # If line starts with a '#' it is considered a comment
    # DCGM FIELD, Prometheus metric type, help message

    # Clocks
    DCGM_FI_DEV_SM_CLOCK,  gauge, SM clock frequency (in MHz).
    DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).

    # Temperature
    DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
    DCGM_FI_DEV_GPU_TEMP,    gauge, GPU temperature (in C).

    # Power
    DCGM_FI_DEV_POWER_USAGE,              gauge, Power draw (in W).
    DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).

    # PCIE
    # DCGM_FI_DEV_PCIE_TX_THROUGHPUT,  counter, Total number of bytes transmitted through PCIe TX (in KB) via NVML.
    # DCGM_FI_DEV_PCIE_RX_THROUGHPUT,  counter, Total number of bytes received through PCIe RX (in KB) via NVML.
    DCGM_FI_DEV_PCIE_REPLAY_COUNTER, counter, Total number of PCIe retries.

    # Utilization (the sample period varies depending on the product)
    # DCGM_FI_DEV_GPU_UTIL,      gauge, GPU utilization (in %).
    DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
    DCGM_FI_DEV_ENC_UTIL,      gauge, Encoder utilization (in %).
    DCGM_FI_DEV_DEC_UTIL ,     gauge, Decoder utilization (in %).

    # Errors and violations
    DCGM_FI_DEV_XID_ERRORS,            gauge,   Value of the last XID error encountered.
    # DCGM_FI_DEV_POWER_VIOLATION,       counter, Throttling duration due to power constraints (in us).
    # DCGM_FI_DEV_THERMAL_VIOLATION,     counter, Throttling duration due to thermal constraints (in us).
    # DCGM_FI_DEV_SYNC_BOOST_VIOLATION,  counter, Throttling duration due to sync-boost constraints (in us).
    # DCGM_FI_DEV_BOARD_LIMIT_VIOLATION, counter, Throttling duration due to board limit constraints (in us).
    # DCGM_FI_DEV_LOW_UTIL_VIOLATION,    counter, Throttling duration due to low utilization (in us).
    # DCGM_FI_DEV_RELIABILITY_VIOLATION, counter, Throttling duration due to reliability constraints (in us).

    # Memory usage
    DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
    DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).

    # ECC
    # DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, counter, Total number of single-bit volatile ECC errors.
    # DCGM_FI_DEV_ECC_DBE_VOL_TOTAL, counter, Total number of double-bit volatile ECC errors.
    # DCGM_FI_DEV_ECC_SBE_AGG_TOTAL, counter, Total number of single-bit persistent ECC errors.
    # DCGM_FI_DEV_ECC_DBE_AGG_TOTAL, counter, Total number of double-bit persistent ECC errors.

    # Retired pages
    # DCGM_FI_DEV_RETIRED_SBE,     counter, Total number of retired pages due to single-bit errors.
    # DCGM_FI_DEV_RETIRED_DBE,     counter, Total number of retired pages due to double-bit errors.
    # DCGM_FI_DEV_RETIRED_PENDING, counter, Total number of pages pending retirement.

    # NVLink
    # DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL, counter, Total number of NVLink flow-control CRC errors.
    # DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL, counter, Total number of NVLink data CRC errors.
    # DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL,   counter, Total number of NVLink retries.
    # DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL, counter, Total number of NVLink recovery errors.
    DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL,            counter, Total number of NVLink bandwidth counters for all lanes.
    # DCGM_FI_DEV_NVLINK_BANDWIDTH_L0,               counter, The number of bytes of active NVLink rx or tx data including both header and payload.

    # VGPU License status
    DCGM_FI_DEV_VGPU_LICENSE_STATUS, gauge, vGPU License status

    # Remapped rows
    DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for uncorrectable errors
    DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS,   counter, Number of remapped rows for correctable errors
    DCGM_FI_DEV_ROW_REMAP_FAILURE,           gauge,   Whether remapping of rows has failed

    # Static configuration information. These appear as labels on the other metrics
    DCGM_FI_DRIVER_VERSION,        label, Driver Version
    # DCGM_FI_NVML_VERSION,          label, NVML Version
    # DCGM_FI_DEV_BRAND,             label, Device Brand
    # DCGM_FI_DEV_SERIAL,            label, Device Serial Number
    # DCGM_FI_DEV_OEM_INFOROM_VER,   label, OEM inforom version
    # DCGM_FI_DEV_ECC_INFOROM_VER,   label, ECC inforom version
    # DCGM_FI_DEV_POWER_INFOROM_VER, label, Power management object inforom version
    # DCGM_FI_DEV_INFOROM_IMAGE_VER, label, Inforom image version
    # DCGM_FI_DEV_VBIOS_VERSION,     label, VBIOS version of the device

    # DCP metrics
    DCGM_FI_PROF_GR_ENGINE_ACTIVE,   gauge, Ratio of time the graphics engine is active (in %).
    DCGM_FI_PROF_SM_ACTIVE,          gauge, The ratio of cycles an SM has at least 1 warp assigned (in %).
    DCGM_FI_PROF_SM_OCCUPANCY,       gauge, The ratio of number of warps resident on an SM (in %).
    DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active (in %).
    DCGM_FI_PROF_DRAM_ACTIVE,        gauge, Ratio of cycles the device memory interface is active sending or receiving data (in %).
    # DCGM_FI_PROF_PIPE_FP64_ACTIVE,   gauge, Ratio of cycles the fp64 pipes are active (in %).
    # DCGM_FI_PROF_PIPE_FP32_ACTIVE,   gauge, Ratio of cycles the fp32 pipes are active (in %).
    # DCGM_FI_PROF_PIPE_FP16_ACTIVE,   gauge, Ratio of cycles the fp16 pipes are active (in %).
    DCGM_FI_PROF_PCIE_TX_BYTES,      gauge, The rate of data transmitted over the PCIe bus - including both protocol headers and data payloads - in bytes per second.
    DCGM_FI_PROF_PCIE_RX_BYTES,      gauge, The rate of data received over the PCIe bus - including both protocol headers and data payloads - in bytes per second.


Appendix C

# dcgm-exporter.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: "dcgm-exporter"
  namespace: "monitoring"      # choose the namespace that fits your environment
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.4.0"
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app.kubernetes.io/name: "dcgm-exporter"
      app.kubernetes.io/version: "2.4.0"
  template:
    metadata:
      labels:
        app.kubernetes.io/name: "dcgm-exporter"
        app.kubernetes.io/version: "2.4.0"
      name: "dcgm-exporter"
    spec:
      containers:
      - image: "nvcr.io/nvidia/k8s/dcgm-exporter:2.1.8-2.4.0-rc.2-ubuntu20.04"
        env:
        - name: "DCGM_EXPORTER_LISTEN"                   # service listen port
          value: ":9400"
        - name: "DCGM_EXPORTER_KUBERNETES"               # map GPU metrics to Kubernetes pods
          value: "true"
        - name: "DCGM_EXPORTER_COLLECTORS"               # read the metric list from the mounted dcgm-metrics ConfigMap
          value: "/etc/dcgm-exporter/default-counter.csv"
        name: "dcgm-exporter"
        ports:
        - name: "metrics"
          containerPort: 9400
        resources:      # adjust the requests and limits to your environment
          limits:
            cpu: '200m'
            memory: '256Mi'
          requests:
            cpu: 100m
            memory: 128Mi
        securityContext:      # the dcgm-exporter container must run in privileged mode
          privileged: true
          runAsNonRoot: false
          runAsUser: 0
        volumeMounts:
        - name: "pod-gpu-resources"
          readOnly: true
          mountPath: "/var/lib/kubelet/pod-resources"
        - name: "gpu-metrics"
          readOnly: true
          mountPath: "/etc/dcgm-exporter"
      volumes:
      - name: "pod-gpu-resources"
        hostPath:
          path: "/var/lib/kubelet/pod-resources"
      - name: "gpu-metrics"
        configMap:
          name: "dcgm-metrics"
      nodeSelector:
        nvidia-gpu: "monitoring"
---
kind: Service
apiVersion: v1
metadata:
  name: "dcgm-exporter"
  namespace: "monitoring"
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.4.0"
spec:
  selector:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.4.0"
  ports:
  - name: "metrics"
    port: 9400

Published by 观测云 (Guance).
