
Best Practices for NVIDIA GPU Monitoring and Observability

Author: 观测云 (Guance)
  • 2024-07-18
    Shanghai


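1. Introduction to DCGM

DCGM (Data Center GPU Manager) is a suite of tools for managing and monitoring Tesla™ GPUs in cluster environments. It includes active health monitoring, comprehensive diagnostics, system alerts, and governance policies covering power and clock management. System administrators can use it standalone, and it integrates easily into the cluster management, resource scheduling, and monitoring products of NVIDIA partners. DCGM simplifies GPU administration in the data center, improves resource reliability and uptime, automates administrative tasks, and helps raise overall infrastructure efficiency. It provides a rich set of GPU monitoring metrics, with the following features: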


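  • GPU behavior monitoring

  • GPU configuration management

  • GPU policy management

  • GPU health and diagnostics

  • GPU accounting and process-level statistics

  • NVSwitch configuration and monitoring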

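2. Limitations

  • The node's NVIDIA GPU driver version must be ≥ 418.87.01. To collect GPU profiling metrics, the driver version must be ≥ 450.80.02. For more on GPU profiling, see Feature Overview.

  • The node's NVIDIA GPU driver must not be from the 5XX series (versions starting with 5, for example 510.47.03).


To check the installed GPU driver version, log in to the GPU node over SSH and run the nvidia-smi command.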
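
The version limits above can be checked mechanically. The following is a minimal sketch in POSIX shell, not an official tool: the function names version_ge and check_driver are hypothetical, and the thresholds are simply the ones listed in this section.

```shell
# version_ge A B: succeeds when dotted version A >= B, comparing the
# first three fields numerically (so "01" and "1" are treated alike).
version_ge() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    split(a, x, "."); split(b, y, ".")
    for (i = 1; i <= 3; i++) {
      if (x[i] + 0 > y[i] + 0) exit 0
      if (x[i] + 0 < y[i] + 0) exit 1
    }
    exit 0
  }'
}

# check_driver VERSION: print how the version fares against the limits
# stated in this section (hypothetical helper, for illustration only).
check_driver() {
  case "$1" in
    5[0-9][0-9].*) echo "unsupported: 5XX series"; return 0 ;;
  esac
  if ! version_ge "$1" 418.87.01; then
    echo "unsupported: older than 418.87.01"
  elif version_ge "$1" 450.80.02; then
    echo "ok: monitoring and profiling"
  else
    echo "ok: monitoring only"
  fi
}
```

On a GPU node the version string could be taken from nvidia-smi, for example: check_driver "$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"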

3. Installing DCGM and dcgm-exporter

3.1. Docker

3.1.1. Installing DCGM

Tip: dcgm-exporter can connect to an existing DCGM agent. In this guide we start a new standalone DCGM container and point dcgm-exporter at it.


docker run -d --gpus all --cap-add SYS_ADMIN -p 5556:5555 --name dcgm nvidia/dcgm:2.2.9-ubuntu20.04

3.1.2. Installing dcgm-exporter

docker run -d --gpus all --net host --cap-add SYS_ADMIN \
  -v /root/dcgm-exporter/dcp-metrics-included.csv:/etc/dcgmExporter/dcp-metrics-included.csv \
  --name dcgm-exporter \
  nvcr.io/nvidia/k8s/dcgm-exporter:2.1.8-2.4.0-rc.2-ubuntu20.04 \
  -r localhost:5556 -f /etc/dcgmExporter/dcp-metrics-included.csv


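Before starting the container, copy the dcp-metrics-included.csv file (see Appendix A) to /root/dcgm-exporter/ on the host, so that the bind mount above can find it.

3.2. Kubernetes

Label the nodes that host GPUs.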


# Set a label on the node
kubectl label nodes <gpu-node-name> nvidia-gpu=monitoring
# Check that the label was applied
kubectl get nodes --show-labels


For dcgm-metrics.yaml, see Appendix B.


For dcgm-exporter.yaml, see Appendix C.


kubectl apply -f dcgm-metrics.yaml
kubectl apply -f dcgm-exporter.yaml
# Check the status of the resources in the monitoring namespace
kubectl -n monitoring get svc,pod -l app.kubernetes.io/name=dcgm-exporter

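4. Verifying the Exposed Metrics

Call the dcgm-exporter endpoint to verify that GPU metrics are being collected. The command below assumes that 172.16.0.114 is the IP of the pod or container. The number of records returned depends on the number of GPUs; the example environment had 8 cards. Note that the metrics files in Appendix A and Appendix B list DCGM_FI_DEV_GPU_UTIL as a commented-out entry, so if this query returns nothing, uncomment that line or query an enabled metric such as DCGM_FI_DEV_GPU_TEMP.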


curl 172.16.0.114:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL


You can also verify that profiling metrics are being collected:


curl 172.16.0.114:9400/metrics | grep DCGM_FI_PROF_SM_ACTIVE
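
With several GPUs, the grep output is easier to read when reduced to one value per card. The following is a minimal sketch, assuming each sample line carries a gpu="N" label and ends with the sample value (the usual Prometheus text exposition layout); metric_by_gpu is a hypothetical helper name.

```shell
# metric_by_gpu METRIC: read Prometheus text from stdin and print
# "<gpu index> <value>" for every sample of METRIC.
metric_by_gpu() {
  awk -v m="$1" '
    # Only sample lines start with "METRIC{"; "# HELP" and "# TYPE" lines do not.
    index($0, m "{") == 1 {
      gpu = "?"
      # Extract the text between the quotes of gpu="..." (assumed label name).
      if (match($0, /gpu="[^"]*"/)) gpu = substr($0, RSTART + 5, RLENGTH - 6)
      # The sample value is the last whitespace-separated field.
      print gpu, $NF
    }'
}
```

It could be used as: curl -s 172.16.0.114:9400/metrics | metric_by_gpu DCGM_FI_PROF_SM_ACTIVE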


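5. Installing Datakit

# Replace the token with the actual token of your 观测云 workspace
# (available in the 观测云 console under Integrations > Datakit)
DK_DATAWAY="https://openway.guance.com?token=tkn_xxxxxx" bash -c "$(curl -L https://static.guance.com/datakit/install.sh)"

# Once Datakit is installed, enable the prom collector
cd /usr/local/datakit/conf.d/prom
cp prom.conf.sample prom.conf

# Tip: adjust the urls array to suit your setup. With a 127.x loopback address,
# the host reported to 观测云 is the hostname of the target machine; with a real
# IP address, host is shown as that IP address.
# To make sure all GPU metrics are reported to the same measurement, set
# measurement_name = "gpu_dcgm".
# This guide uses the prom collector; the ServiceMonitor auto-discovery feature
# could be used instead.

# Restart Datakit
datakit service -R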
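
The notes above name two settings to edit in prom.conf: the urls array and measurement_name. A minimal sketch of what the edited collector section might look like is rendered by the hypothetical helper below. The source and interval keys and their values are assumptions based on the usual layout of prom.conf.sample, so compare them with the sample file installed on your host before using them.

```shell
# render_prom_conf URL: print a minimal prom collector section to stdout.
# Hypothetical helper; only urls and measurement_name come from this guide.
render_prom_conf() {
  cat <<EOF
[[inputs.prom]]
  urls = ["$1"]
  source = "dcgm"                 # assumed collector alias
  measurement_name = "gpu_dcgm"   # one measurement for all GPU metrics
  interval = "10s"                # assumed scrape interval
EOF
}
```

For example, render_prom_conf http://127.0.0.1:9400/metrics prints a section that could be merged into /usr/local/datakit/conf.d/prom/prom.conf by hand.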

6. Results

Appendix

Appendix A

# Format
# If line starts with a '#' it is considered a comment
# DCGM FIELD, Prometheus metric type, help message

# Clocks
DCGM_FI_DEV_SM_CLOCK, gauge, SM clock frequency (in MHz).
DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).

# Temperature
DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C).

# Power
DCGM_FI_DEV_POWER_USAGE, gauge, Power draw (in W).
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).

# PCIE
# DCGM_FI_DEV_PCIE_TX_THROUGHPUT, counter, Total number of bytes transmitted through PCIe TX (in KB) via NVML.
# DCGM_FI_DEV_PCIE_RX_THROUGHPUT, counter, Total number of bytes received through PCIe RX (in KB) via NVML.
DCGM_FI_DEV_PCIE_REPLAY_COUNTER, counter, Total number of PCIe retries.

# Utilization (the sample period varies depending on the product)
# DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %).
DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
DCGM_FI_DEV_ENC_UTIL, gauge, Encoder utilization (in %).
DCGM_FI_DEV_DEC_UTIL , gauge, Decoder utilization (in %).

# Errors and violations
DCGM_FI_DEV_XID_ERRORS, gauge, Value of the last XID error encountered.
# DCGM_FI_DEV_POWER_VIOLATION, counter, Throttling duration due to power constraints (in us).
# DCGM_FI_DEV_THERMAL_VIOLATION, counter, Throttling duration due to thermal constraints (in us).
# DCGM_FI_DEV_SYNC_BOOST_VIOLATION, counter, Throttling duration due to sync-boost constraints (in us).
# DCGM_FI_DEV_BOARD_LIMIT_VIOLATION, counter, Throttling duration due to board limit constraints (in us).
# DCGM_FI_DEV_LOW_UTIL_VIOLATION, counter, Throttling duration due to low utilization (in us).
# DCGM_FI_DEV_RELIABILITY_VIOLATION, counter, Throttling duration due to reliability constraints (in us).

# Memory usage
DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).

# ECC
# DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, counter, Total number of single-bit volatile ECC errors.
# DCGM_FI_DEV_ECC_DBE_VOL_TOTAL, counter, Total number of double-bit volatile ECC errors.
# DCGM_FI_DEV_ECC_SBE_AGG_TOTAL, counter, Total number of single-bit persistent ECC errors.
# DCGM_FI_DEV_ECC_DBE_AGG_TOTAL, counter, Total number of double-bit persistent ECC errors.

# Retired pages
# DCGM_FI_DEV_RETIRED_SBE, counter, Total number of retired pages due to single-bit errors.
# DCGM_FI_DEV_RETIRED_DBE, counter, Total number of retired pages due to double-bit errors.
# DCGM_FI_DEV_RETIRED_PENDING, counter, Total number of pages pending retirement.

# NVLink
# DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL, counter, Total number of NVLink flow-control CRC errors.
# DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL, counter, Total number of NVLink data CRC errors.
# DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL, counter, Total number of NVLink retries.
# DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL, counter, Total number of NVLink recovery errors.
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL, counter, Total number of NVLink bandwidth counters for all lanes.
# DCGM_FI_DEV_NVLINK_BANDWIDTH_L0, counter, The number of bytes of active NVLink rx or tx data including both header and payload.

# VGPU License status
DCGM_FI_DEV_VGPU_LICENSE_STATUS, gauge, vGPU License status

# Remapped rows
DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for uncorrectable errors
DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for correctable errors
DCGM_FI_DEV_ROW_REMAP_FAILURE, gauge, Whether remapping of rows has failed

# Static configuration information. These appear as labels on the other metrics
DCGM_FI_DRIVER_VERSION, label, Driver Version
# DCGM_FI_NVML_VERSION, label, NVML Version
# DCGM_FI_DEV_BRAND, label, Device Brand
# DCGM_FI_DEV_SERIAL, label, Device Serial Number
# DCGM_FI_DEV_OEM_INFOROM_VER, label, OEM inforom version
# DCGM_FI_DEV_ECC_INFOROM_VER, label, ECC inforom version
# DCGM_FI_DEV_POWER_INFOROM_VER, label, Power management object inforom version
# DCGM_FI_DEV_INFOROM_IMAGE_VER, label, Inforom image version
# DCGM_FI_DEV_VBIOS_VERSION, label, VBIOS version of the device

# DCP metrics
DCGM_FI_PROF_GR_ENGINE_ACTIVE, gauge, Ratio of time the graphics engine is active (in %).
DCGM_FI_PROF_SM_ACTIVE, gauge, The ratio of cycles an SM has at least 1 warp assigned (in %).
DCGM_FI_PROF_SM_OCCUPANCY, gauge, The ratio of number of warps resident on an SM (in %).
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active (in %).
DCGM_FI_PROF_DRAM_ACTIVE, gauge, Ratio of cycles the device memory interface is active sending or receiving data (in %).
# DCGM_FI_PROF_PIPE_FP64_ACTIVE, gauge, Ratio of cycles the fp64 pipes are active (in %).
# DCGM_FI_PROF_PIPE_FP32_ACTIVE, gauge, Ratio of cycles the fp32 pipes are active (in %).
# DCGM_FI_PROF_PIPE_FP16_ACTIVE, gauge, Ratio of cycles the fp16 pipes are active (in %).
DCGM_FI_PROF_PCIE_TX_BYTES, gauge, The rate of data transmitted over the PCIe bus - including both protocol headers and data payloads - in bytes per second.
DCGM_FI_PROF_PCIE_RX_BYTES, gauge, The rate of data received over the PCIe bus - including both protocol headers and data payloads - in bytes per second.
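
Because each entry in this file is a three-field record (DCGM field, metric type, help message), an edited copy can be sanity-checked before it is mounted. The following is a minimal sketch under the assumption that gauge, counter, and label, the three types appearing in this file, are the only types in use; lint_metrics_csv is a hypothetical helper name.

```shell
# lint_metrics_csv: read a metrics CSV from stdin, report malformed
# entries, and print a summary. Exits non-zero if any entry is invalid.
lint_metrics_csv() {
  awk -F',' '
    /^[ \t]*(#|$)/ { next }          # skip comments and blank lines
    {
      type = $2
      gsub(/^[ \t]+|[ \t]+$/, "", type)   # trim padding around the type
      if (NF < 3 || (type != "gauge" && type != "counter" && type != "label")) {
        printf "line %d: bad entry: %s\n", NR, $0
        bad++
      } else {
        ok++
      }
    }
    END { printf "%d enabled, %d invalid\n", ok, bad; exit (bad > 0) }'
}
```

For the Docker setup described earlier, it could be run as: lint_metrics_csv < /root/dcgm-exporter/dcp-metrics-included.csv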

Appendix B

# dcgm-metrics.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: dcgm-metrics
  namespace: monitoring
data:
  default-counter.csv: |
    # Format
    # If line starts with a '#' it is considered a comment
    # DCGM FIELD, Prometheus metric type, help message

    # Clocks
    DCGM_FI_DEV_SM_CLOCK,  gauge, SM clock frequency (in MHz).
    DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).

    # Temperature
    DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
    DCGM_FI_DEV_GPU_TEMP,    gauge, GPU temperature (in C).

    # Power
    DCGM_FI_DEV_POWER_USAGE,              gauge, Power draw (in W).
    DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).

    # PCIE
    # DCGM_FI_DEV_PCIE_TX_THROUGHPUT,  counter, Total number of bytes transmitted through PCIe TX (in KB) via NVML.
    # DCGM_FI_DEV_PCIE_RX_THROUGHPUT,  counter, Total number of bytes received through PCIe RX (in KB) via NVML.
    DCGM_FI_DEV_PCIE_REPLAY_COUNTER, counter, Total number of PCIe retries.

    # Utilization (the sample period varies depending on the product)
    # DCGM_FI_DEV_GPU_UTIL,      gauge, GPU utilization (in %).
    DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
    DCGM_FI_DEV_ENC_UTIL,      gauge, Encoder utilization (in %).
    DCGM_FI_DEV_DEC_UTIL ,     gauge, Decoder utilization (in %).

    # Errors and violations
    DCGM_FI_DEV_XID_ERRORS,            gauge,   Value of the last XID error encountered.
    # DCGM_FI_DEV_POWER_VIOLATION,       counter, Throttling duration due to power constraints (in us).
    # DCGM_FI_DEV_THERMAL_VIOLATION,     counter, Throttling duration due to thermal constraints (in us).
    # DCGM_FI_DEV_SYNC_BOOST_VIOLATION,  counter, Throttling duration due to sync-boost constraints (in us).
    # DCGM_FI_DEV_BOARD_LIMIT_VIOLATION, counter, Throttling duration due to board limit constraints (in us).
    # DCGM_FI_DEV_LOW_UTIL_VIOLATION,    counter, Throttling duration due to low utilization (in us).
    # DCGM_FI_DEV_RELIABILITY_VIOLATION, counter, Throttling duration due to reliability constraints (in us).

    # Memory usage
    DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
    DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).

    # ECC
    # DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, counter, Total number of single-bit volatile ECC errors.
    # DCGM_FI_DEV_ECC_DBE_VOL_TOTAL, counter, Total number of double-bit volatile ECC errors.
    # DCGM_FI_DEV_ECC_SBE_AGG_TOTAL, counter, Total number of single-bit persistent ECC errors.
    # DCGM_FI_DEV_ECC_DBE_AGG_TOTAL, counter, Total number of double-bit persistent ECC errors.

    # Retired pages
    # DCGM_FI_DEV_RETIRED_SBE,     counter, Total number of retired pages due to single-bit errors.
    # DCGM_FI_DEV_RETIRED_DBE,     counter, Total number of retired pages due to double-bit errors.
    # DCGM_FI_DEV_RETIRED_PENDING, counter, Total number of pages pending retirement.

    # NVLink
    # DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL, counter, Total number of NVLink flow-control CRC errors.
    # DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL, counter, Total number of NVLink data CRC errors.
    # DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL,   counter, Total number of NVLink retries.
    # DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL, counter, Total number of NVLink recovery errors.
    DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL,            counter, Total number of NVLink bandwidth counters for all lanes.
    # DCGM_FI_DEV_NVLINK_BANDWIDTH_L0,               counter, The number of bytes of active NVLink rx or tx data including both header and payload.

    # VGPU License status
    DCGM_FI_DEV_VGPU_LICENSE_STATUS, gauge, vGPU License status

    # Remapped rows
    DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for uncorrectable errors
    DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS,   counter, Number of remapped rows for correctable errors
    DCGM_FI_DEV_ROW_REMAP_FAILURE,           gauge,   Whether remapping of rows has failed

    # Static configuration information. These appear as labels on the other metrics
    DCGM_FI_DRIVER_VERSION,        label, Driver Version
    # DCGM_FI_NVML_VERSION,          label, NVML Version
    # DCGM_FI_DEV_BRAND,             label, Device Brand
    # DCGM_FI_DEV_SERIAL,            label, Device Serial Number
    # DCGM_FI_DEV_OEM_INFOROM_VER,   label, OEM inforom version
    # DCGM_FI_DEV_ECC_INFOROM_VER,   label, ECC inforom version
    # DCGM_FI_DEV_POWER_INFOROM_VER, label, Power management object inforom version
    # DCGM_FI_DEV_INFOROM_IMAGE_VER, label, Inforom image version
    # DCGM_FI_DEV_VBIOS_VERSION,     label, VBIOS version of the device

    # DCP metrics
    DCGM_FI_PROF_GR_ENGINE_ACTIVE,   gauge, Ratio of time the graphics engine is active (in %).
    DCGM_FI_PROF_SM_ACTIVE,          gauge, The ratio of cycles an SM has at least 1 warp assigned (in %).
    DCGM_FI_PROF_SM_OCCUPANCY,       gauge, The ratio of number of warps resident on an SM (in %).
    DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active (in %).
    DCGM_FI_PROF_DRAM_ACTIVE,        gauge, Ratio of cycles the device memory interface is active sending or receiving data (in %).
    # DCGM_FI_PROF_PIPE_FP64_ACTIVE,   gauge, Ratio of cycles the fp64 pipes are active (in %).
    # DCGM_FI_PROF_PIPE_FP32_ACTIVE,   gauge, Ratio of cycles the fp32 pipes are active (in %).
    # DCGM_FI_PROF_PIPE_FP16_ACTIVE,   gauge, Ratio of cycles the fp16 pipes are active (in %).
    DCGM_FI_PROF_PCIE_TX_BYTES,      gauge, The rate of data transmitted over the PCIe bus - including both protocol headers and data payloads - in bytes per second.
    DCGM_FI_PROF_PCIE_RX_BYTES,      gauge, The rate of data received over the PCIe bus - including both protocol headers and data payloads - in bytes per second.

Appendix C

# dcgm-exporter.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: "dcgm-exporter"
  namespace: "monitoring"      # choose the namespace that suits your environment
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.4.0"
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app.kubernetes.io/name: "dcgm-exporter"
      app.kubernetes.io/version: "2.4.0"
  template:
    metadata:
      labels:
        app.kubernetes.io/name: "dcgm-exporter"
        app.kubernetes.io/version: "2.4.0"
      name: "dcgm-exporter"
    spec:
      containers:
      - image: "nvcr.io/nvidia/k8s/dcgm-exporter:2.1.8-2.4.0-rc.2-ubuntu20.04"
        env:
        - name: "DCGM_EXPORTER_LISTEN"                   # service port
          value: ":9400"
        - name: "DCGM_EXPORTER_KUBERNETES"               # map Kubernetes metrics to Pods
          value: "true"
        name: "dcgm-exporter"
        ports:
        - name: "metrics"
          containerPort: 9400
        resources:      # set requests and limits to suit your environment
          limits:
            cpu: '200m'
            memory: '256Mi'
          requests:
            cpu: 100m
            memory: 128Mi
        securityContext:      # the dcgm-exporter container needs privileged mode
          privileged: true
          runAsNonRoot: false
          runAsUser: 0
        volumeMounts:
        - name: "pod-gpu-resources"
          readOnly: true
          mountPath: "/var/lib/kubelet/pod-resources"
        - name: "gpu-metrics"
          readOnly: true
          mountPath: "/etc/dcgm-exporter"
      volumes:
      - name: "pod-gpu-resources"
        hostPath:
          path: "/var/lib/kubelet/pod-resources"
      - name: "gpu-metrics"
        configMap:
          name: "dcgm-metrics"
      nodeSelector:
        nvidia-gpu: "monitoring"
---
kind: Service
apiVersion: v1
metadata:
  name: "dcgm-exporter"
  namespace: "monitoring"
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.4.0"
spec:
  selector:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.4.0"
  ports:
  - name: "metrics"
    port: 9400

