I just realized I had already packaged all of this, so the single command below is enough:
yum install datacenter-gpu-manager omg-gpu-collectd -y
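After that you can view the machine's monitoring directly in Grafana.
Download collectd, build and package
collectd has supported NVIDIA GPUs since 5.8; here we download the latest release and build it with that support enabled.
References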
https://collectd.org/download.shtml
https://developer.nvidia.com/blog/gpu-telemetry-nvidia-dcgm/
wget https://storage.googleapis.com/collectd-tarballs/collectd-5.12.0.tar.bz2
tar -xf collectd-5.12.0.tar.bz2
cd collectd-5.12.0
# Adjust the install path to your environment; the build machine must already have CUDA installed
./configure --prefix /app/golang/collectd/ --enable-gpu-nvidia --with-cuda=/usr/local/cuda/
make && make install
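Before packaging, it is worth confirming that the build actually produced the plugins we need. A minimal sketch: collectd_has_plugin is a made-up helper name, and it assumes plugins land in <prefix>/lib/collectd/ (the default layout when only --prefix is given).

```shell
#!/usr/bin/env bash
# Sketch only: collectd_has_plugin is a hypothetical helper, not part of collectd.
# Assumption: plugins are installed as <prefix>/lib/collectd/<name>.so.

collectd_has_plugin() {
    local prefix="${1%/}" name="$2"
    [ -f "${prefix}/lib/collectd/${name}.so" ]
}

# Example, using the prefix from the configure line above:
#   collectd_has_plugin /app/golang/collectd gpu_nvidia && echo "gpu_nvidia built"
#   collectd_has_plugin /app/golang/collectd python     && echo "python built"
```

The python plugin is worth checking too, since the DCGM script further down is loaded through it.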
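Production usually has more than one machine, and compiling on each of them is tedious. Depending on your setup, you can build an install package and use it on the other machines; the yum (RPM) approach is shown here.
This uses the fpm command; look up its documentation for usage details.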
build.sh
mkdir -p rpm/package
fpm -s dir -t rpm -n omg-collectd -v 1.0 \
--rpm-user root --rpm-group root \
--iteration $(date +"%Y%m%d_%H%M%S") \
-C $(pwd)/collectd \
--description "The XiaoChuanKeJi(IDC) Agent Package Created By SRE." \
-d xiaochuan-supervisor \
--pre-install $(pwd)/rpm/script/pre-install.sh \
--post-install $(pwd)/rpm/script/install.sh \
--post-uninstall $(pwd)/rpm/script/uninstall.sh \
--after-upgrade $(pwd)/rpm/script/install.sh \
--prefix /app/golang/collectd \
-p $(pwd)/rpm/package
Install DCGM
Install whichever DCGM version matches your environment.
yum erase -y datacenter-gpu-manager
rpm -ivh datacenter-gpu-manager-1.7.2-1.x86_64.rpm
systemctl enable --now dcgm.service
mkdir -p /usr/lib64/collectd/dcgm
cp -rp /usr/local/dcgm/bindings/*.py /usr/lib64/collectd/dcgm/
cp -rp /usr/local/dcgm/samples/scripts/dcgm_collectd_plugin.py /usr/lib64/collectd/dcgm/
Update the g_dcgmLibPath path in the script
sed -i -e 's|\(g_dcgmLibPath =\) '"'"'/usr/lib'"'"'|\1 '"'"'/usr/lib64'"'"'|g' /usr/lib64/collectd/dcgm/dcgm_collectd_plugin.py
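To confirm the substitution took effect, the value can be read back from the file. A minimal sketch; dcgm_lib_path is a hypothetical helper name, and it assumes the assignment is written with single quotes, as the sed pattern above expects.

```shell
#!/usr/bin/env bash
# Sketch only: dcgm_lib_path is a hypothetical helper name.
# Prints whatever is currently assigned to g_dcgmLibPath in the given file.

dcgm_lib_path() {
    sed -n "s|.*g_dcgmLibPath = '\([^']*\)'.*|\1|p" "$1"
}

# Example:
#   dcgm_lib_path /usr/lib64/collectd/dcgm/dcgm_collectd_plugin.py
```

If this still prints /usr/lib, the sed pattern did not match and the file needs to be edited by hand.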
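Add the following to types.db; where this file lives depends on where you installed collectd.
If this is the first time InfluxDB is receiving these metrics, add the same entries to InfluxDB's types.db as well.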
### DCGM types
ecc_dbe_aggregate_total value:GAUGE:0:U
ecc_sbe_aggregate_total value:GAUGE:0:U
ecc_dbe_volatile_total value:GAUGE:0:U
ecc_sbe_volatile_total value:GAUGE:0:U
fb_free value:GAUGE:0:U
fb_total value:GAUGE:0:U
fb_used value:GAUGE:0:U
gpu_temp value:GAUGE:U:U
gpu_utilization value:GAUGE:0:100
mem_copy_utilization value:GAUGE:0:100
memory_clock value:GAUGE:0:U
memory_temp value:GAUGE:U:U
nvlink_bandwidth_total value:GAUGE:0:U
nvlink_recovery_error_count_total value:GAUGE:0:U
nvlink_replay_error_count_total value:GAUGE:0:U
pcie_replay_counter value:GAUGE:0:U
pcie_rx_throughput value:GAUGE:0:U
pcie_tx_throughput value:GAUGE:0:U
power_usage value:GAUGE:0:U
power_violation value:GAUGE:0:U
retired_pages_dbe value:GAUGE:0:U
retired_pages_pending value:GAUGE:0:U
retired_pages_sbe value:GAUGE:0:U
sm_clock value:GAUGE:0:U
thermal_violation value:GAUGE:0:U
total_energy_consumption value:GAUGE:0:U
xid_errors value:GAUGE:0:U
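Since the same list has to go into more than one types.db, and possibly more than once across reinstalls, appending it blindly can leave duplicate definitions. A minimal sketch of an idempotent append; append_missing_types is a hypothetical helper name and both paths in the example are assumptions.

```shell
#!/usr/bin/env bash
# Sketch only: append_missing_types is a hypothetical helper name.
# Copies type definitions from SRC into DST, skipping names DST already has.

append_missing_types() {
    local src="$1" dst="$2" line name
    touch "$dst"
    while IFS= read -r line || [ -n "$line" ]; do
        # Skip blank lines and comment lines such as "### DCGM types".
        case "$line" in
            ''|'#'*) continue ;;
        esac
        # The type name is everything before the first whitespace.
        name="${line%%[[:space:]]*}"
        if ! grep -q "^${name}[[:space:]]" "$dst"; then
            printf '%s\n' "$line" >> "$dst"
        fi
    done < "$src"
}

# Example (both paths are assumptions):
#   append_missing_types ./dcgm-types.db /app/golang/collectd/share/collectd/types.db
```

Running it twice leaves the target file unchanged the second time.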
Add the following to collectd.conf
LoadPlugin python
<Plugin python>
ModulePath "/usr/lib64/collectd/dcgm"
LogTraces true
Interactive false
Import "dcgm_collectd_plugin"
</Plugin>
Then restart collectd
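If collectd refuses to start or no DCGM data shows up, the config is the first thing to check. A minimal sketch; conf_has_dcgm_plugin is a hypothetical helper name, the config path in the example is an assumption, and it only recognises the one-line LoadPlugin form used here.

```shell
#!/usr/bin/env bash
# Sketch only: conf_has_dcgm_plugin is a hypothetical helper name.
# Succeeds when the config both loads the python plugin and imports the DCGM module.

conf_has_dcgm_plugin() {
    local conf="$1"
    grep -Eq '^[[:space:]]*LoadPlugin[[:space:]]+"?python"?[[:space:]]*$' "$conf" &&
        grep -Eq '^[[:space:]]*Import[[:space:]]+"dcgm_collectd_plugin"' "$conf"
}

# Example (paths are assumptions; collectd -t only parses the config and exits):
#   conf_has_dcgm_plugin /app/golang/collectd/etc/collectd.conf &&
#       /app/golang/collectd/sbin/collectd -t -C /app/golang/collectd/etc/collectd.conf
```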
If the log contains lines like the following, the collectd side is working
[2021-10-19 19:39:31] uc_update: Value too old: name = sgal-omg-spam-newgpu01/dcgm_collectd-GPU-73dcaf1a-e5fe-b5eb-c417-ba9d4f726761/fb_total-0; value time = 1634643561.000; last cache update = 1634643561.000;
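Rather than scrolling through the log, the DCGM metric names that show up in it can be pulled out directly. A minimal sketch; dcgm_metrics_in_log is a hypothetical helper name, the log path in the example is an assumption, and the pattern is based only on the line format shown above.

```shell
#!/usr/bin/env bash
# Sketch only: dcgm_metrics_in_log is a hypothetical helper name.
# Prints the distinct DCGM metric names that appear in a collectd log file.

dcgm_metrics_in_log() {
    # Grab "dcgm_collectd-GPU-<uuid>/<metric>-<n>", keep <metric>, de-duplicate.
    grep -o 'dcgm_collectd-GPU-[^/]*/[^;]*' "$1" |
        sed -e 's|.*/||' -e 's|-[0-9]*$||' |
        sort -u
}

# Example (log path is an assumption):
#   dcgm_metrics_in_log /var/log/collectd.log
```

An empty result means no dcgm_collectd values have been logged yet.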
yum clean all && yum remove omg-collectd -y && rm -f /app/supervisor/conf/conf.d/collect.conf && superctl update && yum install omg-gpu-collectd -y
yum clean all && yum remove xiaochuan-collectd -y && rm -f /app/supervisor/conf/conf.d/collect.conf && superctl update && yum install xiaochuan-gpu-collectd datacenter-gpu-manager -y
InfluxDB
Add the following to types.db
### DCGM types
ecc_dbe_aggregate_total value:GAUGE:0:U
ecc_sbe_aggregate_total value:GAUGE:0:U
ecc_dbe_volatile_total value:GAUGE:0:U
ecc_sbe_volatile_total value:GAUGE:0:U
fb_free value:GAUGE:0:U
fb_total value:GAUGE:0:U
fb_used value:GAUGE:0:U
gpu_temp value:GAUGE:U:U
gpu_utilization value:GAUGE:0:100
mem_copy_utilization value:GAUGE:0:100
memory_clock value:GAUGE:0:U
memory_temp value:GAUGE:U:U
nvlink_bandwidth_total value:GAUGE:0:U
nvlink_recovery_error_count_total value:GAUGE:0:U
nvlink_replay_error_count_total value:GAUGE:0:U
pcie_replay_counter value:GAUGE:0:U
pcie_rx_throughput value:GAUGE:0:U
pcie_tx_throughput value:GAUGE:0:U
power_usage value:GAUGE:0:U
power_violation value:GAUGE:0:U
retired_pages_dbe value:GAUGE:0:U
retired_pages_pending value:GAUGE:0:U
retired_pages_sbe value:GAUGE:0:U
sm_clock value:GAUGE:0:U
thermal_violation value:GAUGE:0:U
total_energy_consumption value:GAUGE:0:U
xid_errors value:GAUGE:0:U
Check that the following measurement (table) exists and query it; if rows come back, the data is flowing in
dcgm_collectd_value
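A minimal sketch of such a query, for reference. dcgm_query is a hypothetical helper name; it assumes the metric name ends up in the "type" tag and that the database is called collectd, so adjust both to your setup.

```shell
#!/usr/bin/env bash
# Sketch only: dcgm_query is a hypothetical helper name.
# Builds an InfluxQL query for the newest points of one DCGM metric.
# Assumption: the metric name is stored in the "type" tag.

dcgm_query() {
    local metric="$1" limit="${2:-5}"
    printf "SELECT * FROM \"dcgm_collectd_value\" WHERE \"type\" = '%s' ORDER BY time DESC LIMIT %d\n" \
        "$metric" "$limit"
}

# Example with the InfluxDB 1.x CLI (database name "collectd" is an assumption):
#   influx -database collectd -execute "$(dcgm_query fb_used)"
```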
Grafana
Create a dashboard and add some panels
Or just download GPU-DASHBOARD.json and import it


Not sure whether 川皇 will let me upload it