diff --git a/README.md b/README.md
index 6820c02..e3ca623 100644
--- a/README.md
+++ b/README.md
@@ -6,7 +6,7 @@ gpu-mon is an [open-falcon](http://open-falcon.com/) GPU status monitoring
 
 ### Metrics
 
-1. For a detailed description of the metrics, see the [metric](https://github.com/open-falcon/gpu-mon/metric) file. Some commonly used metrics are described below:
+1. For a detailed description of the metrics, see the [metrics](https://github.com/open-falcon/gpu-mon/blob/master/metrics) file. Some commonly used metrics are described below:
 
 ```plain
 GPUUtils GPU utilization (%)
@@ -27,13 +27,13 @@ gpu-mon is an [open-falcon](http://open-falcon.com/) GPU status monitoring
 
 ### 1. Dependencies
 
-1. Install dcgm (version 1.4.2) and start the nv-hostengine process
+1. Install DCGM and start the nv-hostengine process
 2. GPU models that currently support the full feature set of DCGM 1.4.2 include:
-   - Tesla GPUs: K80 and later
-   - Non-Tesla GPUs: Maxwell and newer
+   - Tesla GPUs: K80 and later
+   - Non-Tesla GPUs: Maxwell and newer
 
-   For the GPU models supported by Dcgm and for DCGM installation, see [(DCGM) NVIDIA Data Center GPU Manager](https://developer.nvidia.com/data-center-gpu-manager-dcgm)
-3. GPU models the plugin has been tested on: v100, p4, p40.
+   For the GPU models supported by Dcgm and for DCGM installation, see [(DCGM) NVIDIA Data Center GPU Manager](https://developer.nvidia.com/data-center-gpu-manager-dcgm)
+3. GPU models the plugin has been tested on: v100, p4, p40; testing used DCGM version 1.4.2.
 
 ### 2. Installation and Usage
 
diff --git a/metrics b/metrics
index 7892343..24bd305 100644
--- a/metrics
+++ b/metrics
@@ -13,16 +13,16 @@ Tx MB PCIe Tx utilization information
 Replays PCIe replay counter
 Performance Performance state (P-State) 0-15. 0=highest
 FanSpeed % Fan speed for the device in percent 0-100
-PowerUsed W Power usage for the device in Watts
+PowerUsed W Power usage for the device in Watts
 DeviceTemperature °C Current temperature readings for the device, in degrees C
 MemTemperature °C Memory temperature for the device
 SlowdownTemperature °C Slowdown temperature for the device
 ShutdownTemperature °C Shutdown temperature for the device Modules
-PowerCurrentLimit W Current Power limit for the device
-PowerMinManLimit W Minimum power management limit for the device
-PowerMaxManLimit W Maximum power management limit for the device
-PowerDefaultManLimit W Default power management limit for the device
-PowerEnforcedLimit W Effective power limit that the driver enforces after taking into account all limiters
+PowerCurrentLimit W Current Power limit for the device
+PowerMinManLimit W Minimum power management limit for the device
+PowerMaxManLimit W Maximum power management limit for the device
+PowerDefaultManLimit W Default power management limit for the device
+PowerEnforcedLimit W Effective power limit that the driver enforces after taking into account all limiters
 PowerViolationTime W Power Violation time in usec
 FBtotal MB Total Frame Buffer of the GPU in MB
 FBfree MB Free Frame Buffer in MB
@@ -39,4 +39,4 @@ DeviceMemSBErrors Device memory single bit volatile ECC errors
 DeviceMemDBErrors Device memory double bit volatile ECC errors
 RegisterSBErrors Register file single bit volatile ECC errors
 RegisterDBErrors Register file double bit volatile ECC errors
-DcgmSupported supported 1, not supported -1
\ No newline at end of file
+DcgmSupported Support Dcgm 1, not support -1
\ No newline at end of file