Background: Tesla P40 GPUs are used for online model training in our production environment. We suddenly received a report from the business team that one of the cards seemed to be broken and could not be used. After investigating, we found that programs could not run on that particular card, while all the other GPU devices worked normally. This post records how the problem was diagnosed and fixed.

GPU Card Failure Symptoms

GPU card failure test

The bandwidthTest shown above is one of the official NVIDIA CUDA sample programs; see the earlier post GPU环境的构建 (setting up the GPU environment) for details. Alternatively, you can use the following tensorflow-gpu snippet to test whether the program can detect the GPU device:

import tensorflow as tf
import os

# Expose only the suspect card (id=3) to TensorFlow
os.environ["CUDA_VISIBLE_DEVICES"] = "3"
# Returns the device name (e.g. '/gpu:0') if a GPU is visible and usable
tf.test.gpu_device_name()
Segmentation fault (core dumped)
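
Since a failure on a single card can crash the whole Python process, as the segmentation fault above shows, it can help to probe each card in its own subprocess so that one bad device does not abort the check for the rest. Below is a minimal sketch along those lines; it is my own addition, and the device count of 8 is taken from the nvidia-smi output later in this post.

import os
import subprocess
import sys

NUM_GPUS = 8  # assumption: this server reports 8 attached GPUs
CHECK = "import tensorflow as tf; print(tf.test.gpu_device_name())"

for gpu_id in range(NUM_GPUS):
    # Restrict the child process to a single card so a crash only affects that check.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
    rc = subprocess.call([sys.executable, "-c", CHECK], env=env)
    # On Linux a segfault shows up as a negative return code (-11 for SIGSEGV).
    print("GPU %d: %s" % (gpu_id, "OK" if rc == 0 else "FAILED (rc=%d)" % rc))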

Troubleshooting

Since only one of the cards in the server is affected, we can use the nvidia-smi command to compare the information reported for each card.

1. Use nvidia-smi -q -d PERFORMANCE to check the GPU's performance state

## Comparing the performance output of every GPU, only the card with id=3 shows `SW Power Cap` as Active
# nvidia-smi -q -d PERFORMANCE -i 3
==============NVSMI LOG==============

Timestamp                           : Mon Aug 20 15:08:18 2018
Driver Version                      : 384.81

Attached GPUs                       : 8
GPU 00000000:11:00.0
    Performance State               : P8
    Clocks Throttle Reasons
        Idle                        : Not Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Active
        HW Slowdown                 : Not Active
        Sync Boost                  : Not Active
        SW Thermal Slowdown         : Not Active

An Active entry in the performance output does not by itself indicate a serious problem, but it does show that this card is indeed running slower than the others; see the "Monitoring and Managing GPU Boost" section of manage-your-gpus for details.

The original text reads:

If any of the GPU clocks is running at a slower speed, one or more of the above Clocks Throttle Reasons will be marked as active. The most concerning condition would be if HW Slowdown or Unknown are active, as these would most likely indicate a power or cooling issue. The remaining conditions typically indicate that the card is idle or has been manually set into a slower mode by a system administrator.

Roughly, whenever any GPU clock is running at a lower speed, one or more of the Clocks Throttle Reasons will be marked as active. As long as HW Slowdown and Unknown are not active, the hardware is probably fine, and at least power and cooling are not the issue. The remaining reasons need further checking to see whether the card is simply idle or was manually set into a slower mode by an administrator.
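
For a quick comparison across all the cards, the same throttle-reason bits that nvidia-smi prints can also be read through NVML. The sketch below uses the pynvml package (nvidia-ml-py), which is my own assumption and not part of the original workflow:

import pynvml

pynvml.nvmlInit()
# Throttle-reason bit masks corresponding to the nvidia-smi PERFORMANCE output
REASONS = {
    "Idle": pynvml.nvmlClocksThrottleReasonGpuIdle,
    "Applications Clocks Setting": pynvml.nvmlClocksThrottleReasonApplicationsClocksSetting,
    "SW Power Cap": pynvml.nvmlClocksThrottleReasonSwPowerCap,
    "HW Slowdown": pynvml.nvmlClocksThrottleReasonHwSlowdown,
}
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    mask = pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(handle)
    active = [name for name, bit in REASONS.items() if mask & bit]
    print("GPU %d: %s" % (i, ", ".join(active) if active else "no throttle reason active"))
pynvml.nvmlShutdown()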

From the test at the beginning we already know that the card with id=3 cannot even be detected; since nobody had manually changed its settings and no program was using it, the card clearly does have a problem.

As for why SW Power Cap: Active appears at all, gpu-boost-tesla-k40 contains the following passage:

When the GPU is in a lower performance (idle) state, the GPU clock is fixed. However,
when the GPU is operating in a high performance state (P0), the highest GPU
performance is typically desired. NVIDIA GPU Boost maximizes the GPU performance
by automatically raising the GPU clock when there is thermal and power headroom
available. Likewise, if the power or thermal limit is reached, the GPU clock scales down
to the next available clock setting so that the board remains below the power and
thermal limit.
NVIDIA products that support NVIDIA GPU Boost have multiple high-performance
GPU clocks defined. That is, when the GPU is operating in its high performance mode
(P0 state; determined automatically by the driver software), it has an array of GPU
clocks available.

In other words, the SW Power Cap state can change while a workload is running, as GPU Boost scales the clock to keep the board within its power limit.
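
Since SW Power Cap simply means GPU Boost is holding the clock under the board power limit, one sanity check is to compare the current power draw with the enforced limit. The snippet below is a sketch using pynvml (again an assumption on my part, not part of the original post); NVML reports both values in milliwatts.

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(3)           # the suspect card, id=3
usage = pynvml.nvmlDeviceGetPowerUsage(handle)          # current draw, in mW
limit = pynvml.nvmlDeviceGetEnforcedPowerLimit(handle)  # cap being enforced, in mW
print("GPU 3: %.1f W used / %.1f W limit" % (usage / 1000.0, limit / 1000.0))
pynvml.nvmlShutdown()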

2. Use nvidia-smi -q -d ecc to check the GPU's ECC information

nvidia-smi -i 3 -q -d ecc

==============NVSMI LOG==============

Timestamp                           : Mon Aug 20 15:34:05 2018
Driver Version                      : 384.81

Attached GPUs                       : 8
GPU 00000000:11:00.0
    Ecc Mode
        Current                     : Enabled
        Pending                     : Enabled
    ECC Errors
        Volatile
            Single Bit
                Device Memory       : 0
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : 0
            Double Bit
                Device Memory       : 0
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : 0
        Aggregate
            Single Bit
                Device Memory       : 524
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : 524
            Double Bit
                Device Memory       : 36
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : 36

The card's ECC Errors show non-zero Aggregate counters for Device Memory: 524 single-bit errors and 36 double-bit errors.
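
The same counters can also be read programmatically, which is handy for monitoring every card instead of eyeballing the nvidia-smi output. Here is a sketch using pynvml (my own addition); NVML calls single-bit errors "corrected" and double-bit errors "uncorrected", and keeps separate volatile and aggregate counters.

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(3)  # the suspect card, id=3
sbe = pynvml.nvmlDeviceGetTotalEccErrors(
    handle, pynvml.NVML_MEMORY_ERROR_TYPE_CORRECTED, pynvml.NVML_AGGREGATE_ECC)
dbe = pynvml.nvmlDeviceGetTotalEccErrors(
    handle, pynvml.NVML_MEMORY_ERROR_TYPE_UNCORRECTED, pynvml.NVML_AGGREGATE_ECC)
print("GPU 3 aggregate ECC errors: single-bit=%d, double-bit=%d" % (sbe, dbe))
pynvml.nvmlShutdown()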

nvidia-smi provides an -r option for resetting a GPU; the relevant help text reads:

    -r    --gpu-reset           Trigger reset of the GPU.
                                Can be used to reset the GPU HW state in situations
                                that would otherwise require a machine reboot.
                                Typically useful if a double bit ECC error has
                                occurred.
                                Reset operations are not guarenteed to work in
                                all cases and should be used with caution.

In other words, resetting the GPU is especially useful after a double-bit ECC error has occurred.

3. Try nvidia-smi -r to recover the faulty GPU

# nvidia-smi -r -i 3
Unable to reset GPU 00000000:11:00.0 because it's being used by some other process (e.g. CUDA application, graphics application like X server, monitoring application like other instance of nvidia-smi). Please first kill all processes using this GPU and all compute applications running in the system (even when they are running on other GPUs) and then try to reset the GPU again.

This means that before resetting the GPU, all processes using any GPU on the server should be killed first; after stopping those programs, we proceeded with the reset.
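
To find which processes are holding the GPUs before killing them, the per-card compute process list can be read through NVML (plain nvidia-smi output shows the same information). A sketch with pynvml, which is my own addition:

import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    # Compute processes currently holding a context on this card
    for p in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
        mem_mib = (p.usedGpuMemory or 0) // (1024 * 1024)
        print("GPU %d: pid=%d, gpu_mem=%d MiB" % (i, p.pid, mem_mib))
pynvml.nvmlShutdown()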

Temporary Fix

After temporarily stopping the other programs on the server that were using GPU resources, reset the card with id=3 again:

$ nvidia-smi -r -i 3

Re-run the GPU detection test:

>>> tf.test.gpu_device_name()
2018-08-20 14:52:27.958572: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties:
name: Tesla P40
major: 6 minor: 1 memoryClockRate (GHz) 1.531
pciBusID 0000:11:00.0
Total memory: 22.38GiB
Free memory: 22.21GiB
2018-08-20 14:52:27.958641: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0
2018-08-20 14:52:27.958651: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0:   Y
2018-08-20 14:52:27.958668: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P40, pci bus id: 0000:11:00.0)
u'/gpu:0'
>>>

As shown above, after the reset the GPU device is recognized normally again.

Follow-up

After all this digging, the faulty GPU has been restored for now, but the underlying root cause still needs further investigation, since it involves CUDA as well as the configuration and parameter tuning of the different Tesla product models. To make further troubleshooting easier and to respond quickly to the business, we needed to return the resource to service as soon as possible. It is therefore recommended to reset the ECC error counters to zero, so that any errors that show up later can be tracked cleanly.

Reset the ECC error counters to zero (uncorrectable errors otherwise surface to applications as CUDA_ERROR_ECC_UNCORRECTABLE):

# nvidia-smi -p 1 -i 3
Reset aggregate ECC errors to zero for GPU 00000000:11:00.0.
All done.

# nvidia-smi -i 3 -q -d ecc

==============NVSMI LOG==============

Timestamp                           : Mon Aug 20 15:50:22 2018
Driver Version                      : 384.81

Attached GPUs                       : 8
GPU 00000000:11:00.0
    Ecc Mode
        Current                     : Enabled
        Pending                     : Enabled
    ECC Errors
        Volatile
            Single Bit
                Device Memory       : 0
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : 0
            Double Bit
                Device Memory       : 0
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : 0
        Aggregate
            Single Bit
                Device Memory       : 0
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : 0
            Double Bit
                Device Memory       : 0
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : 0

References

GPU-Power-Cap
gpu-boost-tesla-k40