Getting Hands-On with Docker in a GPU Environment (Part 1)

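Background:

With the rise of big data, artificial intelligence, and machine learning, CPU compute resources can no longer keep up with many workloads, and as hardware has advanced, more and more AI and machine learning work runs on GPUs. Setting up a GPU environment and putting it to practical use, however, still causes a lot of trouble for the people who actually do AI work. This series has three parts: how to set up and deploy a GPU environment, how to manage GPU containers with Docker, and how to schedule GPU containers with Kubernetes.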

From GPU to GPGPU; CPU vs. GPU

GPU Preparation

CUDA Installation and Deployment

CUDA Toolkit download page

NVIDIA official Chinese website

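GPU Driver Installation

Note: a GPU can be recognized by user-space programs only after the corresponding driver is installed on the host, so CUDA needs to be installed first.

See the CUDA installation and deployment links above.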

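System Requirements

To use CUDA on a system, the following are required:

A CUDA-capable GPU
A supported version of the gcc compiler and its toolchain
NVIDIA CUDA Toolkit

Recommended configurations on the Linux x86_64 platform: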

Distribution	Kernel	GCC	GLIBC	ICC	PGI	XLC	CLANG
RHEL 7.X	3.10	4.8.5	2.17	17.0	17.1	NO	3.9
CentOS 7.X	3.10	4.8.5	2.17	17.0	17.1	NO	3.9
RHEL 6.X	2.6.32	4.4.7	2.12	17.0	17.1	NO	3.9
CentOS 6.X	2.6.32	4.4.7	2.12	17.0	17.1	NO	3.9
Fedora 25	4.8.8	6.2.1	2.24-3	17.0	17.1	NO	3.9
OpenSUSE Leap 42.2	4.4.27	4.8	2.22	17.0	17.1	NO	3.9
SLES 12 SP2	4.4.21	4.8.5	2.22	17.0	17.1	NO	3.9
Ubuntu 17.04	4.9.0	6.3.0	2.24-3	17.0	17.1	NO	3.9
Ubuntu 16.04	4.4	5.3.1	2.23	17.0	17.1	NO	3.9

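Pre-installation Steps

Before installing the CUDA Toolkit and the driver, do the following on the GPU host:

Verify that the system has a CUDA-capable GPU
Verify that the system runs one of the supported Linux distributions listed above
Verify that the required gcc compiler is installed
Verify that the correct kernel headers and development packages are installed
Download the NVIDIA CUDA Toolkit
Handle any conflicting installations

2.1 Verify That You Have a CUDA-Capable GPU

The output below shows that this host has four supported GPU devices: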

sh-4.2# lspci | grep -i nvidia
04:00.0 3D controller: NVIDIA Corporation Device 17fd (rev a1)
05:00.0 3D controller: NVIDIA Corporation Device 17fd (rev a1)
06:00.0 3D controller: NVIDIA Corporation Device 17fd (rev a1)
07:00.0 3D controller: NVIDIA Corporation Device 17fd (rev a1)

Note: if the command above prints nothing, update your PCI hardware database (for example with update-pciids) and run it again.
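
For scripted checks, the same detection can be wrapped in a small helper. The sketch below is illustrative only: the function name count_nvidia_devices is made up for this example, and it simply counts the lines that mention NVIDIA in whatever lspci output it is given.

```shell
#!/bin/sh
# Hypothetical helper (not from the original article): count the PCI entries
# whose lspci description mentions NVIDIA. Reads lspci output on stdin.
# Caveat: every NVIDIA PCI function is counted, so a card that also exposes
# an audio function would be counted twice.
count_nvidia_devices() {
    # grep -c prints the number of matching lines; it exits non-zero when the
    # count is 0, so "|| true" keeps the helper usable under "set -e".
    grep -ci 'nvidia' || true
}

# Typical use on a real host:
#   lspci | count_nvidia_devices
```

On the host shown above this would print 4; a result of 0 is the case where the PCI database may need refreshing.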

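Installing the NVIDIA Linux driver. Note: on the nvidia-driver page you can select the driver that matches your GPU model and environment and download it in RPM format. Alternatively, you can download a specific version of the binary (.run) driver from the nvidia-driver-run page. The latter is generally the recommended way to install the driver. The RPM package, however, also bundles the matching CUDA version by default, so the RPM route suits a hands-off install, while the binary route lets you install the CUDA version of your choice yourself.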

sh-4.2# wget -O NVIDIA-Linux-x86_64-375.39.run 'https://www.nvidia.com/content/DriverDownload-March2009/confirmation.php?url=/XFree86/Linux-x86_64/375.39/NVIDIA-Linux-x86_64-375.39.run'
sh-4.2# chmod a+x NVIDIA-Linux-x86_64-375.39.run
# Install in silent mode
sh-4.2# ./NVIDIA-Linux-x86_64-375.39.run -s

# Problems you may run into:
## 1. A kernel-source related error (the kernel and kernel-devel package versions differ, so the installer cannot find the kernel sources)
# Reinstall the kernel packages so that the headers and development files match the running kernel
sh-4.2# yum install kernel-devel-$(uname -r) kernel-headers-$(uname -r) -y
# Remove the kernel-devel package with the wrong version
sh-4.2# rpm -qa | grep kernel-devel
kernel-devel-3.10.0-693.el7.x86_64
kernel-devel-3.10.0-862.el7.x86_64
sh-4.2# rpm -e kernel-devel-3.10.0-862.el7.x86_64
sh-4.2# rpm -qa | grep kernel-devel
kernel-devel-3.10.0-693.el7.x86_64
sh-4.2# ./NVIDIA-Linux-x86_64-396.45.run -s
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 396.45..........

## 2. The installer reports the following error
ERROR: The Nouveau kernel driver is currently in use by your system.  This driver is incompatible with the NVIDIA driver, and must be disabled before proceeding.  Please consult the NVIDIA driver README and your Linux distribution's documentation for details on how to correctly disable the Nouveau kernel driver.

# On CentOS 7 the Nouveau kernel module has to be disabled
sh-4.2# lsmod | grep nouveau
nouveau              1622010  0
video                  24520  1 nouveau
mxm_wmi                13021  1 nouveau
drm_kms_helper        159169  2 mgag200,nouveau
ttm                    99345  2 mgag200,nouveau
drm                   370825  5 ttm,drm_kms_helper,mgag200,nouveau
i2c_algo_bit           13413  3 igb,mgag200,nouveau
i2c_core               40756  8 drm,igb,i2c_i801,ipmi_ssif,drm_kms_helper,mgag200,i2c_algo_bit,nouveau
wmi                    19070  2 mxm_wmi,nouveau

# Unload the module temporarily (it is loaded again after a reboot)
sh-4.2# rmmod nouveau
sh-4.2# ./NVIDIA-Linux-x86_64-375.39.run -s
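
Because rmmod only unloads nouveau until the next boot, a common way to keep it disabled is a modprobe blacklist file. The original article does not cover this step, so the sketch below is an assumption-laden illustration: the function name and the suggested file path are made up here, and the helper does nothing except write the file.

```shell
#!/bin/sh
# Hypothetical helper: write a modprobe configuration that keeps the nouveau
# module from loading. The target path is an argument so the helper can be
# tried against a scratch file first.
write_nouveau_blacklist() {
    conf_file=$1
    cat > "$conf_file" <<'EOF'
blacklist nouveau
options nouveau modeset=0
EOF
}

# Typical use as root on the GPU host (path is an assumption, adjust as needed):
#   write_nouveau_blacklist /etc/modprobe.d/blacklist-nouveau.conf
```

On CentOS 7 the initramfs usually has to be regenerated as well (for example with dracut --force) and the host rebooted before the change takes effect; treat that as a general pointer, not as a step verified in this article.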

Viewing GPU information:

sh-4.2# nvidia-smi
Wed Oct 18 11:58:03 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.39                 Driver Version: 375.39                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M40 24GB      On   | 0000:04:00.0     Off |                    0 |
| N/A   23C    P8    17W / 250W |      0MiB / 22939MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M40           On   | 0000:05:00.0     Off |                    0 |
| N/A   26C    P8    17W / 250W |      0MiB / 11443MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla M40 24GB      On   | 0000:06:00.0     Off |                    0 |
| N/A   22C    P8    17W / 250W |      0MiB / 22939MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla M40           On   | 0000:07:00.0     Off |                    0 |
| N/A   23C    P8    16W / 250W |      0MiB / 11443MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

# Temp: temperature of the GPU device
# Memory-Usage: GPU memory in use / total GPU memory
# GPU-Util: GPU utilization

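Note: the output above also shows that the system has four Tesla M40 GPUs: two cards with 24 GB of memory and two with 12 GB, each at its own PCI bus address (04:00.0 through 07:00.0).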
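
The same inventory can be pulled in a machine-readable form with nvidia-smi's --query-gpu and --format options. The sketch below assumes the CSV layout "index, name, memory" that those options normally produce; the helper name summarize_gpus is invented for this example.

```shell
#!/bin/sh
# Hypothetical helper: summarize GPUs from CSV lines of the form
#   index, name, memory_total_in_MiB
# which is the layout assumed for:
#   nvidia-smi --query-gpu=index,name,memory.total --format=csv,noheader,nounits
summarize_gpus() {
    awk -F', ' '
        { count++; total += $3 }
        END { printf "%d GPUs, %d MiB total\n", count, total }
    '
}

# Typical use:
#   nvidia-smi --query-gpu=index,name,memory.total \
#       --format=csv,noheader,nounits | summarize_gpus
```

A one-line summary like this is handy later when the same host is inspected from inside a container.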

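Note: the host has 256 GB of RAM and two 6-core Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz processors, which gives 24 logical CPUs with Hyper-Threading enabled.

Note: if your GPU is made by NVIDIA and is listed at http://developer.nvidia.com/cuda-gpus, it is CUDA-capable.

Click through to see the compute capability of the different GPU models.

2.2 Verify the Linux Architecture and Distribution Version

The CUDA development tools support only specific Linux distributions, so check the architecture and release of your operating system. The supported Linux versions are listed in the CUDA Toolkit release notes.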

# uname -m && cat /etc/redhat-release
x86_64
CentOS Linux release 7.2.1511 (Core)

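2.3 Verify That gcc Is Installed

The gcc compiler is required for development with the CUDA Toolkit. Most Linux hosts already have gcc installed, but to avoid problems later, check that gcc is present and that its version matches the one listed for your distribution.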

# gcc --version
gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-4)

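2.4 Verify That the Correct Kernel Headers and Development Packages Are Installed

The CUDA driver needs the kernel headers and development packages for the running kernel, both when the driver is installed and whenever it is rebuilt. For example, if your kernel version is 3.17.4-301, the 3.17.4-301 kernel headers and development packages must be installed as well.

The runfile installation performs no package validation. The RPM and Deb driver packages, on the other hand, try to install the kernel headers and development packages if they are missing, but they usually pick the latest version in the repository, which may not match the running kernel. It is therefore best to confirm manually that the correct versions of the kernel headers and development packages are installed before installing the CUDA driver.

On CentOS, run the following commands: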

# uname -r
3.10.0-327.el7.x86_64

# yum install kernel-devel-$(uname -r) kernel-headers-$(uname -r) -y
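
To catch the version mismatch described above before running the installer, the installed package list can be compared with the running kernel. This is a minimal sketch under the assumption that packages are named kernel-devel-<release>, as in the rpm -qa listings shown earlier; the helper name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if a kernel-devel package for the given
# kernel release appears in a package list read from stdin (one package per
# line, as printed by "rpm -qa").
has_matching_kernel_devel() {
    kernel_release=$1
    # -x: match the whole line, -F: fixed string (no regex), -q: quiet.
    grep -qxF "kernel-devel-${kernel_release}"
}

# Typical use:
#   rpm -qa | has_matching_kernel_devel "$(uname -r)" \
#       || echo "kernel-devel does not match the running kernel"
```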

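2.5 Choose an Installation Method

NVIDIA offers two installation methods: distribution-specific packages (RPM and Deb packages) and a distribution-independent package (runfile packages). The former integrates with the distribution's native package management system and is the strongly recommended method.

2.6 Download the NVIDIA CUDA Toolkit

On the download page, choose the version that matches your system: NVIDIA CUDA Toolkit

The page offers two installer types, runfile and rpm, and the rpm type comes in local and network variants. Because our hosts cannot reach the internet directly, we download and install the rpm (local) package.

After the download finishes, verify the file with md5sum to make sure the package is intact. At the time of writing, the checksums file published by NVIDIA was corrupted, so this check could not be performed.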
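
Once a valid checksum is available, the comparison itself is easy to script. The following sketch assumes GNU md5sum is installed; the helper name verify_md5 and the placeholders in the usage line are illustrative, not real file names or digests.

```shell
#!/bin/sh
# Hypothetical helper: compare a file's MD5 digest with an expected value.
# Returns 0 on a match and non-zero otherwise. Requires md5sum (GNU coreutils).
verify_md5() {
    file=$1
    expected=$2
    # md5sum prints "<digest>  <filename>"; keep only the digest.
    actual=$(md5sum "$file" | awk '{ print $1 }')
    [ "$actual" = "$expected" ]
}

# Typical use (both arguments are placeholders):
#   verify_md5 <downloaded-rpm> <md5-published-by-nvidia> && echo "checksum OK"
```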

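2.7 Handle Conflicting Installation Methods

Before installing CUDA, uninstall any previous installation that could conflict with it. The details follow.

Use the commands below to remove the conflicting packages.

Uninstall a Toolkit installed from a runfile: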

/usr/local/cuda-X.Y/bin/uninstall_cuda_X.Y.pl

Uninstall a driver installed from a runfile:

/usr/bin/nvidia-uninstall

Uninstall packages installed from RPM/Deb:

$ yum remove <package_name>                      # Redhat/CentOS
$ dnf remove <package_name>                      # Fedora
$ zypper remove <package_name>                   # OpenSUSE/SLES
$ apt-get --purge remove <package_name>          # Ubuntu

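Package Manager Installation

Quick installation guide

3.1 Installing on Redhat/CentOS

1. Perform the pre-installation steps (sections 2.1 to 2.7).
2. Satisfy the DKMS dependency. The NVIDIA driver RPM packages depend on additional packages such as DKMS and libvdpau. These are not in the system's default repositories and exist only in third-party repositories such as EPEL, so add such a repository before installing the driver, or the missing dependencies will block the installation.
3. If needed, customize the xorg.conf file. The driver relies on an automatically generated xorg.conf file at /etc/X11/xorg.conf, and an existing custom xorg.conf can keep the driver from working correctly. You can delete that file, or add the contents of /etc/X11/xorg.conf.d/00-nvidia.conf to it.
4. Install the repository meta-data.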

# rpm --install cuda-repo-<distro>-<version>.<architecture>.rpm

5. Clean the repository cache.

# yum clean expire-cache

6. Install CUDA.

# yum install cuda

If the i686 libvdpau package fails to install, try the following steps to fix the problem:

# yumdownloader libvdpau.i686
# sudo rpm -U --oldpackage libvdpau*.rpm

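7. If needed, add a symlink for libcuda.so.

The libcuda.so library is installed in the /usr/lib{,64}/nvidia directory. If existing projects need libcuda.so, you can add a symbolic link to it in the /usr/lib{,64} directory.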
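
The symlink step could be scripted roughly as follows. This is a cautious sketch, not the official procedure: the helper name link_libcuda is made up, and it refuses to overwrite anything that already exists at the destination.

```shell
#!/bin/sh
# Hypothetical helper: link libcuda.so from the nvidia subdirectory into its
# parent library directory, e.g. /usr/lib64/nvidia/libcuda.so -> /usr/lib64.
link_libcuda() {
    lib_dir=$1                          # e.g. /usr/lib64
    src="$lib_dir/nvidia/libcuda.so"
    dst="$lib_dir/libcuda.so"
    [ -e "$src" ] || { echo "missing: $src" >&2; return 1; }
    # Leave an existing file or link untouched; otherwise create the symlink.
    [ -e "$dst" ] || ln -s "$src" "$dst"
}

# Typical use as root:
#   link_libcuda /usr/lib64
```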

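8. Perform the post-installation actions.

3.2 Additional Package Manager Capabilities

CUDA core packages

Post-installation Actions

4.1 Mandatory Actions

Some actions must be taken after the installation and before the CUDA Toolkit and driver can be used.

4.1.1 Environment Setup

The PATH environment variable must include /usr/local/cuda-8.0/bin: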

export PATH=/usr/local/cuda-8.0/bin${PATH:+:${PATH}}

Note: when installing with the runfile method, the LD_LIBRARY_PATH environment variable (the dynamic library search path) must also include /usr/local/cuda-8.0/lib64:

$ export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64\
                         ${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
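
These exports only last for the current shell, and repeating them in a login script can pile up duplicate entries. One possible way to keep the list clean is sketched below; the helper name prepend_path and the idea of placing it in a profile script are assumptions, not part of the original setup.

```shell
#!/bin/sh
# Hypothetical helper: print a colon-separated path list with a directory
# prepended, unless that directory is already in the list.
prepend_path() {
    dir=$1
    list=$2
    case ":$list:" in
        *":$dir:"*) printf '%s\n' "$list" ;;                  # already present
        *)          printf '%s\n' "$dir${list:+:$list}" ;;    # prepend
    esac
}

# Typical use in ~/.bashrc or a script under /etc/profile.d/:
#   PATH=$(prepend_path /usr/local/cuda-8.0/bin "$PATH"); export PATH
#   LD_LIBRARY_PATH=$(prepend_path /usr/local/cuda-8.0/lib64 "$LD_LIBRARY_PATH")
#   export LD_LIBRARY_PATH
```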

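4.2 Recommended Actions

4.2.1 Install Writable Samples

To modify, compile, and run the samples, they must be installed with write permissions. An installation script is provided for this: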

# cuda-install-samples-8.0.sh <dir>

The samples under /usr/local/cuda/samples are a read-only copy; the script copies them into <dir>, where they can be modified.

# cuda-install-samples-8.0.sh /export/biaoge/cuda-samples

# tree -L 2 /export/biaoge/cuda-samples/
/export/biaoge/cuda-samples/
└── NVIDIA_CUDA-8.0_Samples
    ├── 0_Simple
    ├── 1_Utilities
    ├── 2_Graphics
    ├── 3_Imaging
    ├── 4_Finance
    ├── 5_Simulations
    ├── 6_Advanced
    ├── 7_CUDALibraries
    ├── bin
    ├── common
    ├── EULA.txt
    ├── Makefile
    └── uninstall_cuda_samples_8.0.pl

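4.2.2 Verify the Installation

Before going any further, it is important to verify that the CUDA Toolkit can find and talk to the GPU hardware. To do so, compile and run some of the sample programs.

(1) Verify the driver version. If you installed the driver, verify that the correct version is loaded. If you did not install the driver, or it is not loaded as a kernel module, you can skip this step for now.

Once the driver is loaded, its version can be read with the following command: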

# cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  375.39  Tue Jan 31 20:47:00 PST 2017
GCC version:  gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC)
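
If the version is needed inside a script, for instance to compare it with the installer that was run, it can be extracted from the NVRM line. The sketch below assumes the line format shown above; other driver releases may format this line differently, and the helper name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: extract the driver version from the "NVRM version"
# line of /proc/driver/nvidia/version (read from stdin). Assumes the layout
# "... Kernel Module  <version>  <build date>" seen in this article.
nvidia_driver_version() {
    sed -n 's/^NVRM version:.*Kernel Module  *\([0-9][0-9.]*\) .*/\1/p'
}

# Typical use:
#   nvidia_driver_version < /proc/driver/nvidia/version
```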

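(2) Compile the samples. The version of the CUDA Toolkit can be checked with nvcc --version (or nvcc -V). The nvcc command runs the compiler driver that compiles CUDA programs: under the hood it calls the gcc compiler for the C code and the NVIDIA PTX compiler for the CUDA code.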

# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Tue_Jan_10_13:22:03_CST_2017
Cuda compilation tools, release 8.0, V8.0.61
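
The release number in the last line is what the versioned paths in this article (for example /usr/local/cuda-8.0) are based on. A small sketch for pulling it out of the nvcc output follows; the helper name nvcc_release is invented, and the pattern assumes the "Cuda compilation tools, release X.Y" wording shown above.

```shell
#!/bin/sh
# Hypothetical helper: extract the toolkit release (e.g. 8.0) from the output
# of "nvcc -V" read on stdin.
nvcc_release() {
    sed -n 's/^Cuda compilation tools, release \([0-9][0-9.]*\),.*/\1/p'
}

# Typical use:
#   nvcc -V | nvcc_release
```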

The NVIDIA CUDA Toolkit ships its sample programs in source form. Compile them by changing to the ~/NVIDIA_CUDA-8.0_Samples directory and running make. The resulting binaries are placed under ~/NVIDIA_CUDA-8.0_Samples/bin.

# cd /export/biaoge/cuda-samples/NVIDIA_CUDA-8.0_Samples/1_Utilities/deviceQuery

# Build the deviceQuery binary, which is used in (3) to verify the environment
# make

# cd /export/biaoge/cuda-samples/NVIDIA_CUDA-8.0_Samples/1_Utilities/bandwidthTest

# Build the bandwidthTest binary, which is used in (3) to verify the environment
# make

# ll ../bandwidthTest/bandwidthTest
-rwxr-xr-x 1 root root 603420 Oct 18 16:05 ../bandwidthTest/bandwidthTest
# ll ../deviceQuery/deviceQuery
-rwxr-xr-x 1 root root 582882 Oct 18 16:44 ../deviceQuery/deviceQuery

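(3) Run the binaries

After compilation, run deviceQuery from the corresponding directory under ~/NVIDIA_CUDA-8.0_Samples. If CUDA is installed and configured correctly, the output of deviceQuery should look like the following.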

# ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 4 CUDA Capable device(s)

Device 0: "Tesla M40 24GB"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    5.2
  Total amount of global memory:                 22940 MBytes (24054136832 bytes)
  (24) Multiprocessors, (128) CUDA Cores/MP:     3072 CUDA Cores
  GPU Max Clock rate:                            1112 MHz (1.11 GHz)
  Memory Clock rate:                             3004 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 3145728 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 4 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "Tesla M40"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    5.2
  Total amount of global memory:                 11443 MBytes (11998855168 bytes)
  (24) Multiprocessors, (128) CUDA Cores/MP:     3072 CUDA Cores
  GPU Max Clock rate:                            1112 MHz (1.11 GHz)
  Memory Clock rate:                             3004 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 3145728 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 5 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 2: "Tesla M40 24GB"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    5.2
  Total amount of global memory:                 22940 MBytes (24054136832 bytes)
  (24) Multiprocessors, (128) CUDA Cores/MP:     3072 CUDA Cores
  GPU Max Clock rate:                            1112 MHz (1.11 GHz)
  Memory Clock rate:                             3004 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 3145728 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 6 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 3: "Tesla M40"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    5.2
  Total amount of global memory:                 11443 MBytes (11998855168 bytes)
  (24) Multiprocessors, (128) CUDA Cores/MP:     3072 CUDA Cores
  GPU Max Clock rate:                            1112 MHz (1.11 GHz)
  Memory Clock rate:                             3004 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 3145728 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 7 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> Peer access from Tesla M40 24GB (GPU0) -> Tesla M40 (GPU1) : Yes
> Peer access from Tesla M40 24GB (GPU0) -> Tesla M40 24GB (GPU2) : Yes
> Peer access from Tesla M40 24GB (GPU0) -> Tesla M40 (GPU3) : Yes
> Peer access from Tesla M40 (GPU1) -> Tesla M40 24GB (GPU0) : Yes
> Peer access from Tesla M40 (GPU1) -> Tesla M40 24GB (GPU2) : Yes
> Peer access from Tesla M40 (GPU1) -> Tesla M40 (GPU3) : Yes
> Peer access from Tesla M40 24GB (GPU2) -> Tesla M40 24GB (GPU0) : Yes
> Peer access from Tesla M40 24GB (GPU2) -> Tesla M40 (GPU1) : Yes
> Peer access from Tesla M40 24GB (GPU2) -> Tesla M40 (GPU3) : Yes
> Peer access from Tesla M40 (GPU3) -> Tesla M40 24GB (GPU0) : Yes
> Peer access from Tesla M40 (GPU3) -> Tesla M40 (GPU1) : Yes
> Peer access from Tesla M40 (GPU3) -> Tesla M40 24GB (GPU2) : Yes

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 4, Device0 = Tesla M40 24GB, Device1 = Tesla M40, Device2 = Tesla M40 24GB, Device3 = Tesla M40
Result = PASS

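Sample program from the official documentation

Note: if a CUDA-capable device and the CUDA driver are both installed but deviceQuery reports that no CUDA-capable devices are present, the /dev/nvidia* files are probably missing or do not have the correct permissions.

You can also turn SELinux off temporarily with setenforce 0 and run the test again.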
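
A quick way to check the device files mentioned above is sketched below. It only looks at whether nvidia* entries exist and whether the current user can read and write them; the helper name check_nvidia_nodes and its output format are invented for this illustration.

```shell
#!/bin/sh
# Hypothetical helper: list nvidia* entries in a device directory and flag
# the ones the current user cannot read and write. Returns non-zero when no
# entry exists or when any entry is inaccessible.
check_nvidia_nodes() {
    dev_dir=${1:-/dev}
    found=0
    bad=0
    for node in "$dev_dir"/nvidia*; do
        [ -e "$node" ] || continue      # an unmatched glob stays literal
        found=1
        if [ -r "$node" ] && [ -w "$node" ]; then
            echo "ok: $node"
        else
            echo "no access: $node"
            bad=1
        fi
    done
    [ "$found" -eq 1 ] && [ "$bad" -eq 0 ]
}

# Typical use:
#   check_nvidia_nodes || echo "check /dev/nvidia* and its permissions"
```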

Run the bandwidthTest program to confirm that the system and the CUDA-capable device can communicate correctly. The output should look like this:

# ./bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Tesla M40 24GB
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			11710.6

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			12464.9

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			210964.2

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

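Sample program from the official documentation

The output above shows that the test passed. If the test does not pass, check that the CUDA-capable NVIDIA GPU is installed correctly in the system.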
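
Both sample programs end with a "Result = PASS" line, which makes them easy to use as a pass/fail check in a provisioning script. The sketch below is one way to do that; the helper name sample_passed is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if a CUDA sample's output (read from
# stdin) contains a line starting with "Result = PASS", as printed by
# deviceQuery and bandwidthTest in the transcripts above.
sample_passed() {
    grep -q '^Result = PASS'
}

# Typical use from the directory that holds the binary:
#   ./deviceQuery | sample_passed && echo "deviceQuery OK"
#   ./bandwidthTest | sample_passed && echo "bandwidthTest OK"
```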

Note: if both sample programs produce the expected output, congratulations: your GPU environment is now ready to use!

4.2.3 Install Nsight Eclipse Plugins

# /usr/local/cuda-9.0/bin/nsight_ee_plugins_manage.sh install <eclipse-dir>

4.3 Optional Actions

Several other actions are not required to use the CUDA Toolkit but provide additional features.

4.3.1 Install Third-Party Libraries

# yum install freeglut-devel libX11-devel libXi-devel libXmu-devel \
    make mesa-libGLU-devel

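4.3.2 Install the Source Code for cuda-gdb

With the runfile installation method, the cuda-gdb source code is installed automatically.

With the RPM or Deb installation methods, the cuda-gdb-src package must be installed to obtain a copy of the cuda-gdb source code. The source is installed as a tarball in the /usr/local/cuda-8.0/extras directory.

Original article link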
