Preface: In [the previous article](https://xxbandy.github.io/2017/10/26/GPU%E7%8E%AF%E5%A2%83%E4%B8%8B%E7%8E%A9%E8%BD%ACDocker-%E4%B8%80/) we prepared the GPU environment on a physical GPU host. This article covers how to use Docker to manage GPU containers.

Letting Docker containers recognize the GPU environment via the NVIDIA driver

Running GPU compute workloads under Docker: NVIDIA provides an official plugin, nvidia-docker, which wraps docker and fills in the parameters needed to bind GPU devices and driver libraries into a container. This article first shows how to run containers with the nvidia-docker plugin so they share the host's GPUs, and then briefly covers how to run a GPU container with plain docker.

Deploying GPU applications with nvidia-docker

The motivation behind nvidia-docker (what the plugin covers for you):

- nvidia plugin installation guide, nvidia-docker usage guide, the nvidia-docker plugin itself
- nvidia driver mounting, GPU isolation, image inspection

Quick deployment and installation

# Install nvidia-docker and nvidia-docker-plugin
$ wget -P /tmp https://github.com/NVIDIA/nvidia-docker/releases/download/v1.0.1/nvidia-docker-1.0.1-1.x86_64.rpm
$ rpm -i /tmp/nvidia-docker*.rpm && rm /tmp/nvidia-docker*.rpm
$ systemctl start nvidia-docker

# Verify that a Docker container can see the host's GPU environment
# nvidia-docker run --rm idockerhub.xxb.com/nvidia-docker/cuda8.0-runtime:centos6-17-10-19 nvidia-smi
Thu Oct 19 08:07:09 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.39                 Driver Version: 375.39                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M40 24GB      On   | 0000:04:00.0     Off |                    0 |
| N/A   28C    P8    18W / 250W |      0MiB / 22939MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M40           On   | 0000:05:00.0     Off |                    0 |
| N/A   31C    P8    17W / 250W |      0MiB / 11443MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla M40 24GB      On   | 0000:06:00.0     Off |                    0 |
| N/A   26C    P8    18W / 250W |      0MiB / 22939MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla M40           On   | 0000:07:00.0     Off |                    0 |
| N/A   27C    P8    16W / 250W |      0MiB / 11443MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

# Bind specific GPU cards into the container (the container below actually sees only the host's first GPU card), i.e. GPU isolation
# NV_GPU=0 nvidia-docker run --rm idockerhub.xxb.com/nvidia-docker/cuda8.0-runtime:centos6-17-10-19 nvidia-smi
Wed Oct 25 07:17:03 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.39                 Driver Version: 375.39                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M40 24GB      On   | 0000:04:00.0     Off |                    0 |
| N/A   23C    P8    17W / 250W |      0MiB / 22939MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

# NV_GPU="0,2" nvidia-docker run --rm -ti idockerhub.xxb.com/nvidia-docker/cuda8.0-runtime:centos6-17-10-19 nvidia-smi
Wed Oct 25 08:37:25 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.39                 Driver Version: 375.39                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M40 24GB      On   | 0000:04:00.0     Off |                    0 |
| N/A   34C    P0    59W / 250W |  21802MiB / 22939MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M40 24GB      On   | 0000:06:00.0     Off |                    0 |
| N/A   34C    P0    57W / 250W |  21800MiB / 22939MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Note 1: as the tables above show, containers created with the nvidia-docker tool really can see the host's GPU devices.

Note 2: GPUs behave much like CPUs here: the runtime cannot guess how many cards a workload needs, so cards must be bound to a container explicitly via NV_GPU (similar to a cpuset); an administrator has to decide which GPU cards each container is allocated.
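As an illustrative sketch (my assumption about the mechanism, not nvidia-docker's actual source), NV_GPU's comma-separated indices are essentially expanded into the `--device` flags plain docker would need for the matching `/dev/nvidiaN` nodes:

```shell
# Hypothetical expansion of NV_GPU into plain-docker --device flags.
# nvidia-docker additionally always binds the control devices
# /dev/nvidiactl and /dev/nvidia-uvm (omitted here for brevity).
NV_GPU="0,2"
DEVICE_FLAGS=$(echo "$NV_GPU" | tr ',' '\n' | xargs -I{} echo "--device /dev/nvidia{}:/dev/nvidia{}")
echo $DEVICE_FLAGS
# --device /dev/nvidia0:/dev/nvidia0 --device /dev/nvidia2:/dev/nvidia2
```

A scheduler or administrator then only has to maintain the NV_GPU assignment per container, exactly as with a cpuset.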

Verifying the GPU environment inside the container

2.1 Testing GPU devices with the official GPU build of the TensorFlow image

# nvidia-docker run -it --rm -v /usr/lib64/libcuda.so.1:/usr/local/nvidia/lib64/libcuda.so.1 idockerhub.xxb.com/jdjr/tensorflow-gpu:17-10-17 bash

root@6b4ad215279e:/notebooks# python
Python 2.7.12 (default, Nov 19 2016, 06:48:10)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
>>> b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
>>> c = tf.matmul(a, b)
>>> sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
2017-10-19 08:01:39.862500: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-19 08:01:39.862600: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-19 08:01:39.862646: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-10-19 08:01:39.862676: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-19 08:01:39.862711: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-10-19 08:01:40.388656: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties:
name: Tesla M40 24GB
major: 5 minor: 2 memoryClockRate (GHz) 1.112
pciBusID 0000:04:00.0
Total memory: 22.40GiB
Free memory: 22.29GiB
2017-10-19 08:01:40.682810: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0x1e2dac0 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2017-10-19 08:01:40.684222: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 1 with properties:
name: Tesla M40
major: 5 minor: 2 memoryClockRate (GHz) 1.112
pciBusID 0000:05:00.0
Total memory: 11.17GiB
Free memory: 11.07GiB
2017-10-19 08:01:40.995170: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0x329a280 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2017-10-19 08:01:40.998560: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 2 with properties:
name: Tesla M40 24GB
major: 5 minor: 2 memoryClockRate (GHz) 1.112
pciBusID 0000:06:00.0
Total memory: 22.40GiB
Free memory: 22.29GiB
2017-10-19 08:01:41.289133: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0x329dc00 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2017-10-19 08:01:41.290444: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 3 with properties:
name: Tesla M40
major: 5 minor: 2 memoryClockRate (GHz) 1.112
pciBusID 0000:07:00.0
Total memory: 11.17GiB
Free memory: 11.07GiB
2017-10-19 08:01:41.294062: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0 1 2 3
2017-10-19 08:01:41.294083: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0:   Y Y Y Y
2017-10-19 08:01:41.294093: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 1:   Y Y Y Y
2017-10-19 08:01:41.294156: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 2:   Y Y Y Y
2017-10-19 08:01:41.294178: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 3:   Y Y Y Y
2017-10-19 08:01:41.294215: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla M40 24GB, pci bus id: 0000:04:00.0)
2017-10-19 08:01:41.294229: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla M40, pci bus id: 0000:05:00.0)
2017-10-19 08:01:41.294239: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:2) -> (device: 2, name: Tesla M40 24GB, pci bus id: 0000:06:00.0)
2017-10-19 08:01:41.294248: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:3) -> (device: 3, name: Tesla M40, pci bus id: 0000:07:00.0)
Device mapping:
/job:localhost/replica:0/task:0/gpu:0 -> device: 0, name: Tesla M40 24GB, pci bus id: 0000:04:00.0
/job:localhost/replica:0/task:0/gpu:1 -> device: 1, name: Tesla M40, pci bus id: 0000:05:00.0
/job:localhost/replica:0/task:0/gpu:2 -> device: 2, name: Tesla M40 24GB, pci bus id: 0000:06:00.0
/job:localhost/replica:0/task:0/gpu:3 -> device: 3, name: Tesla M40, pci bus id: 0000:07:00.0
2017-10-19 08:01:41.875931: I tensorflow/core/common_runtime/direct_session.cc:300] Device mapping:
/job:localhost/replica:0/task:0/gpu:0 -> device: 0, name: Tesla M40 24GB, pci bus id: 0000:04:00.0
/job:localhost/replica:0/task:0/gpu:1 -> device: 1, name: Tesla M40, pci bus id: 0000:05:00.0
/job:localhost/replica:0/task:0/gpu:2 -> device: 2, name: Tesla M40 24GB, pci bus id: 0000:06:00.0
/job:localhost/replica:0/task:0/gpu:3 -> device: 3, name: Tesla M40, pci bus id: 0000:07:00.0

>>> print(sess.run(c))
MatMul: (MatMul): /job:localhost/replica:0/task:0/gpu:0
2017-10-19 08:01:51.333248: I tensorflow/core/common_runtime/simple_placer.cc:872] MatMul: (MatMul)/job:localhost/replica:0/task:0/gpu:0
b: (Const): /job:localhost/replica:0/task:0/gpu:0
2017-10-19 08:01:51.333346: I tensorflow/core/common_runtime/simple_placer.cc:872] b: (Const)/job:localhost/replica:0/task:0/gpu:0
a: (Const): /job:localhost/replica:0/task:0/gpu:0
2017-10-19 08:01:51.333408: I tensorflow/core/common_runtime/simple_placer.cc:872] a: (Const)/job:localhost/replica:0/task:0/gpu:0
[[ 22.  28.]
 [ 49.  64.]]
>>>

Note: the complete output above shows that the container environment correctly recognizes the GPU devices and can perform GPU computation.

2.2 Using the TensorFlow GPU environment

The official tensorflow/tensorflow:latest-gpu image serves a programmable TensorFlow environment through Jupyter.

# nvidia-docker run -itd -p 8888:8888 -p 6006:6006 idockerhub.xxb.com/jdjr/tensorflow-gpu:17-10-17
# docker logs -f evil_ride
[I 07:24:37.730 NotebookApp] Writing notebook server cookie secret to /root/.local/share/jupyter/runtime/notebook_cookie_secret
[W 07:24:37.742 NotebookApp] WARNING: The notebook server is listening on all IP addresses and not using encryption. This is not recommended.
[I 07:24:37.747 NotebookApp] Serving notebooks from local directory: /notebooks
[I 07:24:37.747 NotebookApp] 0 active kernels
[I 07:24:37.747 NotebookApp] The Jupyter Notebook is running at: http://[all ip addresses on your system]:8888/?token=b4f66281d6b1f89bd6fda85c6e88a022ef6d38308e7f284b
[I 07:24:37.748 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 07:24:37.748 NotebookApp]

    Copy/paste this URL into your browser when you connect for the first time,
    to login with a token:
        http://localhost:8888/?token=b4f66281d6b1f89bd6fda85c6e88a022ef6d38308e7f284b

Once it is running successfully, the logs look like the above. Visit the printed URL directly, or open port 8888 on the host and enter the trailing token as the password, to log in to Jupyter and use the TensorFlow environment.
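The token can also be recovered from a detached container's logs, e.g. with `docker logs evil_ride 2>&1 | grep -oE 'token=[0-9a-f]+'`. A sketch of that extraction, run against the startup line shown above:

```shell
# Pull the login token out of the notebook's startup log line
# (the sample line is copied from the log output above).
LOG='[I 07:24:37.744 NotebookApp] The Jupyter Notebook is running at: http://[all ip addresses on your system]:8888/?token=b4f66281d6b1f89bd6fda85c6e88a022ef6d38308e7f284b'
echo "$LOG" | grep -oE 'token=[0-9a-f]+' | cut -d= -f2
# b4f66281d6b1f89bd6fda85c6e88a022ef6d38308e7f284b
```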

Click Python2 under New in the upper-right corner to enter Jupyter's interactive programming environment, where a simple GPU test program can be run. In other words, with the official TensorFlow image you can run GPU compute jobs directly from Jupyter. Once the job completes, the logs recorded during the run look like this:

# docker logs -f evil_ride
[I 07:37:01.806 NotebookApp] Adapting to protocol v5.1 for kernel 253b4ef2-b573-4256-a14b-6a840777923d
2017-10-25 07:37:41.174144: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-25 07:37:41.174295: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-25 07:37:41.174320: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-10-25 07:37:41.174342: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-25 07:37:41.174366: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-10-25 07:37:41.719139: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties:
name: Tesla M40 24GB
major: 5 minor: 2 memoryClockRate (GHz) 1.112
pciBusID 0000:04:00.0
Total memory: 22.40GiB
Free memory: 22.29GiB
2017-10-25 07:37:41.999138: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0x1d90ab0 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2017-10-25 07:37:42.001131: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 1 with properties:
name: Tesla M40
major: 5 minor: 2 memoryClockRate (GHz) 1.112
pciBusID 0000:05:00.0
Total memory: 11.17GiB
Free memory: 11.07GiB
2017-10-25 07:37:42.317011: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0x3acbf10 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2017-10-25 07:37:42.318986: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 2 with properties:
name: Tesla M40 24GB
major: 5 minor: 2 memoryClockRate (GHz) 1.112
pciBusID 0000:06:00.0
Total memory: 22.40GiB
Free memory: 22.29GiB
2017-10-25 07:37:42.612283: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0x3acf890 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2017-10-25 07:37:42.614320: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 3 with properties:
name: Tesla M40
major: 5 minor: 2 memoryClockRate (GHz) 1.112
pciBusID 0000:07:00.0
Total memory: 11.17GiB
Free memory: 11.07GiB
2017-10-25 07:37:42.619214: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0 1 2 3
2017-10-25 07:37:42.619252: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0:   Y Y Y Y
2017-10-25 07:37:42.619270: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 1:   Y Y Y Y
2017-10-25 07:37:42.619287: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 2:   Y Y Y Y
2017-10-25 07:37:42.619351: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 3:   Y Y Y Y
2017-10-25 07:37:42.619386: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla M40 24GB, pci bus id: 0000:04:00.0)
2017-10-25 07:37:42.619410: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla M40, pci bus id: 0000:05:00.0)
2017-10-25 07:37:42.619431: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:2) -> (device: 2, name: Tesla M40 24GB, pci bus id: 0000:06:00.0)
2017-10-25 07:37:42.619464: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:3) -> (device: 3, name: Tesla M40, pci bus id: 0000:07:00.0)
2017-10-25 07:37:56.406493: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla M40 24GB, pci bus id: 0000:04:00.0)
2017-10-25 07:37:56.406573: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla M40, pci bus id: 0000:05:00.0)
2017-10-25 07:37:56.406607: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:2) -> (device: 2, name: Tesla M40 24GB, pci bus id: 0000:06:00.0)
2017-10-25 07:37:56.406627: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:3) -> (device: 3, name: Tesla M40, pci bus id: 0000:07:00.0)
[I 07:39:01.292 NotebookApp] Saving file at /Untitled1.ipynb

The log output above shows that this job actually used all four GPU cards on the host for its computation.
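If a job should not occupy every card the container can see, the CUDA runtime honors the `CUDA_VISIBLE_DEVICES` environment variable, so the visible set can be narrowed per container without rebinding cards via NV_GPU (a sketch; the image name is the one used above):

```shell
# Start the same image but let CUDA inside it see only card 0, e.g.:
#   nvidia-docker run -itd -e CUDA_VISIBLE_DEVICES=0 -p 8888:8888 idockerhub.xxb.com/jdjr/tensorflow-gpu:17-10-17
# CUDA_VISIBLE_DEVICES is an ordinary environment variable read by the
# CUDA runtime; frameworks such as TensorFlow then see only the listed cards.
export CUDA_VISIBLE_DEVICES="0"
echo "$CUDA_VISIBLE_DEVICES"
# 0
```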

Note: Jupyter talks to its Python kernels over WebSocket, so if the service sits behind a reverse proxy, that proxy must support the WebSocket protocol. In our deployment we front it with Nginx, which has supported WebSocket since version 1.3; on nginx >= 1.3 it is enough to add the following to the relevant config file:

upstream tomcat_txxb.wp.com {
        server 10.0.0.10:8888  weight=10 max_fails=2 fail_timeout=30s;
}

location / {
        proxy_next_upstream     http_500 http_502 http_503 http_504 error timeout invalid_header;
        proxy_set_header        Host  $host;
        proxy_set_header        X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_pass              http://tomcat_txxb.wp.com;
}

location ~ /api/kernels/ {
        proxy_pass            http://tomcat_txxb.wp.com;
        proxy_set_header      Host $host;
        # websocket support
        proxy_http_version    1.1;
        proxy_set_header      Upgrade $http_upgrade;
        proxy_set_header      Connection "Upgrade";
        proxy_read_timeout    86400;
}

location ~ /terminals/ {
        proxy_pass            http://tomcat_txxb.wp.com;
        proxy_set_header      Host $host;
        # websocket support
        proxy_http_version    1.1;
        proxy_set_header      Upgrade $http_upgrade;
        proxy_set_header      Connection "Upgrade";
        proxy_read_timeout    86400;
}
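Since the WebSocket directives require nginx >= 1.3, a deployment script may want to gate on the installed version first. A small sketch (in practice you would feed it the output of `nginx -v 2>&1`; here the logic is checked against sample strings):

```shell
# Return success if the nginx version string is at least 1.3
# (sort -V orders version numbers numerically, segment by segment).
nginx_ws_ok() {
  ver=$(echo "$1" | grep -oE '[0-9]+(\.[0-9]+)+')
  [ "$(printf '%s\n' 1.3 "$ver" | sort -V | head -n1)" = "1.3" ]
}

nginx_ws_ok "nginx version: nginx/1.12.2" && echo "websocket ok"
# websocket ok
```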

Running a GPU container with plain docker (it comes down to either mounting the device nodes explicitly or giving the container privileged access to the host's hardware)

# export DEVICES=$(\ls /dev/nvidia* | xargs -I{} echo '--device {}:{}')
# docker run  $DEVICES -it --rm -v /usr/lib64/libcuda.so.1:/usr/local/nvidia/lib64/libcuda.so.1 -v /usr/lib64/libnvidia-fatbinaryloader.so.375.39:/usr/local/nvidia/lib64/libnvidia-fatbinaryloader.so.375.39  -v /root/gpu-example/:/tmp idockerhub.xxb.com/jdjr/tensorflow-gpu:17-10-17 bash


# docker run  --privileged -it --rm -v /usr/lib64/libcuda.so.1:/usr/local/nvidia/lib64/libcuda.so.1 -v /usr/lib64/libnvidia-fatbinaryloader.so.375.39:/usr/local/nvidia/lib64/libnvidia-fatbinaryloader.so.375.39  -v /root/gpu-example/:/tmp idockerhub.xxb.com/jdjr/tensorflow-gpu:17-10-17 bash

Note: the TensorFlow test above only needs a subset of the CUDA-related libraries; many other workloads need more, so it is advisable to mount the full set of driver library dependencies as follows:

# export CUDA_SO="$(\ls /usr/lib64/libcuda* | xargs -I{} echo '-v  {}:{}')  $(\ls /usr/lib64/libnvidia* | xargs -I{} echo '-v {}:{}')"
# docker run $DEVICES $CUDA_SO -it --rm idockerhub.xxb.com/jdjr/tensorflow-gpu:17-10-17 bash

A thought: comparing the nvidia-docker invocation with the plain-docker one, NVIDIA's tool is essentially a plugin that wraps docker and fills in default parameters such as the GPU devices above and the related library mounts. This is also visible in nvidia-docker's underlying implementation:

# curl -s http://localhost:3476/docker/cli
--volume-driver=nvidia-docker --volume=nvidia_driver_375.39:/usr/local/nvidia:ro --device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia-uvm-tools --device=/dev/nvidia0 --device=/dev/nvidia1 --device=/dev/nvidia2 --device=/dev/nvidia3

That is, nvidia-docker adds the parameters above by default when creating a container — the same `-v` and `--device` flags used in the plain docker invocation. You can confirm this by inspecting the configuration of a container created with the nvidia-docker tool:

# docker inspect evil_ride
"HostConfig": {
            "Binds": [
                "nvidia_driver_375.39:/usr/local/nvidia:ro"
            ],
            ........
}

"Devices": [
                {
                    "PathOnHost": "/dev/nvidiactl",
                    "PathInContainer": "/dev/nvidiactl",
                    "CgroupPermissions": "rwm"
                },
                {
                    "PathOnHost": "/dev/nvidia-uvm",
                    "PathInContainer": "/dev/nvidia-uvm",
                    "CgroupPermissions": "rwm"
                },
                {
                    "PathOnHost": "/dev/nvidia-uvm-tools",
                    "PathInContainer": "/dev/nvidia-uvm-tools",
                    "CgroupPermissions": "rwm"
                },
                {
                    "PathOnHost": "/dev/nvidia0",
                    "PathInContainer": "/dev/nvidia0",
                    "CgroupPermissions": "rwm"
                },
                {
                    "PathOnHost": "/dev/nvidia1",
                    "PathInContainer": "/dev/nvidia1",
                    "CgroupPermissions": "rwm"
                },
                {
                    "PathOnHost": "/dev/nvidia2",
                    "PathInContainer": "/dev/nvidia2",
                    "CgroupPermissions": "rwm"
                },
                {
                    "PathOnHost": "/dev/nvidia3",
                    "PathInContainer": "/dev/nvidia3",
                    "CgroupPermissions": "rwm"
                }
            ],


        "Mounts": [
            {
                "Name": "nvidia_driver_375.39",
                "Source": "/var/lib/nvidia-docker/volumes/nvidia_driver/375.39",
                "Destination": "/usr/local/nvidia",
                "Driver": "nvidia-docker",
                "Mode": "ro",
                "RW": false,
                "Propagation": "rprivate"
            }
        ],


        "Env": [
                "PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
                "CUDA_VERSION=8.0.61",
                "NVIDIA_CUDA_VERSION=8.0.61",
                "CUDA_PKG_VERSION=8-0=8.0.61-1",
                "LD_LIBRARY_PATH=/usr/local/cuda/extras/CUPTI/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64",
                "NVIDIA_VISIBLE_DEVICES=all",
                "NVIDIA_DRIVER_CAPABILITIES=compute,utility",
                "LIBRARY_PATH=/usr/local/cuda/lib64/stubs:",
                "CUDNN_VERSION=6.0.21"
            ],


        "Labels": {
                "com.nvidia.build.id": "29071360",
                "com.nvidia.build.ref": "836d5387f8888c3924aff7a011f9b2cd9956d3db",
                "com.nvidia.cuda.version": "8.0.61",
                "com.nvidia.cudnn.version": "6.0.21",
                "com.nvidia.volumes.needed": "nvidia_driver",
                "maintainer": "NVIDIA CORPORATION \u003ccudatools@nvidia.com\u003e"
            }
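Putting this together, nvidia-docker 1.x can be bypassed entirely by splicing the plugin's REST response into a plain `docker run`. A sketch — the args string is canned from the curl output above, since a live call needs nvidia-docker-plugin listening on its default port 3476:

```shell
# In a live deployment you would use:
#   EXTRA_ARGS=$(curl -s http://localhost:3476/docker/cli)
# Canned here with part of the response shown earlier:
EXTRA_ARGS='--volume-driver=nvidia-docker --volume=nvidia_driver_375.39:/usr/local/nvidia:ro --device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia0'
# Unquoted word-splitting turns the response into individual docker flags:
echo docker run $EXTRA_ARGS --rm idockerhub.xxb.com/nvidia-docker/cuda8.0-runtime:centos6-17-10-19 nvidia-smi
```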

Note: the GPU model, the nvidia-driver version, and the nvidia-docker version are strongly coupled to one another:

| GPU | nvidia-driver | nvidia-docker |
|-----|---------------|---------------|
| Tesla M40 | NVIDIA-Linux-x86_64-375.39.run | nvidia-docker-1.0.1-1.x86_64 |
| P40 | NVIDIA-Linux-x86_64-384.81.run | XX |