Problems getting tensorflow to install with GPU support


#1

I was experimenting with setting up a Flight cluster with GPU nodes and ran into difficulties whilst trying to do a gridware install of tensorflow.

I had previously installed libs/libgpuarray so the basic python and nvidia-cuda environments were there, but I got the following error:

alces gridware install tensorflow/1.6.0 --variant=cuda
Preparing to install main/apps/tensorflow/1.6.0

Required parameters (param=value)

  cuda_vars: Path to a config file for CUDA GPU build [if building with CUDA, specify path to vars file or 'info' for details]

ERROR: No values specified for required parameters: cuda_vars

I could not find an appropriate file name under the nvidia-cuda installation, so I was at a bit of a loss on how to proceed.


#2

Hi @cliffaddison!

The cuda_vars file contains some variables that allow Tensorflow to build with CUDA support. You can see further information on generating the cuda_vars file and the required format by running alces gridware install main/apps/tensorflow/1.6.0 --variant=cuda cuda_vars=info. The file should set the following environment variables (a filled-in example follows the list):

CUDNN_ROOT=</path/to/cudnn/root>
CUDNN_VERSION=<cuDNN version e.g. 7.0.5>
CUDA_VERSION=<CUDA version, e.g. 8.0.61>
CUDA_COMPUTE_CAPABILITIES=<CUDA Capabilities of the instance GPU, e.g. 3.0>
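
For example, a filled-in file, assuming cuDNN 7.0.5 for CUDA 8.0 unpacked under /home/alces/cudnn (a hypothetical location) on a machine with a Tesla K80 (compute capability 3.7), might look like:

CUDNN_ROOT=/home/alces/cudnn/7.0.5_cuda8.0
CUDNN_VERSION=7.0.5
CUDA_VERSION=8.0.61
CUDA_COMPUTE_CAPABILITIES=3.7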

If you want you can take a look at the metadata.yml file for this installation - the upstream version is located here: https://github.com/alces-software/gridware-packages-main/blob/master/pkg/apps/tensorflow/1.6.0/metadata.yml

Hope this is useful!

David


#3

Right. I am making progress, but there does not seem to be a gridware installation of cuDNN, so I needed to copy over the installation from our local Barkla cluster. I then created the cuda_vars file with the four lines mentioned above, and it is currently compiling. I would have hoped the gridware installations would be more complete.
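
For reference, the copy itself was just a recursive transfer of the cuDNN tree, along these lines (a sketch; "barkla" is a placeholder for the cluster's real hostname, run from the AWS instance):

mkdir -p /home/alces/cudnn
# "barkla" stands in for the real hostname of the source cluster
rsync -a barkla:/opt/apps/alces/nvidia/cudnn/7.0.5_cuda8.0 /home/alces/cudnn/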

Cliff


#4

Hi Cliff,

Gridware is an open source project, so feel free to help us make it more complete by contributing your additions in a PR.

Cheers

Steve


#5

Happy to have the cuda-enabled gridware software from Liverpool Barkla made available more widely.

Also happy to give information on bugs etc. beyond “it doesn’t work”.

Case in point. The tensorflow build on AWS dies:

The output is (I can send the log file):

alces gridware install apps/tensorflow/1.6.0 --variant=cuda cuda_vars=$HOME/cuda_vars
Preparing to install main/apps/tensorflow/1.6.0

WARNING: Package requires the installation of the following:
  main/apps/python/2.7.8, main/libs/atlas/3.10.3, main/apps/patchelf/0.9, main/apps/setuptools/24.0.1, main/libs/numpy/1.10.4, main/apps/pip/9.0.1,
main/apps/bazel/0.11.0

Install these dependencies first?

Proceed (Y/N)? Y
....
Installing main/apps/tensorflow/1.6.0 (cuda)

 > Preparing package sources
        Download --> v1.6.0.tar.gz ... OK
          Verify --> v1.6.0.tar.gz ... OK

 > Preparing for installation
           Mkdir ... OK (/home/alces/.cache/gridware/cache/src/apps/tensorflow_cuda/1.6.0/gcc-4.8.5+python-2.7.8+numpy-1.10.4+setuptools-24.0.1+nvidia-cuda-8.0.61)
         Extract ... OK

 > Proceeding with installation
         Compile ... ERROR: Package compilation failed

   Extract of compilation script error output:
   > bazel-out/host/bin/_solib_local/_U_S_Stensorflow_Scontrib_Stensor_Uforest_Shybrid_Cgen_Utraining_Uops_Upy_Uwrappers_Ucc___Utensorflow/libtensorflow_framework.so: undefined reference to `cuStreamDestroy_v2'
   > bazel-out/host/bin/_solib_local/_U_S_Stensorflow_Scontrib_Stensor_Uforest_Shybrid_Cgen_Utraining_Uops_Upy_Uwrappers_Ucc___Utensorflow/libtensorflow_framework.so: undefined reference to `cuStreamAddCallback'
   > bazel-out/host/bin/_solib_local/_U_S_Stensorflow_Scontrib_Stensor_Uforest_Shybrid_Cgen_Utraining_Uops_Upy_Uwrappers_Ucc___Utensorflow/libtensorflow_framework.so: undefined reference to `cuDevicePrimaryCtxGetState'
   > bazel-out/host/bin/_solib_local/_U_S_Stensorflow_Scontrib_Stensor_Uforest_Shybrid_Cgen_Utraining_Uops_Upy_Uwrappers_Ucc___Utensorflow/libtensorflow_framework.so: undefined reference to `cuEventDestroy_v2'
   > bazel-out/host/bin/_solib_local/_U_S_Stensorflow_Scontrib_Stensor_Uforest_Shybrid_Cgen_Utraining_Uops_Upy_Uwrappers_Ucc___Utensorflow/libtensorflow_framework.so: undefined reference to `cuMemHostAlloc'
   > bazel-out/host/bin/_solib_local/_U_S_Stensorflow_Scontrib_Stensor_Uforest_Shybrid_Cgen_Utraining_Uops_Upy_Uwrappers_Ucc___Utensorflow/libtensorflow_framework.so: undefined reference to `cuDevicePrimaryCtxRelease'
   > collect2: error: ld returned 1 exit status
   > Target //tensorflow/tools/pip_package:build_pip_package failed to build
   > INFO: Elapsed time: 948.929s, Critical Path: 36.01s
   > FAILED: Build did NOT complete successfully

There are some bugs in the system still.

Cliff


#6

Hi Cliff,

Could you let us know what settings you’re using in your $HOME/cuda_vars file please?

With that information, we can attempt to reproduce the problem and suggest a solution.

Thanks,

Mark.


#7

I copied over
/opt/apps/alces/nvidia/cudnn/7.0.5_cuda8.0 from Barkla onto the AWS instance, into a directory named cudnn,

and then created the file cuda_vars containing:

CUDNN_ROOT=/home/alces/cudnn
CUDNN_VERSION=7.0.5
CUDA_VERSION=8.0.61
CUDA_COMPUTE_CAPABILITIES=8.0

After it failed the first time with a message about pip, I then installed the Gridware pip for Python 2.7.8.

(I notice on Barkla that the module description for tensorflow with cuda is:
apps/tensorflow/1.6.0/gcc-4.8.5+python-2.7.8+numpy-1.9.2+setuptools-24.0.1+pip-8.1.2+nvidia-cuda-8.0.61)

However, I have just spotted that I forgot to load the pip module - that might have been the critical mistake.

Cliff


#8

I thought loading the pip module cracked the problem, but alas no:

alces gridware install tensorflow/1.6.0 --variant=cuda cuda_vars=/home/alces/cuda_vars
Preparing to install main/apps/tensorflow/1.6.0
Installing main/apps/tensorflow/1.6.0 (cuda)

WARNING: Build directory already exists:
/home/alces/.cache/gridware/cache/src/apps/tensorflow_cuda/1.6.0/gcc-4.8.5+python-2.7.8+numpy-1.10.4+setuptools-24.0.1+nvidia-cuda-8.0.61

Proceed with a clean?

Proceed (Y/N)? Y
           Clean ... OK

 > Preparing package sources
        Download --> v1.6.0.tar.gz ... SKIP (Existing source file detected)
          Verify --> v1.6.0.tar.gz ... OK

 > Preparing for installation
           Mkdir ... OK (/home/alces/.cache/gridware/cache/src/apps/tensorflow_cuda/1.6.0/gcc-4.8.5+python-2.7.8+numpy-1.10.4+setuptools-24.0.1+nvidia-cuda-8.0.61)
         Extract ... OK

 > Proceeding with installation
         Compile ... ERROR: Package compilation failed

   Extract of compilation script error output:
   > INFO: From ProtoCompile tensorflow/contrib/boosted_trees/proto/split_info.pb.cc:
   > bazel-out/k8-opt/genfiles/external/protobuf_archive/src: warning: directory does not exist.
   > bazel-out/k8-opt/genfiles/external/protobuf_archive/src: warning: directory does not exist.
   > INFO: From ProtoCompile tensorflow/contrib/boosted_trees/proto/tree_config.pb.cc:
   > bazel-out/k8-opt/genfiles/external/protobuf_archive/src: warning: directory does not exist.
   > ERROR: /home/alces/.cache/bazel/_bazel_alces/20f0602d570c51228ab082de43e278d5/external/nccl_archive/BUILD:33:1: error while parsing .d file: /home/alces/.cache/bazel/_bazel_alces/20f0602d570c51228ab082de43e278d5/execroot/org_tensorflow/bazel-out/k8-opt/bin/external/nccl_archive/_objs/nccl/external/nccl_archive/src/reduce.cu.pic.d (No such file or directory)
   > nvcc fatal : Unsupported gpu architecture 'compute_80'
   > Target //tensorflow/tools/pip_package:build_pip_package failed to build
   > INFO: Elapsed time: 744.084s, Critical Path: 17.22s
   > FAILED: Build did NOT complete successfully

More information may be available in the log file:

/home/alces/.cache/gridware/log/builds/apps/tensorflow_cuda/1.6.0/gcc-4.8.5+python-2.7.8+numpy-1.10.4+setuptools-24.0.1+nvidia-cuda-8.0.61/bazel_build.20180809141415.log
[alces@flight-107(GPU-test4) ~]$ module list
Currently Loaded Modulefiles:

  1. services/aws
  2. services/slurm
  3. null
  4. libs/nvidia-cuda/8.0.61/bin
  5. libs/gcc/system
  6. apps/python/2.7.8/gcc-4.8.5
  7. apps/setuptools/24.0.1/python-2.7.8
  8. apps/pip/9.0.1/python-2.7.8

cat cuda_vars
CUDNN_ROOT=/home/alces/cudnn
CUDNN_VERSION=7.0.5
CUDA_VERSION=8.0.61
CUDA_COMPUTE_CAPABILITIES=8.0


#9

Hi Cliff,

Thanks for all those details. We’ve verified that the package builds on both p2.xlarge and p3.2xlarge instances, though we did need to set CUDA_COMPUTE_CAPABILITIES to a different value on each instance type due to the differences in the GPU hardware available.

For p2.xlarge (Tesla K80) we used 3.7:

CUDNN_ROOT=/home/alces/cudnn/7.0.5_cuda8.0
CUDNN_VERSION=7.0.5
CUDA_VERSION=8.0.61
CUDA_COMPUTE_CAPABILITIES=3.7

And for p3.2xlarge (Tesla V100-SXM2-16GB), we used 7.0:

CUDNN_ROOT=/home/alces/cudnn/7.0.5_cuda8.0
CUDNN_VERSION=7.0.5
CUDA_VERSION=8.0.61
CUDA_COMPUTE_CAPABILITIES=7.0

These values can be retrieved by looking at the “CUDA Capability Major/Minor version number” value output by the deviceQuery command that ships with the Nvidia toolkit, e.g.:

[alces@flight-161(markt-gpu) ~]$ module load libs/nvidia-cuda
libs/nvidia-cuda/8.0.61/bin
 |
 OK

[alces@flight-161(markt-gpu) ~]$ $NVIDIA_CUDADIR/toolkit/extras/demo_suite/deviceQuery
/opt/gridware/depots/u/0724aa/el7/pkg/libs/nvidia-cuda/8.0.61/bin/toolkit/extras/demo_suite/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Tesla K80"
  CUDA Driver Version / Runtime Version          9.0 / 8.0
  CUDA Capability Major/Minor version number:    3.7
  Total amount of global memory:                 11440 MBytes (11995578368 bytes)
  (13) Multiprocessors, (192) CUDA Cores/MP:     2496 CUDA Cores
  GPU Max Clock rate:                            824 MHz (0.82 GHz)
  Memory Clock rate:                             2505 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 1572864 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 30
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = Tesla K80
Result = PASS
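
If you just want the capability value rather than the full report, you can filter the deviceQuery output, e.g. (using the same $NVIDIA_CUDADIR path as above):

[alces@flight-161(markt-gpu) ~]$ $NVIDIA_CUDADIR/toolkit/extras/demo_suite/deviceQuery | grep "CUDA Capability"
  CUDA Capability Major/Minor version number:    3.7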

Also, to confirm, this is what our cuDNN tree looks like:

[alces@flight-161(markt-gpu) ~]$ tree /home/alces/cudnn
/home/alces/cudnn
└── 7.0.5_cuda8.0
    ├── include
    │   └── cudnn.h
    ├── lib64
    │   ├── libcudnn.so -> libcudnn.so.7
    │   ├── libcudnn.so.7 -> libcudnn.so.7.0.5
    │   ├── libcudnn.so.7.0.5
    │   └── libcudnn_static.a
    └── NVIDIA_SLA_cuDNN_Support.txt
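
That layout is simply Nvidia’s cuDNN tarball unpacked into a versioned directory. A sketch, assuming the cudnn-8.0-linux-x64-v7.0.5.tgz archive name (check against the file you actually downloaded from Nvidia):

mkdir -p /home/alces/cudnn
tar xzf cudnn-8.0-linux-x64-v7.0.5.tgz    # the archive unpacks into a directory named "cuda"
mv cuda /home/alces/cudnn/7.0.5_cuda8.0   # rename to match CUDNN_ROOT above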

With the cuDNN tree and cuda_vars parameter file in place, we performed the Gridware installation as follows:

al gr inst apps/tensorflow/1.6.0 --variant=cuda cuda_vars=/home/alces/cuda_vars

Once built, we loaded the module. One further step required is to ensure that the cuDNN library directory is on the LD_LIBRARY_PATH; without this, Tensorflow is unable to locate the shared library:

[alces@flight-161(markt-gpu) ~]$ module load apps/tensorflow_cuda/1.6.0
apps/tensorflow_cuda/1.6.0/gcc-4.8.5+python-2.7.8+numpy-1.10.4+setuptools-24.0.1+nvidia-cuda-8.0.61
 | -- libs/gcc/system
 |    * --> OK
 | -- apps/python/2.7.8/gcc-4.8.5
 |    | -- libs/gcc/system ... SKIPPED (already loaded)
 |    * --> OK
 | -- apps/setuptools/24.0.1/python-2.7.8
 |    | -- apps/python/2.7.8/gcc-4.8.5 ... SKIPPED (already loaded)
 |    * --> OK
 | -- libs/numpy/1.10.4/gcc-4.8.5+atlas-3.10.2+python-2.7.8
 |    | -- libs/gcc/system ... SKIPPED (already loaded)
 |    | -- VARIANT libs/atlas/3.10.3
 |    |    | -- libs/gcc/system ... SKIPPED (already loaded)
 |    |    * --> OK
 |    | -- apps/python/2.7.8/gcc-4.8.5 ... SKIPPED (already loaded)
 |    * --> OK
 | -- libs/nvidia-cuda/8.0.61/bin
 |    * --> OK
 |
 OK

[alces@flight-161(markt-gpu) ~]$ export LD_LIBRARY_PATH=/home/alces/cudnn/7.0.5_cuda8.0/lib64:$LD_LIBRARY_PATH

[alces@flight-161(markt-gpu) ~]$ $TENSORFLOW_CUDADIR/python/bin/tensorboard --help

       USAGE: /opt/gridware/depots/u/0724aa/el7/pkg/apps/tensorflow_cuda/1.6.0/gcc-4.8.5+python-2.7.8+numpy-1.10.4+setuptools-24.0.1+nvidia-cuda-8.0.61/python/bin/tensorboard [flags]

Try --helpfull to get a list of all flags.
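
Note that the LD_LIBRARY_PATH export above only lasts for the current shell. To persist it, and to quickly confirm that Tensorflow can see the GPU, something like the following should work (a sketch; it assumes the tensorflow_cuda module is loaded and that the package’s Python interpreter is available as $TENSORFLOW_CUDADIR/python/bin/python, alongside the tensorboard binary shown above):

# Persist the cuDNN library path for future sessions
echo 'export LD_LIBRARY_PATH=/home/alces/cudnn/7.0.5_cuda8.0/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc

# Quick smoke test: a /device:GPU:0 entry should be listed alongside the CPU
$TENSORFLOW_CUDADIR/python/bin/python -c "from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())"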

As an aside: the reason Gridware is unable to provide cuDNN itself is that Nvidia’s licensing conditions restrict its redistribution, hence the few hoops that need to be jumped through in this case!

HTH!

Thanks,

Mark.


#10

OK. That all makes sense. I’ll have a play just to confirm that (won’t be until Monday).

Interesting about the CUDA_COMPUTE_CAPABILITIES setting. Clearly I was misunderstanding that, and I was not copying over all of the cuDNN directory structure!!

Cliff


#11

I can confirm that compilation was successful, but it took ages!

Cliff