Capturing customised gridware environments for later cloud use


#1

Having successfully built TensorFlow, I really don’t want to spend over an hour rebuilding it.

What is the best way to capture the relevant information / data for a sane new build of a cloud instance?

I’ve tar’ed and downloaded /opt and alces/gridware, along with the cuda_vars file. cuDNN I can copy over from our local installation.
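
For reference, the manual capture was roughly along these lines (a sketch from memory; exact paths may differ):

tar czf opt.tar.gz /opt    # includes the Gridware tree
# plus the cuda_vars file; cuDNN copied over from our local installation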

I’m asking partly because I now want to add CUDA-configured Theano (and then add Keras to that installation) to match what a local research group requires. The objective is a cluster for their use, with a common file system holding test/training data and the potential to expand compute nodes so each group member has a dedicated instance. But first I need to get the right software in place, and to support occasions where that cluster gets shut down and then restarted as their needs change.

Cliff


#2

Hi @cliffaddison,

There are a few ways of doing this, ranging from easy to complex. We’d recommend that you first try exporting the Gridware package you’ve created with the alces gridware export command - this exports the package as a portable tarball that can be placed in an S3 bucket, then pulled down and installed at a later date.

[root@login1(mycluster) ~]# module avail apps/relion_cudafloat
---  /opt/gridware/local/el7/etc/modules  ---
  apps/relion_cudafloat/2.1/gcc-4.8.5+fftw3_double-3.3.4+fltk-1.3.0+openmpi-1.10.2+cmake-3.5.2+nvidia-cuda-8.0.61
[root@login1(mycluster) ~]# alces gridware export apps/relion_cudafloat
Exporting apps/relion_cudafloat/2.1

 > Export (gcc-4.8.5+fftw3_double-3.3.4+fltk-1.3.0+openmpi-1.10.2+cmake-3.5.2+nvidia-cuda-8.0.61)
         Prepare ... OK
           Ready ... OK

 > Creating archive
         Archive ... OK

Exported apps/relion_cudafloat/2.1 to /tmp/apps-relion_cudafloat-2.1-el7.tar.gz

[root@login1(mycluster) ~]# ls /tmp/apps-relion_cudafloat-2.1-el7.tar.gz 
/tmp/apps-relion_cudafloat-2.1-el7.tar.gz

You can then import this package, which will also download and install its dependencies for you. In this instance some dependencies are already available, but in particular the libs/nvidia-cuda/8.0.61 package is not - Alces Gridware will pull this dependency from the remote repository before importing the archive as a Gridware package.

[root@login1(mycluster) ~]# alces gridware import /tmp/apps-relion_cudafloat-2.1-el7.tar.gz 
Importing apps-relion_cudafloat-2.1-el7.tar.gz

 > Preparing import
         Extract ... OK
          Verify ... OK

 > Processing apps/relion_cudafloat/2.1/gcc-4.8.5+fftw3_double-3.3.4+fltk-1.3.0+openmpi-1.10.2+cmake-3.5.2+nvidia-cuda-8.0.61
       Preparing ... NOTICE: importing requirements
--------------------------------------------------------------------------------
Preparing to install main/libs/nvidia-cuda/8.0.61
Installing main/libs/nvidia-cuda/8.0.61
Importing libs-nvidia-cuda-8.0.61-el7.tar.gz

 > Fetching archive
        Download ... SKIP (Existing source file detected)

 > Preparing import
         Extract ... OK
          Verify ... OK

 > Processing libs/nvidia-cuda/8.0.61/bin
       Preparing ... OK
       Importing ... OK
     Permissions ... OK

 > Finalizing import
          Update ... OK
    Dependencies ... OK

Installation complete.
--------------------------------------------------------------------------------
NOTICE: requirements for apps/relion_cudafloat/2.1 satisfied; proceeding to import
       Importing ... OK
     Permissions ... OK

 > Finalizing import
          Update ... OK
    Dependencies ... OK

You can also do this for package dependencies, and create a script to install them, for cases where the dependencies also take some time to compile, or were created by yourself and are not available in the main or volatile repositories of Alces Gridware.
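
For example, a small wrapper script along these lines could export a set of locally built packages and stage them in an S3 bucket (a rough sketch only - the bucket name and package list are hypothetical, and it assumes the aws CLI is installed and configured):

#!/bin/bash
# Sketch: export locally built Gridware packages and stage them in S3.
# The bucket name and package list are hypothetical - adjust for your site.
BUCKET=s3://mycluster-gridware-exports

for pkg in apps/tensorflow_cuda/1.6.0 apps/relion_cudafloat/2.1; do
    alces gridware export "$pkg"
done

# Exported tarballs land in /tmp by default, as in the output above
aws s3 cp /tmp/ "$BUCKET/" --recursive --exclude '*' --include 'apps-*.tar.gz'

On a new cluster the reverse is simply pulling the tarballs back down with aws s3 cp and running alces gridware import on each one.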

You can also use the alces gridware depot tool to archive collections of packages, which can be uploaded to an S3 bucket and imported when required. We had a community user enquire about this recently and @mjtko provided a great guide on how to get this working - you can read more about that here!
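
In outline, the depot workflow looks something like the sketch below; the depot name is hypothetical, and the exact subcommands are worth checking against alces gridware depot --help and the guide linked above:

# Sketch of the depot approach; 'mlstack' is a hypothetical depot name.
alces gridware depot init mlstack        # create an empty depot
# ...install or import your packages into that depot...
alces gridware depot export mlstack      # archive the whole depot
# Upload the archive to S3; on a fresh cluster, pull it down and install
# the depot so all of its packages become available in one step.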

Hope this helps!
Ruan


#3

Thanks, that sounds more maintainable than me just grabbing copies of /opt and the Alces Gridware folder. I will give it a try.

I was concerned that depots just set up the metadata for a particular configuration, but did not actually create a copy of the installed tools.

Cliff


#4

There are still some issues.

I tried doing the export, but ran into the error:

alces gridware export apps/tensorflow_cuda/1.6.0/gcc-4.8.5+python-2.7.8+numpy-1.10.4+setuptools-24.0.1+nvidia-cuda-8.0.61
Exporting apps/tensorflow_cuda/1.6.0

 > Export (gcc-4.8.5+python-2.7.8+numpy-1.10.4+setuptools-24.0.1+nvidia-cuda-8.0.61)
         Prepare ... OK
           Ready ... ERROR: Package contains hard-coded directory (python/lib/python2.7/site-packages/site.pyc)

I then attempted to go the depot route, but that did not seem capable of exploiting already-installed applications, so I’m faced with another rebuild that takes over an hour. Perhaps for these time-consuming applications it is best to install into a depot from the get-go, and then put a copy of the depot on S3 storage so it can be redeployed on new cluster instances.

Cliff


#5

Hi Cliff,

The error you’re seeing is due to certain files having hard-coded paths within themselves: when the package is imported by Alces Gridware, its path will change according to the depot it is installed into. Python creates .pyc files, which contain pre-compiled bytecode that can be loaded more quickly than compiling the source on the fly, and these files contain hard-coded paths to where the source was compiled. They are often created on the first run, so they can appear if you run the application as root. You’ve got a couple of ways around this:

  • You can supply --ignore-bad to alces gridware export - this will make the tool ignore the problem and export the file anyway. You can then import the archive into a new depot or cluster and check whether the application still works. This often works just fine, as other environment variables are set which override the defaults in the .pyc files.
  • You can supply --patch-binary to alces gridware export - this will cause the tool to patch the binary, replacing the old depot path with the new depot path when alces gridware import is run.
  • You can ensure that the environment variable PYTHONDONTWRITEBYTECODE is set, so that Python does not write bytecode during the installation process or when you’re running the application as root to check that the installation works before an export - as shown in the sketch below.
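
Put together, a retry of the tensorflow_cuda export might look like the sketch below (the strings check is just one way to confirm which paths are baked into the .pyc; run it from inside the package directory in your depot):

# Stop Python writing fresh bytecode during install/test runs
export PYTHONDONTWRITEBYTECODE=1

# From inside the package directory, confirm what is hard-coded in the
# file named by the error message:
strings python/lib/python2.7/site-packages/site.pyc | grep '/opt'

# Retry the export, either ignoring the hard-coded paths...
alces gridware export apps/tensorflow_cuda/1.6.0 --ignore-bad
# ...or asking Gridware to patch them when the package is imported:
alces gridware export apps/tensorflow_cuda/1.6.0 --patch-binary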

Using the above methods, individually or in combination, should hopefully allow you to export and import the package freely.

Hope that helps!

Ruan