Installing TensorFlow with Python 3 on EC2 GPU Instances

Chris Conley
Published in EatCodePlay
Sep 28, 2016 · 7 min read


Update May 2016: We’ve automated much of this tutorial with additional improvements: https://github.com/RealScout/deep-learning-images.

Overview

The following steps will get you up and running with GPU-enabled TensorFlow on an Ubuntu 14.04 GPU-backed EC2 instance. We’ll also walk through installing Anaconda with Python 3.4, so that all of the important data packages (numpy, sklearn, jupyter, etc.) are at your disposal.

From start to finish, installation should take ~75 minutes.

What this guide covers

  • Provisioning the GPU EC2 instance
  • Installing and configuring CUDA and cuDNN
  • Installing Anaconda and Python 3.4
  • Building TensorFlow from source
  • Configuring Jupyter Notebook

Credits

These steps are taken almost verbatim from Erik Bern’s awesome TensorFlow installation gist. (His blog is a must-read, by the way.) Many thanks to the commenters on the gist for their clarifications as well.

According to Erik’s gist, most/all of the CUDA steps were taken from Traun Leyden’s great post on getting CUDA 6.5 up and running on AWS.

Finally, credit for the Jupyter EC2 configuration goes to another excellent blog post by Jie Yang.

Provision Instance

First things first — let’s provision a Linux GPU EC2 Instance.

AMI and Instance Type

We’ll use Amazon’s Ubuntu 14.04 (HVM) public AMI, ami-06116566, and the g2.2xlarge instance type.

Disk Size

Traun recommends provisioning a 20GB+ volume. We used 30GB when recreating the steps below and had no troubles.

Security Group

Make sure you open up port 22 for ssh and ports 443 and 8888 for accessing Jupyter notebooks. You can allow access from any IP (as shown below) or restrict it to specific addresses if you prefer.

Type              Protocol   Port Range   Source
SSH               TCP        22           0.0.0.0/0
HTTPS             TCP        443          0.0.0.0/0
Custom TCP Rule   TCP        8888         0.0.0.0/0
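If you’d rather script the provisioning, a rough equivalent with the AWS CLI might look like the sketch below. This isn’t part of the original steps: the tf-tutorial security group and my-keypair key pair are placeholder names, and you’d run it in a region where ami-06116566 is available (and where --security-groups by name resolves, e.g. a default VPC).

# local
# Sketch only: "tf-tutorial" and "my-keypair" are placeholders
aws ec2 create-security-group \
  --group-name tf-tutorial \
  --description "TensorFlow GPU tutorial"
for port in 22 443 8888; do
  aws ec2 authorize-security-group-ingress \
    --group-name tf-tutorial --protocol tcp --port $port --cidr 0.0.0.0/0
done
aws ec2 run-instances \
  --image-id ami-06116566 \
  --instance-type g2.2xlarge \
  --key-name my-keypair \
  --security-groups tf-tutorial \
  --block-device-mappings '[{"DeviceName":"/dev/sda1","Ebs":{"VolumeSize":30}}]'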

Install Dependencies

Let’s create a screen session and jump into the newly provisioned server to install the basic dependencies.

Log in

# local
ip=53.32.222.185 # The ip address of your ec2 instance
user=ubuntu
ssh -t $user@$ip "screen -dR setup"

Update Packages

Next, we’ll update the apt package index, upgrade the installed packages, and add a few more that we’ll need for compiling later.

# ec2
sudo apt-get update
sudo apt-get upgrade -y
sudo apt-get install -y build-essential git swig default-jdk zip zlib1g-dev

Setup CUDA

After we take care of some prerequisites, we’ll install CUDA 7.0 and cuDNN 6.5 and be on our way.

Prerequisites

First, we need to blacklist Nouveau, which conflicts with the NVIDIA driver.

# ec2
echo -e "blacklist nouveau\nblacklist lbm-nouveau\noptions nouveau modeset=0\nalias nouveau off\nalias lbm-nouveau off\n" | sudo tee /etc/modprobe.d/blacklist-nouveau.conf echo options nouveau modeset=0 | sudo tee -a /etc/modprobe.d/nouveau-kms.conf sudo update-initramfs -u sudo reboot

Once the server comes back up, let’s log back in to install another package per Traun and reboot.

# local
ssh -t $user@$ip "screen -dR setup"
# ec2
sudo apt-get install -y linux-image-extra-virtual
sudo reboot

One more time:

# local
ssh -t $user@$ip "screen -dR setup"
# ec2
sudo apt-get install -y linux-source linux-headers-`uname -r`

Install CUDA 7.0

First, download the install script and verify the checksum.

wget http://developer.download.nvidia.com/compute/cuda/7_0/Prod/local_installers/cuda_7.0.28_linux.run
cs=$(md5sum cuda_7.0.28_linux.run | cut -d' ' -f1)
if [ "$cs" != "312aede1c3d1d3425c8caa67bbb7a55e" ]; then echo "WARNING: Unverified MD5 hash"; fi

Next, we need to run the NVIDIA installation scripts.

chmod +x cuda_7.0.28_linux.run
./cuda_7.0.28_linux.run -extract=`pwd`/nvidia_installers
pushd nvidia_installers
sudo ./NVIDIA-Linux-x86_64-346.46.run
# When prompted you'll need to:
# * Agree to the license terms
# * Accept the X Library path and X module path
# * Accept that 32-bit compatibility files will not be installed
# * Review and accept the libvdpau and libvdpau_trace libraries notice
# * Choose `Yes` when asked about automatically updating your X configuration file
# * Verify successful installation by choosing `OK`
sudo modprobe nvidia

Now we can run the CUDA installation script.

sudo ./cuda-linux64-rel-7.0.28-19326674.run
# When prompted, you'll need to:
# * Accept the license terms (long scroll, page down with `f`)
# * Use the default installation path
# * Answer `n` to desktop shortcuts
# * Answer `y` to create a symbolic link

Install cuDNN 6.5

Note: You’ll need to register to get accepted into the NVIDIA Accelerated Computing Developer Program and download the cuDNN v2 Library for Linux to your local workstation. (We were accepted within one business day.)

Let’s log out, scp the file up to the server, then log back in.

# ec2
exit

# local
scp ~/Downloads/cudnn-6.5-linux-x64-v2.tgz $user@$ip:~/
ssh -t $user@$ip "screen -dR setup"

Then installation is as simple as copying a few files.

# ec2
tar -xzf cudnn-6.5-linux-x64-v2.tgz
sudo cp cudnn-6.5-linux-x64-v2/libcudnn* /usr/local/cuda/lib64
sudo cp cudnn-6.5-linux-x64-v2/cudnn.h /usr/local/cuda/include/

Install Python 3.4 and TensorFlow

Erik mentions that he had some trouble with disk space while building TensorFlow, so we’ll follow his advice and point /tmp at /mnt/tmp before we start the installation.

sudo mkdir /mnt/tmp
sudo chmod 777 /mnt/tmp
sudo rm -rf /tmp
sudo ln -s /mnt/tmp /tmp

Install Anaconda with Python 3.4

Anaconda comes with many, many packages and is a breeze to install. Version 2.4.1 ships with Python 3.5.1, so we’ll need to configure it to use Python 3.4, since TensorFlow doesn’t currently support 3.5.

First, we can download the install script and verify the checksum.

wget https://3230d63b5fc54e62148e-c95ac804525aac4b6dba79b00b39d1d3.ssl.cf1.rackcdn.com/Anaconda3-2.4.1-Linux-x86_64.sh
cs=$(md5sum Anaconda3-2.4.1-Linux-x86_64.sh | cut -d' ' -f1)
if [ "$cs" != "45249376f914fdc9fd920ff419a62263" ]; then echo "WARNING: Unverified MD5 hash"; fi

Next, run the install script.

bash Anaconda3-2.4.1-Linux-x86_64.sh
# When prompted you'll need to:
# * Accept the license terms
# * Choose the default installation location `/home/ubuntu/anaconda3`
# * Allow the installer to add the Anaconda installation path to your `/home/ubuntu/.bashrc`
source ~/.bashrc

Finally, we install Python 3.4 and Pip.

conda install anaconda python=3.4 -y
conda install pip -y

Install JDK 8

Recent versions of Bazel require JDK 8, so let’s install it.

sudo add-apt-repository ppa:openjdk-r/ppa
# When prompted you'll need to press ENTER to continue
sudo apt-get update
sudo apt-get install -y openjdk-8-jdk

We’ll then need to set the system default to Java 8. When prompted, choose the number corresponding to Java 8 (it should be #2 in both cases).

sudo update-alternatives --config java
sudo update-alternatives --config javac
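To confirm the switch took effect, you can check the reported versions; both should now be a 1.8.x build.

java -version    # should report a 1.8.x (OpenJDK 8) build
javac -version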

Build Bazel

Here, we clone the repo, compile the 0.1.4 release, and copy the resulting binary to /usr/bin.

cd /mnt/tmp
git clone https://github.com/bazelbuild/bazel.git
cd bazel
git checkout tags/0.1.4
./compile.sh
sudo cp output/bazel /usr/bin

Build Tensorflow

I recommend adding these two exports to ~/.bashrc: they need to be present when building, as well as later whenever you use TensorFlow in your Python scripts.

export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda/lib64"
export CUDA_HOME=/usr/local/cuda
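One way to persist them, assuming you’re comfortable appending to ~/.bashrc directly (a minimal sketch):

echo 'export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda/lib64"' >> ~/.bashrc
echo 'export CUDA_HOME=/usr/local/cuda' >> ~/.bashrc
source ~/.bashrc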

With our environment set up, we can clone the tensorflow repo. (The commit hash shown below was the latest commit on master at the time of writing.)

cd /mnt/tmp
git clone --recurse-submodules https://github.com/tensorflow/tensorflow
cd tensorflow
git checkout 437646b1495266cd4e5058f96bb06098785f4d4e

And now on to configuring the build.

TF_UNOFFICIAL_SETTING=1 ./configure
# When prompted, you'll need to:
# * Accept the default `anaconda3` python location
# * Build with GPU support
# * Accept the defaults for Cuda SDK and Cudnn versions and locations
# * Specify Cuda compute device capability of `3.0`
#   **Don't accept the default of `3.5,5.2`**

And finally, build TensorFlow with Bazel.

bazel build -c opt --config=cuda //tensorflow/cc:tutorials_example_trainer

With TensorFlow built, we can now create and install the Pip package.

bazel build -c opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
pip install /tmp/tensorflow_pkg/tensorflow-0.6.0-py3-none-any.whl

Test it out!

cd /mnt/tmp/tensorflow/tensorflow/models/image/cifar10/
python cifar10_multi_gpu_train.py

If everything has been configured correctly, you should see references to TensorFlow using device /gpu:0 when running the script above.
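For a quicker sanity check than the full CIFAR-10 run, something like the following (a sketch, not part of the original steps) should log op placement and mention /gpu:0 if TensorFlow can see the GPU:

python -c "
import tensorflow as tf
# log_device_placement prints the device each op is assigned to
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
print(sess.run(tf.matmul(a, a)))
"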

Jupyter Notebook Setup

Let’s log in to the server from local again and create a new screen session to run Jupyter in.

# local
ssh -t $user@$ip "screen -dR notebook"

Configure Jupyter

First, generate the initial config and create the password you’ll use to log in to Jupyter in the browser.

# ec2
jupyter notebook --generate-config
key=$(python -c "from notebook.auth import passwd; print(passwd())")
# When prompted you'll need to specify a password

Next, we’ll create the certificate.

cd ~
mkdir certs
cd certs
certdir=$(pwd)
openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout mycert.key -out mycert.pem
# You'll be prompted to enter values for the cert,
# but it doesn't much matter, you can just leave them all blank.

And finally, set up the Jupyter config with our cert, password, and port.

cd ~
sed -i "1 a\
c = get_config()\\
c.NotebookApp.certfile = u'$certdir/mycert.pem'\\
c.NotebookApp.ip = '*'\\
c.NotebookApp.open_browser = False\\
c.NotebookApp.password = u'$key'\\
c.NotebookApp.port = 8888" .jupyter/jupyter_notebook_config.py

Start Jupyter

Now let’s run Jupyter with the configuration we just created. After it starts running, you can hit ctrl-a then d to disconnect from your screen session while leaving jupyter running.

Be sure you’ve added LD_LIBRARY_PATH and CUDA_HOME to your environment.

mkdir -p ~/notebooks
cd ~/notebooks
jupyter notebook --certfile=~/certs/mycert.pem --keyfile ~/certs/mycert.key

Open up Jupyter in the browser

# local
open https://$ip:8888

You’ll get a security warning because the certificate you created above is self-signed and not verified. To get through, click the Advanced link and proceed. At that point, you can enter the password you specified above.

Fini!

That was quite a lot, but we now have TensorFlow talking to our GPU, Python 3.4, all the deep learning packages we could ever want, and the wonderful Jupyter notebook environment to explore it all in.

Troubleshooting

ImportError: libcudart.so.7.0: cannot open shared object file: No such file or directory

If you receive the error above when trying to import tensorflow, make sure you’ve correctly set up your environment with LD_LIBRARY_PATH and CUDA_HOME.
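A quick way to check (a sketch): confirm both variables are set in the same shell that runs Python, then retry the import.

echo $CUDA_HOME          # should print /usr/local/cuda
echo $LD_LIBRARY_PATH    # should include /usr/local/cuda/lib64
python -c "import tensorflow; print(tensorflow.__version__)"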

Originally published at eatcodeplay.com by Chris Conley.
