Everyone who’s done a Deep Learning project knows how long it takes to train a single model, let alone to optimize it.
As a current student, I don’t have access to any GPUs, which sometimes leads to frustration because I can spend most of my time just waiting for the model to see all the training batches.
As you might have heard, GPUs can considerably decrease training time in Deep Learning projects. Why is that? And what are GPUs? In a nutshell, a GPU (graphics processing unit) is a processor, now integrated into most laptops, designed to efficiently manipulate computer graphics and images. Its highly parallel structure allows it to perform far more computations in parallel than the usual CPU (central processing unit). This is why GPUs have become the popular kid when it comes to neural networks (NN): training a neural network requires a lot of matrix multiplications, and with an increasing number of neurons per layer, these matrices quickly become very large. A CPU performs the operations of these matrix multiplications sequentially, whereas a GPU can handle many of them in parallel. That’s a fairly simple explanation of why people want GPUs for their Deep Learning projects.
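To make the scale concrete, here is a minimal NumPy sketch of the matrix multiply behind a single dense layer's forward pass. The layer sizes are made up for illustration:

```python
import numpy as np

# Hypothetical sizes: a batch of 64 examples through a 1024-unit dense layer.
batch = np.random.rand(64, 1024)      # activations from the previous layer
weights = np.random.rand(1024, 1024)  # one weight matrix of the network

# One forward pass is a single (64 x 1024) @ (1024 x 1024) matrix multiply:
# roughly 67 million multiply-adds, which a GPU can spread over thousands
# of cores while a CPU works through far fewer of them at a time.
activations = batch @ weights
print(activations.shape)  # (64, 1024)
```

And this is just one layer of one forward pass; training repeats this (plus the backward pass) for every batch, every epoch.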
After this little introduction, you might wonder how to get hold of such powerful tools for your next project.
Here are some steps on how to easily leverage IBM’s Watson Machine Learning service to use GPUs — completely free up to a certain number of training hours. This process proved easy, and it’s perfect for anyone who wants to quickly train a model and play around.
- Set up an IBM Cloud Account on the Watson page.
- Go to the IBM Cloud catalog to see the different available services.
- To train a neural network, we need three things: data, code, and an environment that combines the two.
We’ll use the same categories to set up our project in the cloud and for each one we’ll use a service on a freely available light-weight plan.
- First, we’ll need a place to store our training/testing data that makes it available on the cloud. For that purpose, we use a light-weight instance of Cloud Object Storage (COS) which we can create here. After initializing this instance, we can find it in our Cloud Dashboard under Services. Create two buckets in that instance: one for storing our training/test data and one for saving our model results after the training is done.
We can then upload our data to the assigned bucket.
- Next, create a lite version of Watson Studio. Here, you create a new project. Inside this project, we’ll later create experiments that run our training code. We’ll come back to this later.
- The last instance we need is a Machine Learning service. We can use the lite plan again, which is enough to start training. This service provides the GPUs we need.
FINALLY! That was a lot of instances to set up! However, each of them plays a part in our training, as mentioned before: COS for our data, the Machine Learning service for our code, and Watson Studio as the environment that combines the two.
Now that you’re ready to train your model, there’s just one more thing left to do: adjust the paths in our code so it accesses the data in our COS buckets and saves the results correctly.
Python Code Modifications
We need paths for our data:
- Input Data Directory:
You can access this directory using the DATA_DIR variable. This folder contains the training data that we uploaded to our Data bucket.
All the other paths refer to the bucket we created at the beginning for saving our objects:
- Results/Model Directory:
You can access this directory using the RESULT_DIR variable. We use it if we want to save our model as output.
- Checkpoint Directory:
Use this directory to save the checkpoints of the model and access it using the CHECKPOINT_DIR variable.
- TensorBoard output when using TensorFlow:
Save your TensorBoard metrics into the folder defined by the LOG_DIR variable, with the ‘log/tb/’ path appended to it. That way the metrics can be displayed in Watson Studio, and you can download the event file from your bucket at any point.
Example for adding each directory:
It’s easy to replace the folders we had in our code with the new ones defined above. This is the only minor change needed for our code to run. This example was for TensorFlow users; this guide can help you implement the code for a Keras example, which is very similar.
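The replacement can look like the following sketch. I'm assuming here that the training environment exposes the directories as environment variables named exactly as above; the "." fallbacks and the file names (train.csv, the model folder) are placeholders for local testing:

```python
import os

# Directories provided by the training run (assumed to be environment
# variables; the "." fallbacks are only for running the code locally).
data_dir = os.environ.get("DATA_DIR", ".")              # uploaded training data
result_dir = os.environ.get("RESULT_DIR", ".")          # final model output
checkpoint_dir = os.environ.get("CHECKPOINT_DIR", ".")  # training checkpoints

# TensorBoard events go under LOG_DIR with 'log/tb/' appended,
# so Watson Studio can pick them up and display the metrics.
log_dir = os.path.join(os.environ.get("LOG_DIR", "."), "log", "tb")

# Hypothetical file names to illustrate how the paths are used:
train_path = os.path.join(data_dir, "train.csv")
model_path = os.path.join(result_dir, "model")
print(log_dir)
```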
Now, we only need to upload our code and let it run. For this purpose, we’ll use the Watson Studio we created earlier and click on Get Started.
Enter your project and go to Settings:
We want to associate our Machine Learning Instance with our Project. In the Settings tab, go to Associated services and click Add Service Watson.
In the new window, select the Machine Learning Service and click Add.
Once that’s done, you can select your existing instance and link it to your project.
Now we’ve linked our instance to the project! Go to the Assets tab in the project and create a new experiment by clicking New Experiment.
When creating a new experiment, you can define its name and its Cloud Object Storage. Select the bucket holding the training data by clicking Select and creating a new connection. To create the connection, use the COS instance we initialized earlier and choose the bucket that contains our training files:
Now, we do the same for the results: click Select, reuse the same connection, and choose the bucket we defined earlier for storing our results.
We now add the training definition. Here, we can define the name, upload the code (as a zip file), choose our GPU, select the Framework, and create the training definition:
Our training source code can also consist of multiple files; they should all be included in the same .zip file.
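Bundling the sources can be done with Python's standard zipfile module. The file names below are placeholders for your actual training scripts:

```python
import pathlib
import zipfile

# Placeholder source files for illustration; use your real training scripts.
sources = ["train.py", "model.py"]
for name in sources:
    pathlib.Path(name).touch()

# All files go at the root of one .zip archive, which is what we upload
# when creating the training definition.
with zipfile.ZipFile("training.zip", "w") as zf:
    for name in sources:
        zf.write(name)

print(zipfile.ZipFile("training.zip").namelist())  # ['train.py', 'model.py']
```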
Now click Create and run, and off we go! You can check your results in the Cloud Object Storage or, if you wrote TensorBoard files, those metrics should show up in your experiment.
Have fun when experimenting with Neural Networks and GPUs!