Deploy a Machine Learning Environment in OCI IaaS Like a Pro

October 6, 2020 | 3 minute read
Rajesh Chawla
Principal Cloud Architect

I found myself continually creating and tearing down IaaS resources for my machine learning experiments. Each time, I would do things just a bit differently, such as naming resources inconsistently, or forget a step, such as turning on backups for the disks. So, to save myself time, I automated the procedure with Terraform.

My goal is to create a standalone environment to run experiments in. The environment should have the compute power I need (CPU / GPU), have Jupyter notebooks configured, perform backups automatically and, finally, be locked down from a networking / security point of view.

As I wished to leverage all the tools at my disposal, I chose to write the automation in Terraform and use the existing Data Science VM from the HPC team in the OCI Marketplace. The HPC team has done the work to ensure the GPU drivers are configured properly and has pre-installed many of the packages I wanted to work with, including TensorFlow, PyTorch and Keras. In the automation I wrote, I added a few more things that I use, namely connectivity to ATP / ADW. This required the Oracle client libraries as well as the cx_Oracle Python package. See the cx_Oracle overview for more information. I also found that a bit of tweaking resulted in better performance. As with all performance tuning, your mileage will vary, but for my data, I found the most value in tuning the arraySize and preFetch parameters.
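As a minimal sketch of that kind of tuning, the helper below sets the two fetch-related attributes on a cursor before running a query. It assumes cx_Oracle 8 or later, where cursors expose `arraysize` and `prefetchrows`; the helper name `fetch_all_tuned` and the default values are hypothetical placeholders, not recommended settings.

```python
def fetch_all_tuned(connection, sql, arraysize=1000, prefetchrows=1001):
    """Run a query with larger fetch buffers and return all rows.

    `connection` is any DB-API style connection whose cursors expose
    `arraysize` and `prefetchrows` (assumed here: cx_Oracle 8+).
    The default values are illustrative starting points only.
    """
    cursor = connection.cursor()
    try:
        # Number of rows fetched per round trip while reading results.
        cursor.arraysize = arraysize
        # Number of rows fetched as part of executing the query itself.
        cursor.prefetchrows = prefetchrows
        cursor.execute(sql)
        return cursor.fetchall()
    finally:
        # Always release the cursor, even if the query fails.
        cursor.close()
```

With a real database, the connection would come from `cx_Oracle.connect(user=..., password=..., dsn=...)`. Measure with your own data before settling on values, since the best settings depend on row size and network latency.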

For the Python packages I added, I placed them in the 'mlenv' Python environment, which was preconfigured in the Data Science VM. I also took the step of registering the kernel with the Jupyter notebooks via the Python package ipykernel. This way, I could choose the 'mlenv' kernel from the Jupyter notebooks.
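The kernel registration can be sketched as follows. This is an illustration rather than the exact command the automation runs: it assumes ipykernel is installed in the environment being registered, and the helper names are hypothetical.

```python
import subprocess
import sys


def kernel_install_command(env_name, python=sys.executable):
    """Build the ipykernel command that registers `env_name` with Jupyter."""
    return [
        python, "-m", "ipykernel", "install",
        "--user",                    # install the kernel spec for the current user
        "--name", env_name,          # internal kernel name
        "--display-name", env_name,  # label shown in the notebook kernel menu
    ]


def register_kernel(env_name):
    """Run the registration; call this from inside the target environment."""
    subprocess.run(kernel_install_command(env_name), check=True)
```

Calling `register_kernel("mlenv")` with the interpreter from the 'mlenv' environment should make that kernel selectable in the notebook UI.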

As another safety precaution, given my tendency to fat-finger things, I turned on backups for all the disks I allocated to the VM. By default, I apply the silver backup policy, but you can easily change this to gold or bronze.

The network I created for this is straightforward. The environment consists of a VCN, two subnets, a route table and ingress rules that are locked down except for port 22. You can see the environment in the diagram below.

I wanted the system to be secure, but I really didn't want to hassle with firewall rules. I also needed to access this environment from multiple places on the Internet. Since the environment was for my individual use and a few of my colleagues, I opted for SSH tunneling as a way to get access to the Jupyter notebooks. I realize this is not a viable solution for larger teams; for that use case, I suggest that Oracle Data Science or Oracle Machine Learning may be better tools.
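Since port 22 is the only port open, the tunnel can be sketched with SSH local port forwarding. The helper below only builds the command. Its name, the `opc` login user, and port 8888 for Jupyter are assumptions for illustration, so check what your VM image actually uses.

```python
import os


def ssh_tunnel_command(host, key_path, user="opc",
                       local_port=8888, remote_port=8888):
    """Build an ssh command that forwards a local port to Jupyter on the VM.

    Assumptions: the VM accepts key-based logins for `user`, and Jupyter
    listens on `remote_port` on the VM's loopback interface.
    """
    return [
        "ssh",
        "-i", os.path.expanduser(key_path),  # private key used to log in
        "-N",                                # no remote command, forwarding only
        "-L", f"{local_port}:localhost:{remote_port}",
        f"{user}@{host}",
    ]
```

Passing the resulting list to `subprocess.run` keeps the tunnel open until it is interrupted. While it runs, the notebooks are reachable at `http://localhost:8888` on the local machine, and no additional ingress rule is needed.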

To deploy this solution, grab the code from the GitHub quick start and follow the instructions in the README. If you are familiar with Terraform, this should be an easy configuration for you. If you're not familiar with Terraform, you can review introductory materials from HashiCorp or check out this course about Terraform on OCI. After you’ve downloaded the Terraform code, the next steps boil down to:

  • Get an OCI account
  • Configure public / private keys
  • Tweak the configuration files to suit your needs
  • Execute the Terraform
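
The last step above can be sketched as the usual Terraform init / plan / apply sequence. This is a generic illustration, not the procedure from the README, which takes precedence. The helper names and the `tfplan` file name are hypothetical.

```python
import subprocess


def terraform_commands(plan_file="tfplan"):
    """Return the standard Terraform command sequence as argument lists."""
    return [
        ["terraform", "init"],                        # download providers and modules
        ["terraform", "plan", f"-out={plan_file}"],   # preview and save the changes
        ["terraform", "apply", plan_file],            # apply exactly the saved plan
    ]


def run_terraform(workdir, plan_file="tfplan"):
    """Run each command in `workdir`, stopping at the first failure."""
    for command in terraform_commands(plan_file):
        subprocess.run(command, cwd=workdir, check=True)
```

Saving the plan to a file and applying that file means the changes that were reviewed are the changes that get applied.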

This automation has saved me quite a bit of time, and I’m curious to hear whether it has value for you as well.

Rajesh Chawla is a Principal Cloud Solution Architect at Oracle focused on machine learning & IaaS.

