
Best Practices from Oracle Development's A‑Team

See How Easily You Can Train Machine Learning Models in Parallel

Rajesh Chawla
Principal Cloud Architect

I have a use case that requires training several hundred models. Each model is trained on data with the same shape. When I ran the training for the first time, each model took about 12 to 15 minutes to train, so the time to train several hundred was estimated at 20 to 25 hours. Unfortunately, this was beyond the time window I had to train all the models in. As I searched for alternatives, I started exploring how to train them in parallel.

I came up with several scenarios for training the models:

  1. Train serially
  2. Python threading
  3. Python Pooling
  4. Ray remote functions
  5. Ray actors
  6. Ray functions / actors in a cluster

In this blog, I’ll share my thoughts on the two Python alternatives, while subsequent blogs will review the work with Ray remote functions, actors, and clusters.

Option 1 is the brute-force approach to the problem. I can get to a solution, but as noted, the amount of time spent is not acceptable.

The idea for option 2 (Python threading) is to spawn a thread for each model trained, and then wait for all of the threads to complete. This approach produced immediate time improvements. As I examined the code, several pros and cons became apparent. The advantages of this approach are:

  • no additional libraries to download or configure.
  • significant performance boost (depending on hardware) over the serial approach.

The downsides of this approach are:

  • I have to manage all the threads myself. That is, if training fails on a thread, how do I know it failed and how do I recover? I could work around this issue by adding code to catch errors and respawn a thread on failure (see the sketch after this list).
  • Depending on how many threads I spawn, I may not use all the hardware at my disposal. I could work around this issue by interrogating the hardware and configuring the number of threads I spawn dynamically.
  • I can scale up but not out. That is, I can continue to improve performance by getting a bigger machine, but managing a cluster of machines is hard.
  • I did not have any distributed programming primitives. That is, state management or message passing semantics between threads are my responsibility.
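As a rough illustration of the first two workarounds, here is a minimal sketch (not the code I actually ran) that catches a training failure and retries it, and that sizes concurrency from the hardware rather than hard-coding a thread count. It assumes the same trainPPO function and backorder_array variable used in the snippet later in this post; the retry count and the semaphore throttle are placeholders for whatever policy a robust solution would really need.

import os
from threading import Semaphore

# Size concurrency from the hardware instead of hard-coding a thread count.
slots = Semaphore(os.cpu_count() or 4)

def train_with_retry(row_id, retries=1):
    # Wrap the training call so a failure inside a thread is caught and
    # retried instead of silently killing the thread.
    with slots:  # never run more training threads than the machine has cores
        for attempt in range(retries + 1):
            try:
                trainPPO(row_id)  # same training function used in the snippet below
                return
            except Exception as exc:
                print(f"training {row_id} failed (attempt {attempt}): {exc}")
        print(f"giving up on {row_id}")

Each thread’s target would then be train_with_retry instead of trainPPO.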

My takeaway from this approach is that it’s trivial to get started and there is a performance boost, but there are quite a few limitations, and it’s not a code base that I think would be fun to maintain.

A brief snippet of the code to execute in parallel is below. The code spawns one thread for each row in the array variable named ‘backorder_array’. The code to train the model is encapsulated in the function trainPPO and is the same across all the scenario tests. The code consists of three loops: the first loop creates the threads, the second starts execution in the threads, and the third waits for all the threads to complete.

from threading import Thread

threads = []

# Create one thread per row; each thread trains one model via trainPPO.
for row in backorder_array:
    t = Thread(target=trainPPO, args=(row[0],))
    threads.append(t)

# Start all the threads.
for x in threads:
    x.start()

# Wait for every thread to finish.
for x in threads:
    x.join()

Using this approach, I was able to train 10 models in 30 minutes, and I expect that with tuning this time would go down further. I ran the code in OCI on a VM.GPU2.1 shape that I spun up using Terraform, as I described in a previous blog.

I abandoned this approach because of the additional work it would take to create a robust solution.

As I looked at Python pooling as an option, I did see a more elegant programming model with Future objects, but I did not see how this approach would address any of the significant issues I had with the first Python approach.
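For reference, here is a minimal sketch of the pooling style I mean, using concurrent.futures from the standard library. The executor type, worker count, and the reuse of trainPPO and backorder_array are assumptions for illustration, not the configuration I benchmarked.

from concurrent.futures import ProcessPoolExecutor, as_completed

# Submit one training job per row; each submit() returns a Future.
with ProcessPoolExecutor(max_workers=4) as pool:
    futures = {pool.submit(trainPPO, row[0]): row[0] for row in backorder_array}
    for future in as_completed(futures):
        row_id = futures[future]
        try:
            future.result()  # re-raises any exception raised in the worker
            print(f"model {row_id} trained")
        except Exception as exc:
            print(f"model {row_id} failed: {exc}")

The Future objects make failures easier to observe than raw threads, but the work still runs on a single machine, which is the scale-out limitation noted above.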

So, I dropped the Python Pool option as well and began to look into the distributed programming approach provided by Ray (https://github.com/ray-project/ray). As a preview for the next blog: I found Ray to have a much steeper learning curve than the Python primitives, but in return it provided the ability to scale up and out, with support for well-defined distributed design patterns.
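To make that preview concrete, Ray’s remote-function pattern looks roughly like the sketch below. It assumes the same trainPPO function and a single-node ray.init(), and it glosses over the cluster setup and actor patterns the next blogs will cover.

import ray

ray.init()  # single node here; the same code can attach to a cluster

@ray.remote
def train_remote(row_id):
    return trainPPO(row_id)  # same training function as before

# Schedule every training task at once; Ray spreads them over available resources.
futures = [train_remote.remote(row[0]) for row in backorder_array]
ray.get(futures)  # block until all tasks complete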
