I have a use case to train several hundred models. Each model is trained using data with the same shape. When I ran the training for the first time, each model took about 12 to 15 minutes to train, so the time to train several hundred was estimated at 20 to 25 hours. Unfortunately, this was beyond the time window I had to train all the models. As I searched for alternatives, I started exploring how to train in parallel.
I came up with several scenarios for training the models:
1. Train the models serially, one after another (the brute force approach).
2. Train the models in parallel with Python threading.
3. Train the models in parallel with a Python Pool.
4. Train the models with Ray.
In this blog, I’ll share my thoughts on the two Python alternatives, while subsequent blogs will review the work with Ray remote functions, actors, and clusters.
The idea behind option 1 is a brute force approach to the problem: I can get to a solution, but as noted, the time it takes exceeds my window.
The idea behind option 2 (Python threading) is to spawn a thread for each model being trained and then wait for all of the threads to complete. This approach produced an immediate time improvement. As I examined the code, several pros and cons became apparent. The advantages of this approach are:
The downsides of this approach are:
My takeaway from this approach is that it’s trivial to get started and there is a performance boost, but there are quite a few limitations, and it’s not a code base I think would be fun to maintain.
A brief snippet of the code that executes the training in parallel is below. The code spawns one thread for each row in the array variable named ‘backorder_array’. The code to train a model is encapsulated in the function trainPPO and is the same across all the scenarios tested. The snippet consists of three loops: the first loop creates the threads, the second starts execution of the threads, and the third waits for all of the threads to complete.
from threading import Thread
threads = []
for row in backorder_array:   # first loop: create one thread per model
    threads.append(Thread(target=trainPPO, args=(row,)))
for x in threads:             # second loop: start execution of each thread
    x.start()
for x in threads:             # third loop: wait for all threads to complete
    x.join()
Using this approach, I was able to train 10 models in 30 minutes. I expect that with tuning, the time to train the models would go down further. I ran the code in OCI using a VM.GPU2.1 shape that I spun up with Terraform, as I described in a previous blog.
I abandoned this approach because of the additional work it would take to create a robust solution.
As I looked at Python Pool as an option, I saw a more elegant programming model built around Future objects, but I did not see how this approach would address any of the significant issues I had with the first Python approach.
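For context, here is a minimal sketch of what a Pool-based version might look like, assuming the Future-style API from Python’s concurrent.futures module and reusing the trainPPO function and ‘backorder_array’ from the threading example:

from concurrent.futures import ProcessPoolExecutor, as_completed

# Submit one training job per row; each submit() call returns a Future
with ProcessPoolExecutor() as pool:
    futures = [pool.submit(trainPPO, row) for row in backorder_array]
    # Wait for each training run to finish and surface any exception it raised
    for future in as_completed(futures):
        future.result()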
So I dropped the Python Pool option as well and began to look into the approach to distributed programming provided by Ray (https://github.com/ray-project/ray). As a preview for the next blog: I found Ray to have a much steeper learning curve than the Python primitives, but in return it provided the ability to scale up / out with support for well-defined distributed design patterns.
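As a rough preview of what that looks like, a Ray remote-function equivalent of the threading loops above could be sketched like this (again reusing trainPPO and ‘backorder_array’; the next blog covers the details):

import ray

ray.init()

# Wrap the existing training function as a Ray remote task
train_remote = ray.remote(trainPPO)

# Launch one task per row; each .remote() call returns an object reference immediately
refs = [train_remote.remote(row) for row in backorder_array]

# Block until every training task has completed
results = ray.get(refs)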