For deep learning models (neural networks), tuning for optimal performance is time consuming.
To set the context, let's review the different types of configuration levers we have in neural networks. There are two types: parameters and hyperparameters. Parameters are learned from the data; the process of finding them is known as model training.
The second type of configuration lever, hyperparameters, is fixed before model training begins. Examples include the learning rate, the number of clusters, and the number of hidden layers in a deep neural network. The number of these levers grows as the neural network gets larger, and evaluating any change to the hyperparameters requires executing a full model training run. To give a sense of scale, if there are 10 hyperparameters and you wish to evaluate 4 values for each, an exhaustive sweep requires 4^10 = 1,048,576 model training runs. As an example, RoBERTa uses 17 hyperparameters, and in a convolutional neural network (CNN) the number of hyperparameters can be around 20. For context, the number of parameters in a CNN can exceed 62 million.
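The arithmetic behind that scale is simple: with k hyperparameters and v candidate values for each, an exhaustive sweep needs v**k training runs. A quick sketch:

```python
# With k hyperparameters and v candidate values for each, an exhaustive
# sweep must cover every combination: v**k full training runs.
def grid_size(k: int, v: int) -> int:
    return v ** k

print(grid_size(10, 4))  # 1048576
```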
When both types of configuration levers are tuned properly, the resulting model performs better.
A naïve solution for tuning hyperparameters is grid-based search. It has the advantages of a straightforward implementation and training runs that are easy to parallelize. Unfortunately, grid search suffers from the 'curse of dimensionality' and cannot scale to a large number of hyperparameters.
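As a minimal sketch of the idea (the grid values and the stand-in objective below are hypothetical; in practice `evaluate` would be a full training run), grid search is just an exhaustive loop over the Cartesian product of candidate values:

```python
import itertools

# Hypothetical grid: 3 learning rates x 2 batch sizes = 6 training runs.
grid = {
    "lr": [1e-4, 1e-3, 1e-2],
    "batch_size": [32, 64],
}

def evaluate(params):
    # Stand-in for one full training run; returns a score to maximize.
    return -abs(params["lr"] - 1e-3) - abs(params["batch_size"] - 64)

# Evaluate every cell of the grid and keep the best configuration.
best = max(
    (dict(zip(grid, values)) for values in itertools.product(*grid.values())),
    key=evaluate,
)
print(best)  # {'lr': 0.001, 'batch_size': 64}
```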
An improvement on grid-based search is random search, which replaces the discrete values chosen in a grid with values drawn from a distribution for each hyperparameter. Like grid-based search, random search is straightforward to implement and parallelizes well. Unfortunately, random search also suffers from the 'curse of dimensionality' and does not scale well.
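A random search sketch under the same assumptions (the search space below is hypothetical); note that every trial is independent, so all of them could run in parallel:

```python
import random

# Sample each hyperparameter from a distribution instead of a fixed grid.
def sample_config():
    return {
        "lr": 10 ** random.uniform(-5, -1),   # log-uniform over [1e-5, 1e-1]
        "dropout": random.uniform(0.0, 0.5),  # uniform over [0.0, 0.5]
    }

# 20 independent trials; each would be one full training run.
trials = [sample_config() for _ in range(20)]
```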
The next type of solution is Bayesian. A Bayesian solution uses the results of previous runs to inform the choice of hyperparameter values for the next run. When a Bayesian solution is combined with an early-stopping strategy for trials, it can scale to large problems. One such early-stopping strategy is the Asynchronous Successive Halving Algorithm (ASHA). As a starting point, this blog has a good review of ASHA; for more details, see the published papers on ASHA and HyperBand. A significant drawback of Bayesian solutions is that they must run serially.
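The core idea behind ASHA can be illustrated with its synchronous ancestor, successive halving (a toy sketch with a made-up scoring function standing in for real training; ASHA's contribution is making the promotion decisions asynchronous):

```python
# Synchronous successive halving: train many configs briefly, keep the
# best 1/eta fraction, and train the survivors with a larger budget.
def successive_halving(configs, score, budget=1, eta=2, rounds=3):
    for _ in range(rounds):
        ranked = sorted(configs, key=lambda c: score(c, budget), reverse=True)
        configs = ranked[: max(1, len(ranked) // eta)]  # keep the top 1/eta
        budget *= eta                                   # survivors train longer
    return configs

# Hypothetical scorer: higher-"quality" configs score better as budget grows.
result = successive_halving(
    [{"quality": q} for q in range(16)],
    score=lambda c, b: c["quality"] * b,
)
print(result)  # [{'quality': 15}, {'quality': 14}]
```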
Finally, we come to Population Based Training (PBT). PBT starts its training runs in parallel. At designated intervals, PBT halts all active training runs; upon halting, the weights of the poorly performing runs are replaced with perturbed copies of the best runs. In this way, PBT focuses the search on the areas of the space where promising results have been found. By copying the weights, PBT makes use of information from previous runs while still providing parallelism. A comparison was done between the different types of optimization techniques, with results and cloud costs; additional testing information is available in the PBT paper and the blog from DeepMind.
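The halt-and-replace step is often called "exploit and explore". A simplified illustration (this is not Ray Tune's implementation; the member dictionaries and perturbation rule are hypothetical):

```python
import copy
import random

# PBT exploit/explore: at a sync point, bottom performers copy the weights
# of a top performer (exploit) and perturb its hyperparameters (explore).
def exploit_and_explore(population, perturb=0.2, frac=0.25):
    population.sort(key=lambda m: m["score"], reverse=True)
    cutoff = max(1, int(len(population) * frac))
    for loser in population[-cutoff:]:
        winner = random.choice(population[:cutoff])
        loser["weights"] = copy.deepcopy(winner["weights"])  # exploit
        # Explore: multiply the copied learning rate by 0.8 or 1.2.
        loser["lr"] = winner["lr"] * random.choice([1 - perturb, 1 + perturb])
    return population
```

With a population of four members, the worst run inherits (a perturbed copy of) the best run's state at each interval, while the middle of the population trains on undisturbed.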
To use PBT in practice, I provisioned a VM with GPUs in Oracle Cloud using the Terraform automation available in the GitHub Repo. I used Ray Tune (version 2.0.0.dev0) and first followed the tutorial for ASHA. I used TensorFlow (versions 2.1.3 and 2.1.4) as well as PyTorch (version 1.5). While the tutorial focuses on PyTorch, I had no issues implementing the same changes for TensorFlow.
Finally, while the PBT example is more involved, implementing it was relatively straightforward. One annoyance is that when using PyTorch with Ray Tune, hparams are not automatically populated for TensorBoard. Another hiccup with Ray: when using environments based on OpenAI Gym, you must register the environment with register_env, otherwise the environment is not recognized by the rollout workers.
from ray.tune.registry import register_env
register_env("Environment name here", lambda config: env(config))
I configured the ranges for the hyperparameters as:
import random

from ray import tune

config["clip_param"] = tune.sample_from(lambda spec: random.uniform(0.1, 0.5))
config["lambda"] = tune.sample_from(lambda spec: random.uniform(0.9, 1.0))
config["lr"] = tune.sample_from(lambda spec: random.uniform(1e-5, 1e-3))
config["train_batch_size"] = tune.sample_from(
    lambda spec: random.randint(1000, 60000))
config["num_sgd_iter"] = tune.sample_from(lambda spec: random.randint(1, 30))
config["sgd_minibatch_size"] = tune.sample_from(
    lambda spec: random.randint(128, 16384))
To initiate the tuning run, I used the following snippet:
analysis = tune.run(
    "PPO",
    name="{}_{}_ray_{}_tf_{}".format(
        timelog, "PPO", ray.__version__, tf.__version__),
    scheduler=pbt,
    num_samples=num_samples,
    metric="episode_reward_mean",
    checkpoint_freq=1,
    checkpoint_score_attr="episode_reward_mean",
    mode="max",
    verbose=1,
    stop={"training_iteration": 50},
    config=config)
Using Ray Tune, it is practical to implement state-of-the-art hyperparameter tuning.