Choosing which hyper-parameters to optimize is not an easy task, since model performance is more sensitive to some hyper-parameters than to others, and that sensitivity depends on the choice of model.
Low sensitivity: optimizer, batch size, non-linearity.
Medium sensitivity: weight initialization, model depth, layer parameters, weight of regularization.
High sensitivity: learning rate, annealing schedule, loss function, layer size.
Method 1 is manual optimization:
For a skilled practitioner, this may require the least amount of computation to get good results.
However, the method is time-consuming and requires a detailed understanding of the algorithm.
Method 2 is grid search:
Grid search is super simple to implement and can produce good results.
Unfortunately, it’s not very efficient, since we need to train the model on every combination of the hyper-parameter values, so the number of training runs grows multiplicatively with each hyper-parameter added. It also requires prior knowledge about the parameters to get good results.
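As a rough illustration of why the cost adds up, here is a minimal grid search sketch in plain Python. The hyper-parameter names, the grid values, and toy_objective (a stand-in for "train the model and return the validation loss") are assumptions made up for this example, not taken from any particular library.

```python
import itertools
import math


def grid_search(objective, grid):
    """Evaluate objective on every combination in grid and keep the lowest score."""
    names = list(grid)
    best_config, best_score = None, float("inf")
    n_trials = 0
    for values in itertools.product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        score = objective(config)
        n_trials += 1
        if score < best_score:
            best_config, best_score = config, score
    return best_config, best_score, n_trials


def toy_objective(config):
    """Hypothetical stand-in for a training run; lower is better."""
    lr_term = (math.log10(config["learning_rate"]) + 2) ** 2
    size_term = ((config["layer_size"] - 128) / 128) ** 2
    return lr_term + size_term + config["weight_decay"]


# Illustrative grid: 4 x 4 x 3 values means 48 training runs.
grid = {
    "learning_rate": [1e-4, 1e-3, 1e-2, 1e-1],
    "layer_size": [32, 64, 128, 256],
    "weight_decay": [0.0, 1e-4, 1e-2],
}

best_config, best_score, n_trials = grid_search(toy_objective, grid)
print(n_trials, best_config, best_score)
```

Adding a fourth hyper-parameter with five candidate values would raise the count from 48 to 240 runs, which is the inefficiency described above.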
Method 3 is random search:
Random search is also easy to implement and often produces better results than grid search.
But it is not very interpretable and may also require prior knowledge about the parameters to get good results.
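A minimal random search sketch follows, under the same assumptions as before: toy_objective is a hypothetical stand-in for a training run, and the sampling ranges are illustrative. The prior knowledge mentioned above shows up in sample_config, where we have to decide the range of each hyper-parameter and, for the learning rate, that it should be sampled on a log scale.

```python
import math
import random


def random_search(objective, sample_config, n_trials, seed=0):
    """Draw n_trials random configurations and keep the lowest-scoring one."""
    rng = random.Random(seed)
    best_config, best_score = None, float("inf")
    for _ in range(n_trials):
        config = sample_config(rng)
        score = objective(config)
        if score < best_score:
            best_config, best_score = config, score
    return best_config, best_score


def sample_config(rng):
    """Illustrative sampling ranges; choosing them is where prior knowledge enters."""
    return {
        # Log-uniform between 1e-5 and 1e-1.
        "learning_rate": 10 ** rng.uniform(-5, -1),
        "layer_size": rng.randint(16, 512),
    }


def toy_objective(config):
    """Hypothetical stand-in for a training run; lower is better."""
    lr_term = (math.log10(config["learning_rate"]) + 2) ** 2
    size_term = ((config["layer_size"] - 128) / 128) ** 2
    return lr_term + size_term


best_config, best_score = random_search(toy_objective, sample_config, n_trials=50)
print(best_config, best_score)
```

Unlike grid search, the budget (n_trials) is chosen independently of how many hyper-parameters are being searched.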
Method 4 is coarse-to-fine search:
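This strategy starts with a coarse search over a wide range and then repeats the search over a narrower range around the best results, so you narrow in on only the highest-performing hyper-parameters. It is a common practice in the industry.
The only drawback is that it is a somewhat manual process.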
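One way to sketch the idea in code is to run several rounds of random sampling and shrink the range to the span of the best few samples after each round. This is only one possible automation of what is usually done by hand; the function name, the keep-the-top-five rule, and toy_objective (a hypothetical validation loss as a function of the log10 learning rate) are all assumptions for illustration.

```python
import random


def coarse_to_fine(objective, low, high, rounds=3, trials_per_round=20, keep=5, seed=0):
    """Repeatedly sample [low, high], then shrink the range to the best `keep` samples."""
    rng = random.Random(seed)
    history = []
    for _ in range(rounds):
        samples = [rng.uniform(low, high) for _ in range(trials_per_round)]
        scored = sorted((objective(x), x) for x in samples)
        history.extend(scored)
        top = [x for _, x in scored[:keep]]
        # The next round searches only the span covered by the best samples.
        low, high = min(top), max(top)
    best_score, best_x = min(history)
    return best_x, best_score, (low, high)


def toy_objective(log_lr):
    """Hypothetical validation loss as a function of log10(learning rate)."""
    return (log_lr + 2.5) ** 2


best_log_lr, best_score, final_range = coarse_to_fine(toy_objective, low=-5.0, high=-1.0)
print(best_log_lr, best_score, final_range)
```

In practice the narrowing step is where the manual judgment comes in, for example widening a range again when the best values sit at its edge.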
Method 5 is Bayesian optimization search:
Bayesian optimization is generally the most efficient hands-off way to choose hyper-parameters.
But it’s difficult to implement from scratch and can be hard to integrate with off-the-shelf tools.
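To give a feel for what "from scratch" involves, below is a heavily simplified sketch of Bayesian optimization over a single hyper-parameter using NumPy. It assumes a Gaussian process surrogate with a fixed RBF kernel and a lower-confidence-bound acquisition rule; these are choices made for this example (the text above does not name a surrogate or acquisition function), and length_scale, kappa, noise, and toy_objective are all illustrative. Real tools also fit the kernel parameters and handle many dimensions, which is a large part of the difficulty.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=0.5):
    """Squared-exponential kernel between two 1-D arrays of points."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)


def gp_posterior(x_train, y_train, x_query, noise=1e-4):
    """Posterior mean and standard deviation of a zero-mean GP at x_query."""
    k_train = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_cross = rbf_kernel(x_train, x_query)
    solved = np.linalg.solve(k_train, k_cross)
    mean = solved.T @ y_train
    var = 1.0 - np.sum(k_cross * solved, axis=0)  # prior variance of the RBF kernel is 1
    return mean, np.sqrt(np.clip(var, 1e-12, None))


def bayes_opt(objective, low, high, n_init=3, n_iter=10, kappa=2.0, seed=0):
    """Minimize objective on [low, high] with a GP surrogate and an LCB rule."""
    rng = np.random.default_rng(seed)
    candidates = np.linspace(low, high, 200)
    x = rng.uniform(low, high, size=n_init)
    y = np.array([objective(v) for v in x])
    for _ in range(n_iter):
        y_norm = (y - y.mean()) / (y.std() + 1e-9)
        mean, std = gp_posterior(x, y_norm, candidates)
        # Low predicted loss or high uncertainty both make a candidate attractive.
        x_next = candidates[np.argmin(mean - kappa * std)]
        x = np.append(x, x_next)
        y = np.append(y, objective(x_next))
    best = int(np.argmin(y))
    return float(x[best]), float(y[best]), len(y)


def toy_objective(log_lr):
    """Hypothetical validation loss as a function of log10(learning rate)."""
    return (log_lr + 2.5) ** 2


best_log_lr, best_loss, n_evals = bayes_opt(toy_objective, low=-5.0, high=-1.0)
print(best_log_lr, best_loss, n_evals)
```

Each iteration spends one training run where the surrogate predicts either a low loss or high uncertainty, which is what lets the method pick hyper-parameters without manual intervention.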