Hyperparameter tuning can dramatically affect the training stability and final performance of large-scale models. Recent works on neural network parameterisations, such as μP, have enabled the transfer of optimal global hyperparameters across model sizes. These works suggest an empirical practice of searching for optimal global base hyperparameters at a small model size and transferring them to a large size. We extend these works in two key ways. Firstly, to address scaling along the most important scaling axes, we propose the Complete(d) Parameterisation, which unifies scaling in width and depth (using an adaptation of CompleteP) as well as in batch size and training duration. Secondly, with our parameterisation, we investigate per-module hyperparameter optimisation and transfer. We characterise the empirical challenges of navigating the high-dimensional hyperparameter landscape, and propose practical guidelines for tackling this optimisation problem. We demonstrate that, with the right parameterisation, hyperparameter transfer holds even in the per-module hyperparameter regime. Our study covers an extensive range of optimisation hyperparameters of modern models: learning rates, AdamW parameters, weight decay, initialisation scales, and residual block multipliers. Our experiments reveal significant training speed improvements in Large Language Models with the transferred per-module hyperparameters.
- † University of Cambridge
- ** Work done while at Apple
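
To make the transfer practice described above concrete, here is a minimal sketch, under assumed values and rules, of μP-style per-module learning-rate transfer: base learning rates tuned at a small proxy width are rescaled for a larger target width. The module names, base values, and the 1/width-multiplier rule for matrix-like parameters are illustrative assumptions, not the paper's exact parameterisation.

```python
# Illustrative sketch of muP-style per-module learning-rate transfer.
# All names, base values, and scaling rules below are assumptions for
# illustration, not the parameterisation proposed in the paper.

BASE_WIDTH = 256      # proxy width at which base hyperparameters were tuned
TARGET_WIDTH = 4096   # width of the large model we transfer to

# Per-module base learning rates found by search at BASE_WIDTH (assumed values).
base_lr = {
    "embedding": 3e-3,   # input embedding (vector-like)
    "attention": 1e-3,   # attention projection matrices
    "mlp": 1e-3,         # feed-forward matrices
    "readout": 3e-3,     # output / unembedding layer
}

def transfer_lr(module: str, lr: float, width_mult: float) -> float:
    """Rescale a per-module learning rate for a wider model.

    Under a muP-like rule with Adam (assumed here), matrix-like parameters
    have their learning rate shrink as 1 / width_mult, while embedding-like
    parameters keep their base learning rate.
    """
    if module in ("attention", "mlp", "readout"):
        return lr / width_mult
    return lr

width_mult = TARGET_WIDTH / BASE_WIDTH
transferred_lr = {m: transfer_lr(m, lr, width_mult) for m, lr in base_lr.items()}

if __name__ == "__main__":
    for module, lr in transferred_lr.items():
        print(f"{module:>10}: base {base_lr[module]:.1e} -> transferred {lr:.1e}")
```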

