Whereas scaling laws for Large Language Models (LLMs) traditionally focus on proxy metrics such as pretraining loss, predicting downstream task performance has been considered unreliable. This paper challenges that view by proposing a direct framework to model the scaling of benchmark performance from the training budget. We find that, for a fixed token-to-parameter ratio, a simple power law can accurately describe the scaling behavior of log accuracy on several widely used downstream tasks. Our results show that this direct approach extrapolates better than the previously proposed two-stage procedure, which is prone to compounding errors. Furthermore, we introduce functional forms that predict accuracy across token-to-parameter ratios and account for inference compute under repeated sampling. We validate our findings on models with up to 17B parameters trained on up to 350B tokens across two dataset mixtures. To support reproducibility and encourage future research, we release the complete set of pretraining losses and downstream evaluation results.
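
As a rough illustration of the direct approach described above (a hedged sketch; the symbols and parameterization below are assumptions for exposition, not the paper's exact fit), the per-task scaling at a fixed token-to-parameter ratio can be written as a power law in the training compute budget:

\[
-\log \mathrm{Acc}(C) \;\approx\; a \, C^{-b},
\]

where $C$ denotes the training compute budget and $a, b > 0$ are task-specific constants obtained by fitting to observed benchmark accuracies.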
** Work done while at Apple

