This paper was accepted at the Workshop on Memory for LLM-Based Agentic Systems at ICLR.
Language models have consistently grown to compress more world knowledge into their parameters, but the knowledge that can be pretrained into them is upper-bounded by their parameter size. In particular, the capacity of Small Language Models (SLMs) is limited, leading to factually incorrect generations. This problem is often mitigated by giving the SLM access to an external source: the ability to query a larger model, documents, or a database. Under this setting, we study the fundamental question of which tokens an SLM can and should learn during pretraining, versus which it should delegate via a special token. We find that this is not merely a question of loss: although the loss is predictive of whether a predicted token mismatches the ground truth, some tokens are acceptable in that they are truthful alternative continuations of a pretraining document, and should not trigger delegation even when their loss is high. We find that a spaCy grammar parser can help augment the loss signal to decide which tokens the SLM should learn to delegate to prevent factual errors, and which are safe to learn and predict even under high losses. We propose LaCy, a novel pretraining method based on this token-selection philosophy. Our experiments demonstrate that LaCy models successfully learn which tokens to predict and where to delegate for help. This results in higher FactScores when generating in a cascade with a larger model, and outperforms Rho- or LLM-judge-trained SLMs while being simpler and cheaper.
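To make the token-selection idea concrete, here is a minimal sketch of one plausible decision rule, not the paper's exact method: a token is marked for delegation only when its pretraining loss is high *and* its part-of-speech tag (as a spaCy pipeline would assign it) suggests fact-bearing content; high-loss grammatical tokens with many truthful alternatives remain safe to learn. The tag set `FACTUAL_TAGS`, the threshold, and the function names are all illustrative assumptions.

```python
# Hypothetical illustration of combining a loss signal with a grammar
# parse to choose which tokens an SLM should delegate. The POS tags
# would come from spaCy, e.g. [t.pos_ for t in nlp(text)].
FACTUAL_TAGS = {"PROPN", "NUM"}  # assumed fact-bearing tags: proper nouns, numbers

def select_tokens(tokens, pos_tags, losses, loss_threshold=2.0):
    """Label each token 'delegate' or 'learn'.

    tokens, pos_tags, losses are parallel lists; a token is delegated
    only if it is both hard (high loss) and fact-bearing (POS tag),
    so high-loss but grammatically interchangeable tokens stay learnable.
    """
    labels = []
    for tok, pos, loss in zip(tokens, pos_tags, losses):
        if loss >= loss_threshold and pos in FACTUAL_TAGS:
            labels.append("delegate")  # likely factual error: hand off
        else:
            labels.append("learn")     # a truthful alternative continuation is acceptable
    return labels
```

Under this rule, a high-loss proper noun (e.g. a birth year) is delegated, while a high-loss but semantically flexible verb is still learned, matching the observation that loss alone over-triggers delegation.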
- † University of Cambridge
- ** Work done while at Apple