What research can be pursued with small models trained to complete real programs? Typically, researchers study program synthesis via large language models (LLMs), which introduces issues such as understanding what is in or out of distribution, understanding fine-tuning effects, understanding the effects of tokenization, and a greater demand on compute and storage to carry out experiments. We present a system called Cadmus which includes an integer virtual machine (VM), a dataset composed of real programs spanning a variety of tasks, and an autoregressive transformer model that is trained for under $200 of compute cost. The system can be used to study program completion, out-of-distribution representations, inductive reasoning, and instruction following in a setting where researchers have effective and inexpensive fine-grained control over the training distribution and the ability to inspect and instrument models. Smaller models working on complex reasoning tasks enable instrumentation and investigations that may be prohibitively expensive on larger models. To demonstrate that these tasks are complex enough to be of interest, we show that Cadmus models outperform GPT-5 (reaching 100% accuracy where GPT-5 reaches 95%) even on a simple task of completing correct integer arithmetic programs in our domain-specific language (DSL), while providing transparency into the dataset's relationship to the problem. We also show that GPT-5 brings unknown priors into its reasoning process when solving the same tasks, demonstrating a confounding factor that precludes the use of large-scale LLMs for some investigations where the training set's relationship to the task must be fully understood.
** Work done while at Apple
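
For intuition only, the following is a minimal sketch of the kind of exact-match program-completion evaluation the abstract describes. The Cadmus DSL and VM are not specified here, so this sketch substitutes plain Python integer arithmetic and an invented `complete_fn` interface; none of these names come from the paper.

```python
# Illustrative sketch only: the Cadmus DSL and integer VM are not shown
# in the abstract, so Python integer arithmetic stands in for them here.

def evaluate_completions(examples, complete_fn):
    """Score a model's program completions by exact execution match.

    examples    -- list of (prefix, expected_value) pairs, where `prefix`
                   is a partial integer-arithmetic program (hypothetical).
    complete_fn -- callable mapping a prefix to the model's completion.
    """
    correct = 0
    for prefix, expected in examples:
        program = prefix + complete_fn(prefix)
        try:
            # A real harness would execute the program on the integer VM;
            # eval() over bare arithmetic stands in for that step.
            result = eval(program, {"__builtins__": {}})
        except Exception:
            result = None  # ill-formed completions count as incorrect
        correct += int(result == expected)
    return correct / len(examples)

# Hypothetical usage: a trivial "model" that always appends " 4".
examples = [("(2 + 3) *", 20), ("10 -", 6)]
accuracy = evaluate_completions(examples, lambda prefix: " 4")
print(f"accuracy = {accuracy:.0%}")  # both completions execute to the target
```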

