Diffusion Language Models (DLMs) have emerged as a promising new paradigm for text generative modeling, potentially addressing limitations of autoregressive (AR) models. However, current DLMs have been studied at a smaller scale compared to their AR counterparts and lack fair comparison on language modeling benchmarks. Additionally, training diffusion models from scratch at scale remains challenging. Given the prevalence of open-source AR language models, we propose adapting these models to build text diffusion models. We demonstrate connections between AR and diffusion modeling objectives and introduce a simple continual pre-training approach for training diffusion models. Through systematic evaluation on language modeling, reasoning, and commonsense benchmarks, we show that we can convert AR models ranging from 127M to 7B parameters (GPT2 and LLaMA) into the diffusion models DiffuGPT and DiffuLLaMA, using less than 200B tokens for training. Our experimental results reveal that these models outperform earlier DLMs and are competitive with their AR counterparts. We release a suite of DLMs (127M, 355M, and 7B) capable of generating fluent text, performing in-context learning, filling in the middle without prompt re-ordering, and following instructions.
† The University of Hong Kong
‡ University of Illinois at Urbana-Champaign
§ Tencent AI Lab