Diffusion models have become the dominant approach for visual generation. They are trained by denoising a Markovian process that gradually adds noise to the input. We argue that the Markovian property limits the model's ability to fully utilize the generation trajectory, leading to inefficiencies during training and inference. In this paper, we propose DART, a transformer-based model that unifies autoregressive (AR) and diffusion within a non-Markovian framework. DART iteratively denoises image patches spatially and spectrally using an AR model with the same architecture as standard language models. DART does not rely on image quantization, enabling more effective image modeling while maintaining flexibility. Moreover, DART seamlessly trains with both text and image data in a unified model. Our approach demonstrates competitive performance on class-conditioned and text-to-image generation tasks, offering a scalable, efficient alternative to traditional diffusion models. Through this unified framework, DART sets a new benchmark for scalable, high-quality image synthesis.
† Work done during an internship at Apple.
‡ The Chinese University of Hong Kong
§ Mila
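
To make the non-Markovian idea in the abstract concrete, the sketch below shows a decoder-style transformer that denoises image patches while attending to the entire generation trajectory, rather than only the previous step as in a Markovian diffusion sampler. This is a minimal illustration under assumed names, shapes, and hyperparameters; it is not the authors' DART implementation.

# Minimal PyTorch sketch of non-Markovian autoregressive denoising
# (illustrative assumptions only; not the authors' DART implementation).
import torch
import torch.nn as nn


class NonMarkovianDenoiser(nn.Module):
    """Predicts the next, less noisy patches from the full trajectory so far."""

    def __init__(self, patch_dim: int = 256, d_model: int = 512,
                 n_heads: int = 8, n_layers: int = 6):
        super().__init__()
        self.embed = nn.Linear(patch_dim, d_model)              # patch -> token
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)  # LM-style stack
        self.readout = nn.Linear(d_model, patch_dim)             # token -> patch

    def forward(self, trajectory: torch.Tensor) -> torch.Tensor:
        # trajectory: (batch, steps * num_patches, patch_dim), ordered from most
        # to least noisy. The causal mask lets every position attend to ALL
        # earlier positions in the trajectory, not just the previous step,
        # which is the non-Markovian part.
        seq_len = trajectory.size(1)
        causal = torch.triu(torch.full((seq_len, seq_len), float("-inf")),
                            diagonal=1)
        hidden = self.backbone(self.embed(trajectory), mask=causal)
        return self.readout(hidden)


# Usage: start from pure noise and grow the trajectory step by step.
model = NonMarkovianDenoiser()
num_patches = 16
x_t = torch.randn(1, num_patches, 256)            # pure-noise patches
trajectory = x_t
for _ in range(4):                                # a few denoising steps
    pred = model(trajectory)[:, -num_patches:]    # next, less noisy estimate
    trajectory = torch.cat([trajectory, pred], dim=1)

Keeping the whole trajectory in the context window is what distinguishes this setup from a Markovian diffusion step, where only the current noisy image is fed back into the denoiser.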