Diffusion transformers have been widely adopted for text-to-image synthesis. While scaling these models up to billions of parameters shows promise, the effectiveness of scaling beyond current sizes remains underexplored and challenging. By explicitly exploiting the computational heterogeneity of image generation, we develop a new family of Mixture-of-Experts (MoE) models (EC-DIT) for diffusion transformers with expert-choice routing. EC-DIT learns to adaptively optimize the compute allocated to understanding the input texts and generating the respective image patches, enabling heterogeneous computation aligned with varying text-image complexities. This heterogeneity provides an efficient way of scaling EC-DIT up to 97 billion parameters and achieving significant improvements in training convergence, text-to-image alignment, and overall generation quality over dense models and conventional MoE models. Through extensive ablations, we show that EC-DIT demonstrates superior scalability and adaptive compute allocation by recognizing varying textual importance through end-to-end training. Notably, in text-to-image alignment evaluation, our largest models achieve a state-of-the-art GenEval score of 71.68% and still maintain competitive inference speed with intuitive interpretability.
†Work done during an Apple internship.
‡Georgia Institute of Technology
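To make the expert-choice routing idea concrete: in expert-choice MoE, each expert selects its top-scoring tokens (rather than each token selecting experts), so tokens the router scores highly can be processed by several experts while others receive less compute. The following is a minimal toy sketch in NumPy under that general scheme, not the paper's implementation; all sizes, the linear "experts", and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumptions, not the paper's configuration).
n_tokens, d_model, num_experts, capacity = 8, 16, 4, 2

x = rng.normal(size=(n_tokens, d_model))          # token representations
w_gate = rng.normal(size=(d_model, num_experts))  # router weights
# Stand-in "experts": simple linear maps instead of full FFN blocks.
experts = [rng.normal(size=(d_model, d_model)) for _ in range(num_experts)]

# Router: softmax over experts for each token.
logits = x @ w_gate
scores = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

out = np.zeros_like(x)
for e in range(num_experts):
    # Expert-choice step: expert e picks its top-`capacity` tokens by score,
    # so per-token compute varies with how many experts select each token.
    chosen = np.argsort(scores[:, e])[-capacity:]
    out[chosen] += scores[chosen, e][:, None] * (x[chosen] @ experts[e])
```

Because experts choose tokens, the total compute per layer is fixed (`num_experts * capacity` token-expert pairs) while its distribution across tokens is learned end-to-end, which is the adaptive-allocation property the abstract describes.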