Conditional diffusion models appear capable of compositional generalization, i.e., generating convincing samples for out-of-distribution combinations of conditioners, but the mechanisms underlying this ability remain unclear. To make this concrete, we study length generalization, the ability to generate images with more objects than seen during training. In a controlled CLEVR setting (Johnson et al., 2017), we find that length generalization is achievable in some cases but not others, suggesting that models only sometimes learn the underlying compositional structure. We then investigate locality as a structural mechanism for compositional generalization. Prior works proposed score locality as a mechanism for creativity in unconditional diffusion models (Kamb & Ganguli, 2024; Niedoba et al., 2024), but did not address flexible conditioning or compositional generalization. In this paper, we prove an exact equivalence between a specific compositional structure (“conditional projective composition”) (Bradley et al., 2025) and scores with sparse dependencies on both pixels and conditioners (“local conditional scores”). This theory also extends to feature-space compositionality. We validate our theory empirically: CLEVR models that succeed at length generalization exhibit local conditional scores, while those that fail do not. Moreover, we show that a causal intervention explicitly enforcing local conditional scores restores length generalization in a previously failing model. Finally, we investigate feature-space compositionality in color-conditioned CLEVR, and find preliminary evidence of compositional structure in SDXL.
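To fix intuition, here is a minimal sketch of the “local conditional scores” property described above, under our own notation (the neighborhood maps $N(i)$ and $S(i)$ are illustrative, not taken from the paper): each coordinate of the conditional score depends only on a local pixel neighborhood and a sparse subset of the conditioners,
\[
\nabla_{x_i} \log p_t(x \mid c_1, \dots, c_K) \;=\; f_i\!\left(x_{N(i)},\, c_{S(i)},\, t\right),
\]
where $N(i)$ is a set of pixels near pixel $i$ and $S(i) \subseteq \{1, \dots, K\}$ indexes the conditioners relevant to that region. The claimed equivalence is that scores of this sparse form are exactly those consistent with conditional projective composition (Bradley et al., 2025).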

