Video-conditioned sound and speech generation, encompassing video-to-sound (V2S) and visual text-to-speech (VisualTTS) tasks, has conventionally been addressed as separate problems, with limited exploration of unifying them within a single framework. Existing attempts to unify V2S and VisualTTS face challenges in handling distinct condition types (e.g., heterogeneous video and transcript conditions) and require complex training stages. Unifying these two tasks remains an open problem. To bridge this gap, we present VSSFlow, which seamlessly integrates both V2S and VisualTTS tasks into a unified flow-matching framework. VSSFlow uses a novel condition aggregation mechanism to handle distinct input signals. We find that cross-attention and self-attention layers exhibit different inductive biases when introducing conditions. Accordingly, VSSFlow leverages these inductive biases to effectively handle different representations: cross-attention for ambiguous video conditions and self-attention for more deterministic speech transcripts. Furthermore, contrary to the prevailing belief that joint training on the two tasks requires complex training strategies and may degrade performance, we find that VSSFlow benefits from end-to-end joint learning for sound and speech generation without additional designs on training stages. Detailed analysis attributes this to the learned general audio prior shared between the tasks, which accelerates convergence, enhances conditional generation, and stabilizes the classifier-free guidance process. Extensive experiments demonstrate that VSSFlow surpasses state-of-the-art domain-specific baselines on both V2S and VisualTTS benchmarks, underscoring the critical potential of unified generative models.
- † Renmin University of China
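The abstract contrasts two routes for injecting conditions into the flow-matching backbone: cross-attention, where the (ambiguous) video features serve only as keys and values, and self-attention, where the (deterministic) transcript tokens are concatenated with the audio latents and attended jointly. The paper does not specify the implementation, so the sketch below is a minimal NumPy illustration of the two mechanisms under assumed shapes; all function names (`attention`, `cross_attend`, `self_attend_with_prefix`) are hypothetical.

```python
import numpy as np

def attention(q, k, v):
    # Scaled dot-product attention with a numerically stable softmax.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ v

def cross_attend(audio, video):
    # Video condition enters only as keys/values: audio latents query
    # the video features, so the condition modulates but never mixes
    # directly into the token stream (suited to ambiguous conditions).
    return audio + attention(audio, video, video)

def self_attend_with_prefix(audio, transcript):
    # Transcript condition is concatenated as prefix tokens and the whole
    # sequence self-attends, letting the deterministic text tokens align
    # tightly with audio positions; only audio positions are returned.
    x = np.concatenate([transcript, audio], axis=0)
    y = x + attention(x, x, x)
    return y[transcript.shape[0]:]
```

In this toy setup both mechanisms keep the audio sequence length unchanged, which is what allows a single backbone to accept either condition type.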

