VSSFlow: Unifying Video-conditioned Sound and Speech Generation via Joint Learning

Video-conditioned sound and speech generation, encompassing the video-to-sound (V2S) and visual text-to-speech (VisualTTS) tasks, is conventionally addressed as separate tasks, with limited exploration of unifying them within a single framework. Recent attempts to unify V2S and VisualTTS face challenges in handling distinct condition types (e.g., heterogeneous video and transcript conditions) and require complex training stages. Unifying these two tasks remains an open problem. To bridge this gap, we present VSSFlow, which seamlessly integrates both V2S and VisualTTS tasks into a unified flow-matching framework. VSSFlow uses a novel condition aggregation mechanism to handle distinct input signals. We find that cross-attention and self-attention layers exhibit different inductive biases when introducing conditions. VSSFlow therefore leverages these inductive biases to handle different representations effectively: cross-attention for ambiguous video conditions and self-attention for more deterministic speech transcripts. Furthermore, contrary to the prevailing belief that joint training on the two tasks requires complex training strategies and may degrade performance, we find that VSSFlow benefits from the end-to-end joint learning process for sound and speech generation without extra designs on training stages. Detailed analysis attributes this benefit to the learned general audio prior shared between tasks, which accelerates convergence, enhances conditional generation, and stabilizes the classifier-free guidance process. Extensive experiments demonstrate that VSSFlow surpasses the state-of-the-art domain-specific baselines on both V2S and VisualTTS benchmarks, underscoring the critical potential of unified generative models.
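To make the flow-matching and classifier-free guidance terms above concrete, here is a minimal sketch of guided sampling with a fixed-step Euler integrator. It is not VSSFlow's implementation: the names `guided_velocity`, `toward`, and `sample` are hypothetical, and the closed-form toy velocity field stands in for a learned network, assuming a straight-line probability path.

```python
from typing import Callable, List

Vector = List[float]
Velocity = Callable[[Vector, float], Vector]


def guided_velocity(v_cond: Vector, v_uncond: Vector, scale: float) -> Vector:
    """Classifier-free guidance: start from the unconditional velocity and
    move along the (conditional - unconditional) direction by `scale`."""
    return [u + scale * (c - u) for c, u in zip(v_cond, v_uncond)]


def toward(target: Vector) -> Velocity:
    """Toy stand-in for a learned velocity network (assumption for illustration).

    On the straight path x_t = (1 - t) * x_0 + t * target, the velocity at
    (x, t) is (target - x) / (1 - t).
    """
    def velocity(x: Vector, t: float) -> Vector:
        return [(g - xi) / (1.0 - t) for g, xi in zip(target, x)]
    return velocity


def sample(x0: Vector, cond_field: Velocity, uncond_field: Velocity,
           scale: float, steps: int = 50) -> Vector:
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt  # evaluated at the start of each step, so t stays below 1
        v = guided_velocity(cond_field(x, t), uncond_field(x, t), scale)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    noise = [0.3, -0.7]                 # stands in for a Gaussian noise sample
    conditional = toward([1.0, -2.0])   # field when the condition is given
    unconditional = toward([0.0, 0.0])  # field when the condition is dropped
    print(sample(noise, conditional, unconditional, scale=2.0))
```

With `scale = 1` the sampler follows the conditional field alone, and with `scale = 0` it follows the unconditional field alone; values above 1 extrapolate past the conditional field, which is the regime where guidance stability matters.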

† Renmin University of China
