ETVA: Evaluation of Text-to-Video Alignment via Fine-grained Question Generation and Answering

Precisely evaluating semantic alignment between text prompts and generated videos remains a challenge in Text-to-Video (T2V) generation. Existing text-to-video alignment metrics such as CLIPScore produce only coarse-grained scores without fine-grained alignment details, and they fail to align with human preference. To address this limitation, we propose ETVA, a novel Evaluation method of Text-to-Video Alignment via fine-grained question generation and answering. First, a multi-agent system parses prompts into semantic scene graphs to generate atomic questions. Then we design a knowledge-augmented multi-stage reasoning framework for question answering, in which an auxiliary LLM first retrieves relevant commonsense knowledge (e.g., physical laws), and a video LLM then answers the generated questions through a multi-stage reasoning mechanism. Extensive experiments demonstrate that ETVA achieves a Spearman's correlation coefficient of 58.47, a much higher correlation with human judgment than existing metrics, which attain only 31.0. We also construct a comprehensive benchmark specifically designed for text-to-video alignment evaluation, featuring 2k diverse prompts and 12k atomic questions spanning 10 categories. Through a systematic evaluation of 15 existing text-to-video models, we identify their key capabilities and limitations, paving the way for next-generation T2V generation. All code and datasets will be made publicly available soon.

** Work done while at Apple
† Renmin University of China
