Precisely evaluating semantic alignment between text prompts and generated videos remains a challenge in Text-to-Video (T2V) generation. Existing text-to-video alignment metrics like CLIPScore only produce coarse-grained scores without fine-grained alignment details, failing to align with human preference. To address this limitation, we propose ETVA, a novel Evaluation method of Text-to-Video Alignment via fine-grained question generation and answering. First, a multi-agent system parses prompts into semantic scene graphs to generate atomic questions. Then we design a knowledge-augmented multi-stage reasoning framework for question answering, where an auxiliary LLM first retrieves relevant commonsense knowledge (e.g., physical laws), and then a video LLM answers the generated questions through a multi-stage reasoning mechanism. Extensive experiments demonstrate that ETVA achieves a Spearman's correlation coefficient of 58.47, showing much higher correlation with human judgment than existing metrics, which reach only 31.0. We also construct a comprehensive benchmark specifically designed for text-to-video alignment evaluation, featuring 2k diverse prompts and 12k atomic questions spanning 10 categories. Through a systematic evaluation of 15 existing text-to-video models, we identify their key capabilities and limitations, paving the way for next-generation T2V generation. All code and datasets will be publicly available soon.
- ** Work done while at Apple
- † Renmin University of China
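As a rough illustration of how a question-answering metric of this kind can be compared against human judgment, the sketch below aggregates per-question yes/no answers into a video-level alignment score and computes Spearman's rank correlation with human ratings. The example records and the fraction-of-questions-satisfied aggregation are illustrative assumptions, not the paper's exact scoring rule.

```python
from scipy.stats import spearmanr

# Hypothetical per-video records: binary answers from a video LLM to the
# atomic questions generated for each prompt, plus a human alignment rating.
records = [
    {"answers": [True, True, False, True], "human": 4.0},
    {"answers": [True, False, False, False], "human": 2.0},
    {"answers": [True, True, True, True], "human": 5.0},
]

# Assumed video-level score: fraction of atomic questions judged satisfied.
metric_scores = [sum(r["answers"]) / len(r["answers"]) for r in records]
human_scores = [r["human"] for r in records]

# Spearman's rank correlation between metric scores and human judgments.
rho, _ = spearmanr(metric_scores, human_scores)
print(f"Spearman correlation: {rho:.2f}")
```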

