Egocentric Video Question Answering (QA) requires models to handle long-horizon temporal reasoning, first-person perspectives, and specialized challenges like frequent camera movement. This paper systematically evaluates both proprietary and open-source Multimodal Large Language Models (MLLMs) on QaEgo4Dv2, a refined dataset of egocentric videos derived from QaEgo4D. Four popular MLLMs (GPT-4o, Gemini-1.5-Pro, Video-LLaVa-7B, and Qwen2-VL-7B-Instruct) are assessed using zero-shot and fine-tuned approaches for both the OpenQA and CloseQA settings. We introduce QaEgo4Dv2 to mitigate annotation noise in QaEgo4D, enabling more reliable comparison. Our results show that fine-tuned Video-LLaVa-7B and Qwen2-VL-7B-Instruct achieve new state-of-the-art performance, surpassing previous benchmarks by up to +2.6% ROUGE/METEOR (for OpenQA) and +13% accuracy (for CloseQA). We also present a thorough error analysis, indicating the models' difficulty in spatial reasoning and fine-grained object recognition, key areas for future improvement.
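
As a point of reference for the OpenQA metrics mentioned above, the sketch below shows one way a predicted free-form answer could be scored against a ground-truth answer with ROUGE-L and METEOR. This is not the paper's evaluation code; it assumes the `rouge-score` and `nltk` Python packages (with the `punkt` and `wordnet` NLTK data installed), and the function name is illustrative.

```python
# Minimal sketch (not the paper's evaluation script) of scoring one
# OpenQA prediction against a reference answer with ROUGE-L and METEOR.
from rouge_score import rouge_scorer
from nltk.tokenize import word_tokenize
from nltk.translate.meteor_score import meteor_score

def score_openqa(prediction: str, reference: str) -> dict:
    # ROUGE-L F-measure between the reference answer and the prediction.
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l = scorer.score(reference, prediction)["rougeL"].fmeasure
    # METEOR expects pre-tokenized references and hypothesis (nltk >= 3.6).
    meteor = meteor_score([word_tokenize(reference)], word_tokenize(prediction))
    return {"rougeL": rouge_l, "meteor": meteor}

# Example: near-paraphrase answers score high but not perfectly.
print(score_openqa("the man picks up a knife", "a man picked up the knife"))
```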