We introduce SlowFast-LLaVA-1.5 (abbreviated as SF-LLaVA-1.5), a family of video large language models (LLMs) offering a token-efficient solution for long-form video understanding. We incorporate the two-stream SlowFast mechanism into a streamlined training pipeline, and perform joint video-image training on a carefully curated data mixture of only publicly available datasets. Our primary focus is on highly efficient model scales (1B and 3B), demonstrating that even relatively small Video LLMs can achieve state-of-the-art performance on video understanding, meeting the demand for mobile-friendly models. Experimental results demonstrate that SF-LLaVA-1.5 achieves superior performance on a wide range of video and image tasks, with strong results at all model sizes (ranging from 1B to 7B). Notably, SF-LLaVA-1.5 achieves state-of-the-art results in long-form video understanding (e.g., LongVideoBench and MLVU) and excels at small scales across various video benchmarks.
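To make the token-efficiency claim concrete, the following is a minimal conceptual sketch of two-stream SlowFast token aggregation: a slow pathway keeps a few frames at full spatial resolution, while a fast pathway keeps every frame but pools its spatial tokens aggressively. The function name, strides, and pooling choices below are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def slowfast_tokens(frames, slow_stride=4, fast_pool=4):
    """Sketch of SlowFast-style visual token aggregation.

    frames: per-frame visual features of shape (T, H, W, C).
    Slow pathway: temporally subsampled frames, full spatial detail.
    Fast pathway: all frames, spatially average-pooled.
    Strides/pool sizes here are hypothetical, for illustration only.
    """
    T, H, W, C = frames.shape
    # Slow pathway: keep every slow_stride-th frame, all spatial tokens.
    slow = frames[::slow_stride]                     # (T//slow_stride, H, W, C)
    slow_tokens = slow.reshape(-1, C)
    # Fast pathway: keep every frame, average-pool each fast_pool x fast_pool patch.
    fast = frames.reshape(T, H // fast_pool, fast_pool,
                          W // fast_pool, fast_pool, C).mean(axis=(2, 4))
    fast_tokens = fast.reshape(-1, C)
    # Concatenate both streams into one visual token sequence for the LLM.
    return np.concatenate([slow_tokens, fast_tokens], axis=0)

# Example: 16 frames of 8x8 feature maps with 32 channels.
feats = np.random.rand(16, 8, 8, 32)
tokens = slowfast_tokens(feats)
print(tokens.shape)  # (320, 32): far fewer than the 16*8*8 = 1024 raw tokens
```

Under these toy settings, the slow pathway contributes 4 frames x 64 tokens = 256 tokens and the fast pathway 16 frames x 4 pooled tokens = 64, so the LLM sees 320 visual tokens instead of 1024, which is the kind of reduction that makes long-form video input tractable.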

