We present StreamBridge, a simple yet effective framework that seamlessly transforms offline Video-LLMs into streaming-capable models. It addresses two fundamental challenges in adapting existing models to online scenarios: (1) limited capability for multi-turn real-time understanding, and (2) lack of proactive response mechanisms. Specifically, StreamBridge incorporates (1) a memory buffer combined with a round-decayed compression strategy, supporting long-context multi-turn interactions, and (2) a decoupled, lightweight activation model that can be effortlessly integrated into existing Video-LLMs, enabling continuous proactive responses. To further support StreamBridge, we construct Stream-IT, a large-scale dataset tailored for streaming video understanding, featuring interleaved video-text sequences and diverse instruction formats. Extensive experiments show that StreamBridge significantly improves the streaming understanding capabilities of offline Video-LLMs across various tasks, outperforming even proprietary models such as GPT-4o and Gemini 1.5 Pro. Simultaneously, it achieves competitive or superior performance on standard video understanding benchmarks.
† Fudan University
‡‡ Work done during Apple internship