Tokenization in video models, typically via patchification, generates an excessive and redundant number of tokens, which severely limits video efficiency and scalability. While recent trajectory-based tokenizers offer a promising solution by decoupling video length from token count, they rely on complex external segmentation and tracking pipelines that are slow and task-agnostic. We propose TrajTok, an end-to-end video tokenizer module that is fully integrated and co-trained with video models for a downstream objective, dynamically adapting its token granularity to semantic complexity, independent of video length. TrajTok incorporates a unified segmenter that performs implicit clustering over pixels in both space and time to directly produce object trajectories in a single forward pass. By prioritizing downstream adaptability over pixel-perfect segmentation fidelity, TrajTok is lightweight and efficient, yet empirically improves video understanding performance. With TrajTok, we implement a video CLIP model trained from scratch (TrajViT2). It achieves the best accuracy at scale across both classification and retrieval benchmarks, while maintaining efficiency comparable to the best token-merging methods. TrajTok also proves to be a versatile component beyond its role as a tokenizer. We show that it can be seamlessly integrated as either a probing head for pretrained visual features (TrajAdapter) or an alignment connector in vision-language models (TrajVLM), with especially strong performance in long-video reasoning.
- † University of Washington
- ‡ Allen Institute for Artificial Intelligence (AI2)
- § Woven by Toyota, Inc.

