Scaling the input image resolution is essential for enhancing the performance of Vision Language Models (VLMs), particularly in text-rich image understanding tasks. However, popular visual encoders such as ViTs become inefficient at high resolutions due to the large number of tokens and high encoding latency. At different operational resolutions, the vision encoder of a VLM can be optimized along two axes: reducing encoding latency and minimizing the number of visual tokens passed to the LLM, thereby lowering overall latency. Based on a comprehensive efficiency analysis of the interplay between image resolution, vision latency, token count, and LLM size, we introduce FastVLM, a model that achieves an optimized trade-off between resolution, latency, and accuracy. FastVLM incorporates FastViTHD, a novel hybrid vision encoder designed to output fewer tokens and significantly reduce encoding time for high-resolution images. Unlike previous methods, FastVLM achieves the optimal balance between visual token count and image resolution solely by scaling the input image, eliminating the need for additional token pruning and simplifying the model design. In the LLaVA-1.5 setup, FastVLM achieves a 3.2x improvement in time-to-first-token (TTFT) while maintaining similar performance on VLM benchmarks compared to prior works. Compared to LLaVA-OneVision at the highest resolution (1152×1152), FastVLM achieves comparable performance on key benchmarks like SeedBench and MMMU, using the same 0.5B LLM, but with 85x faster TTFT and a vision encoder that is 3.4x smaller.
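To make the token-count pressure concrete, the sketch below (not from the paper) shows how the visual token count of a standard patch-based encoder such as a ViT grows quadratically with input resolution; the patch size p = 16 is an illustrative assumption, not FastViTHD's configuration.

```latex
% Illustrative sketch: token count of a patch-based encoder
% (patch size p = 16 is an assumed value for the example).
\[
  N(H, W) = \frac{H}{p} \cdot \frac{W}{p}
\]
% With p = 16:
%   at 336 x 336:   N = 21^2 = 441 tokens
%   at 1152 x 1152: N = 72^2 = 5184 tokens  (~11.8x more)
% Every visual token must be prefilled by the LLM before the first
% output token, so quadratic token growth inflates TTFT at high
% resolution; emitting fewer tokens per image attacks this directly.
```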