Dense image captioning is crucial for cross-modal alignment in vision-language pretraining and text-to-image generation, but scaling expert-quality annotations is prohibitively costly. While synthetic captioning with strong vision-language models (VLMs) is a practical alternative, supervised distillation typically yields limited output diversity and weak generalization. Reinforcement learning (RL) could overcome these limitations, but its successes have so far been concentrated in verifiable domains that rely on deterministic checkers, a luxury unavailable in open-ended captioning. We address this bottleneck with RubiCap, a novel RL framework that derives fine-grained, sample-specific reward signals from LLM-written rubrics. RubiCap first assembles a diverse committee of candidate captions, then employs an LLM rubric writer to extract consensus strengths and diagnose deficiencies in the current policy. These insights are converted into explicit evaluation criteria, enabling an LLM judge to decompose holistic quality assessment and replace coarse scalar rewards with structured, multi-faceted evaluations. Across extensive benchmarks, RubiCap achieves the highest win rates on CapArena, outperforming supervised distillation, prior RL methods, human-expert annotations, and GPT-4V-augmented outputs. On CaptionQA, it demonstrates superior word efficiency: our 7B model matches Qwen2.5-VL-32B-Instruct, and our 3B model surpasses its 7B counterpart. Remarkably, using the compact RubiCap-3B as a captioner produces stronger pretrained VLMs than those trained on captions from proprietary models.
- † University of Wisconsin–Madison
- ** Work done while at Apple
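
The abstract describes a three-stage reward pipeline: a committee of candidate captions, an LLM rubric writer, and an LLM judge whose per-criterion scores replace a coarse scalar reward. The sketch below illustrates that flow under stated assumptions; `write_rubric`, `judge`, the fixed criteria, and the toy scoring heuristic are hypothetical placeholders standing in for LLM calls, since the abstract does not specify prompts, models, or aggregation weights.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One explicit, sample-specific evaluation criterion from the rubric."""
    description: str
    weight: float = 1.0


def write_rubric(committee: list[str]) -> list[Criterion]:
    # Hypothetical stand-in for the LLM rubric writer: in RubiCap this step
    # extracts consensus strengths and diagnosed deficiencies from the
    # committee of candidate captions. Stubbed with fixed criteria here so
    # the sketch runs end to end.
    return [
        Criterion("Mentions every salient object in the scene", weight=2.0),
        Criterion("Describes spatial relations between objects"),
        Criterion("Avoids hallucinated attributes"),
    ]


def judge(caption: str, criterion: Criterion) -> float:
    # Hypothetical stand-in for the LLM judge: score one caption against one
    # criterion in [0, 1]. A toy length heuristic substitutes for the real
    # model call.
    return min(1.0, len(caption.split()) / 50.0)


def rubric_reward(caption: str, committee: list[str]) -> float:
    """Structured, multi-faceted reward: judge the caption against each
    rubric criterion, then aggregate the weighted scores into the single
    scalar consumed by the RL update."""
    criteria = write_rubric(committee)
    total_weight = sum(c.weight for c in criteria)
    score = sum(c.weight * judge(caption, c) for c in criteria)
    return score / total_weight


if __name__ == "__main__":
    committee = ["a dog on a couch", "a brown dog sleeping on a gray sofa"]
    print(rubric_reward("a brown dog naps on a gray sofa by a window", committee))
```

The per-criterion decomposition is the point of the design: instead of asking a judge for one holistic score, each caption is graded against explicit criteria derived from the committee, which yields a denser and more interpretable reward signal for policy optimization.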

