Chain-of-thought (CoT) reasoning in vision language
models (VLMs) is crucial for improving
interpretability and trustworthiness. However,
current training recipes often rely on
datasets dominated by short annotations with
minimal rationales. In this work, we show that
training VLMs on short answers leads to poor
generalization on reasoning tasks that require
more detailed explanations. To address this limitation,
we propose a two-stage post-training
strategy that extends the usage of short answer
data for enhanced CoT reasoning. First, we
augment short answers with CoT reasoning
generated by GPT-4o, enhancing the VLM's
CoT capabilities through fine-tuning. Second,
we leverage short answers as outcome rewards
for reinforcement learning. Specifically, short
answers are used as correctness indicators to
construct positive (correct) and negative (incorrect)
pairs from model-generated reasoning
chains. These pairs are then used to calibrate
the model's reasoning via Direct Preference Optimization.
Our experiments show significant
improvements in CoT reasoning on benchmark
datasets, including enhanced generalization to
direct answer prediction. This work provides
a crucial data resource for VLM CoT training
and demonstrates the effectiveness of outcome
rewards for multimodal model post-training.
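As an illustration of the second stage, the sketch below shows one way short answers can serve as correctness indicators for building Direct Preference Optimization pairs from sampled reasoning chains. It is a minimal sketch under stated assumptions: the helper names (`sample_cot`, `extract_final_answer`), the `num_samples` setting, and the "Answer:" output format are hypothetical and not taken from the paper.

```python
# Minimal sketch of stage two: use the short answer as an outcome reward to
# split sampled reasoning chains into correct (chosen) and incorrect (rejected)
# examples for DPO. Helper names and the answer format are assumptions.
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # reasoning chain whose final answer matches the short answer
    rejected: str  # reasoning chain whose final answer does not match


def extract_final_answer(chain: str) -> str:
    # Assumes each sampled chain ends with a line like "Answer: <text>".
    for line in reversed(chain.strip().splitlines()):
        if line.lower().startswith("answer:"):
            return line.split(":", 1)[1].strip().lower()
    return ""


def build_dpo_pair(
    prompt: str,
    short_answer: str,
    sample_cot: Callable[[str], str],  # hypothetical: samples one CoT from the VLM
    num_samples: int = 8,
) -> Optional[PreferencePair]:
    """Construct one preference pair using the short answer as a correctness check."""
    correct: List[str] = []
    incorrect: List[str] = []
    for _ in range(num_samples):
        chain = sample_cot(prompt)
        if extract_final_answer(chain) == short_answer.strip().lower():
            correct.append(chain)
        else:
            incorrect.append(chain)
    if correct and incorrect:
        return PreferencePair(prompt=prompt, chosen=correct[0], rejected=incorrect[0])
    return None  # skip prompts where all samples agree; no preference signal
```

In this sketch, prompts whose sampled chains are all correct or all incorrect yield no pair, so only questions where the model's reasoning is inconsistent contribute preference data for calibration.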
- † Work done while at Apple
- ‡ Carnegie Mellon University