Large language models (LLMs) often generate hallucinations, unsupported content that undermines reliability. While most prior works frame hallucination detection as a binary task, many real-world applications require identifying hallucinated spans, which is a multi-step decision-making process. This naturally raises the question of whether explicit reasoning can help the complex task of detecting hallucination spans. To answer this question, we first evaluate pretrained models with and without Chain-of-Thought (CoT) reasoning, and show that CoT reasoning has the potential to generate at least one correct answer when sampled multiple times. Motivated by this, we propose RL4HS, a reinforcement learning framework that incentivizes reasoning with a span-level reward function. RL4HS builds on Group Relative Policy Optimization and introduces Class-Aware Policy Optimization to mitigate the reward imbalance issue. Experiments on the RAGTruth benchmark (summarization, question answering, data-to-text) show that RL4HS surpasses pretrained reasoning models and supervised fine-tuning, demonstrating the necessity of reinforcement learning with span-level rewards for detecting hallucination spans.
- † National Taiwan University, Taiwan
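To make the notion of a span-level reward concrete, the sketch below shows one plausible instantiation: an F1-style overlap score between predicted and gold hallucination spans over character offsets. This is an illustrative assumption, not the paper's actual reward definition; the function name `span_f1_reward` and the half-open `(start, end)` span convention are hypothetical.

```python
# Hypothetical sketch of a span-level reward for hallucination span detection.
# Assumes an F1-style character-overlap score; the paper's exact reward
# function may differ.

def span_f1_reward(pred_spans, gold_spans):
    """Overlap F1 between predicted and gold spans.

    Each span is a (start, end) half-open interval over character offsets.
    Returns a scalar reward in [0, 1].
    """
    # Expand each span list into the set of characters it covers.
    pred_chars = set()
    for start, end in pred_spans:
        pred_chars.update(range(start, end))
    gold_chars = set()
    for start, end in gold_spans:
        gold_chars.update(range(start, end))

    # Both empty: the model correctly predicted "no hallucination".
    if not pred_chars and not gold_chars:
        return 1.0

    overlap = len(pred_chars & gold_chars)
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_chars)
    recall = overlap / len(gold_chars)
    return 2 * precision * recall / (precision + recall)
```

A reward of this shape rewards partial credit for overlapping but imperfect spans, which a binary detection reward cannot express; e.g. `span_f1_reward([(0, 5)], [(3, 9)])` returns 4/11 for a prediction that covers part of the gold span.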

