Search-augmented large language models (LLMs) excel at knowledge-intensive tasks by integrating external retrieval.
However, they often over-search – unnecessarily invoking the search tool even when it does not improve response quality,
which leads to computational inefficiency and hallucinations from incorporating irrelevant context. In this work, we conduct a
systematic analysis of over-searching across multiple dimensions, including query types, model categories, retrieval
conditions, and multi-turn conversations. Our findings show: (i) search generally improves answer accuracy on answerable
queries but harms abstention on unanswerable ones; (ii) over-searching is more pronounced in complex reasoning models
and deep research systems, is exacerbated by noisy retrieval, and compounds across turns in multi-turn conversations; and
(iii) the composition of retrieved evidence is critical, as the presence of negative evidence improves abstention. To quantify
over-searching, we introduce Tokens Per Correctness (TPC), an evaluation metric that captures the performance-cost
trade-off for search-augmented LLMs. Finally, we investigate mitigation approaches at both the query and retrieval levels
and release the OverSearchQA benchmark to foster continued research into efficient search-augmented LLMs.
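As a rough illustration based only on the metric's name (the paper's exact formulation may differ), TPC can be read as the total token cost amortized over correct responses:

$$\mathrm{TPC} = \frac{\sum_{i=1}^{N} T_i}{\sum_{i=1}^{N} \mathbb{1}\left[\text{response } i \text{ is correct}\right]}$$

where $T_i$ denotes the tokens consumed for query $i$ (generation plus retrieved context), so a lower TPC indicates a better performance-cost trade-off.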
- † Duke University
- ** Work performed while at Apple

