This paper was accepted at the DataWorld (Data Curation) Workshop at ICML 2025.
Multimodal models are trained on large-scale web-crawled datasets, which often contain noise, bias, and irrelevant information. This motivates the use of data selection techniques, which can be divided into model-free variants, relying on heuristic rules and downstream datasets, and model-based approaches, such as those using influence functions. The former can be expensive to design and risk introducing unwanted dataset dependencies, while the latter are often computationally prohibitive. In this work, we propose an efficient, model-based approach using the Mimic Score, a new data-quality metric that leverages the weights of a reference model to assess the usefulness of individual samples for training a new model. Our method relies on measuring alignments between training gradients and a target direction induced by this reference model. Building on the derived mimic scores, we develop Grad-Mimic: a framework that prioritizes samples to learn, estimates overall sample utility, and creates effective filters. Empirically, using mimic scores to guide training improves data efficiency, accelerates convergence, yields consistent performance gains across six image datasets, and enhances CLIP models with 20.7% fewer training steps. Moreover, mimic score-based filters complement existing filtering methods, e.g., training improved CLIP models with 4.7 million fewer samples while offering accurate estimation of dataset quality.
- † University of Wisconsin–Madison
- ** Work done while at Apple
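
The abstract describes the mimic score as an alignment between a sample's training gradient and a target direction induced by a reference model. The toy sketch below illustrates one plausible reading of that idea on a linear regression problem. It is not the paper's implementation: the function names (`per_sample_gradients`, `mimic_scores`), the choice of target direction as `theta_ref - theta`, the normalization by its length, and the keep-if-positive filter are all assumptions made for illustration.

```python
import numpy as np


def per_sample_gradients(theta, X, y):
    """Gradient of 0.5 * (x @ theta - y)**2 w.r.t. theta, one row per sample."""
    residuals = X @ theta - y          # shape (n,)
    return residuals[:, None] * X      # shape (n, d)


def mimic_scores(theta, theta_ref, X, y):
    """Hypothetical mimic score: projection of each sample's negative
    gradient onto the unit vector pointing from the current weights
    toward the reference weights (an assumed definition)."""
    direction = theta_ref - theta
    norm = np.linalg.norm(direction)
    if norm == 0.0:
        # Already at the reference weights: no direction to align with.
        return np.zeros(X.shape[0])
    unit = direction / norm
    grads = per_sample_gradients(theta, X, y)
    return -(grads @ unit)


# Toy data: labels come from the reference weights, except that the
# first two samples have their labels flipped to simulate noise.
rng = np.random.default_rng(0)
theta_ref = np.array([2.0, -1.0])
X = rng.normal(size=(8, 2))
y = X @ theta_ref
y[:2] = -y[:2]

theta = np.zeros(2)                    # weights of the model being trained
scores = mimic_scores(theta, theta_ref, X, y)
keep = scores > 0                      # assumed filter: keep aligned samples
```

In this toy setup, a gradient step on a clean sample moves the weights toward the reference weights, so its score is positive, while the two label-flipped samples pull away from the reference and receive negative scores. A real multimodal setting would use per-sample gradients of the actual training loss, and the thresholding rule would likely need tuning.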