Large-scale models are pretrained on massive web-crawled datasets containing documents of mixed quality, making data filtering essential. A popular method is Classifier-based Quality Filtering (CQF), which trains a binary classifier to distinguish between pretraining data and a small, high-quality set. It assigns each pretraining document a quality score, defined as the classifier's score, and keeps only the top-scoring ones. We provide an in-depth analysis of CQF. We show that while CQF improves downstream task performance, it does not necessarily improve language modeling on the high-quality dataset. We explain this paradox by the fact that CQF implicitly filters the high-quality dataset as well. We further compare the behavior of models trained with CQF to those trained on synthetic data of increasing quality, obtained via random token permutations, and find starkly different trends. Our results challenge the view that CQF captures a meaningful notion of data quality.
- ‡ Work done while at Apple
- § University of Oxford
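The CQF procedure described above can be sketched in a few lines. This is a minimal illustrative toy, not the paper's actual pipeline: the corpora, the unigram features, the hand-rolled logistic-regression classifier, and the 50% retention threshold are all assumptions made for the example. The core idea is faithful, though: label the high-quality set 1 and the pretraining data 0, train a binary classifier, score every pretraining document with it, and keep the top-scoring fraction.

```python
import math
from collections import Counter

# Toy corpora (illustrative assumption, not the paper's data).
high_quality = [
    "the theorem follows from the lemma",
    "we prove the bound by induction",
]
pretraining = [
    "click here to win a prize now",
    "the proof of the lemma is by induction",
    "buy cheap prize deals click now",
    "we derive the bound from the theorem",
]

# Bag-of-words features over the joint vocabulary.
vocab = sorted({w for doc in high_quality + pretraining for w in doc.split()})

def features(doc):
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]

# Binary logistic regression via plain SGD:
# label 1 = high-quality set, label 0 = pretraining data.
X = [features(d) for d in high_quality + pretraining]
y = [1] * len(high_quality) + [0] * len(pretraining)
w = [0.0] * len(vocab)
b = 0.0
for _ in range(500):
    for xi, yi in zip(X, y):
        logit = sum(wj * xj for wj, xj in zip(w, xi)) + b
        p = 1.0 / (1.0 + math.exp(-logit))
        g = p - yi  # gradient of the logistic loss w.r.t. the logit
        w = [wj - 0.5 * g * xj for wj, xj in zip(w, xi)]
        b -= 0.5 * g

def quality_score(doc):
    """CQF quality score: the classifier's probability of the HQ class."""
    logit = sum(wj * xj for wj, xj in zip(w, features(doc))) + b
    return 1.0 / (1.0 + math.exp(-logit))

# Rank pretraining documents by quality score and keep the top half
# (the retention fraction is an arbitrary choice for this sketch).
ranked = sorted(pretraining, key=quality_score, reverse=True)
kept = ranked[: len(pretraining) // 2]
print(kept)
```

Documents sharing vocabulary with the high-quality set receive high scores and survive the filter, while spam-like documents are discarded; in practice the classifier is typically a fast linear text classifier applied at web scale.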

