The Data-Quality Illusion: Rethinking Classifier-Based Quality Filtering for LLM Pretraining

Large-scale models are pretrained on massive web-crawled datasets containing documents of mixed quality, making data filtering essential. A popular method is Classifier-based Quality Filtering (CQF), which trains a binary classifier to distinguish between pretraining data and a small, high-quality set. It assigns each pretraining document a quality score, defined as the classifier's score, and retains only the top-scoring documents. We provide an in-depth analysis of CQF. We show that while CQF improves downstream task performance, it does not necessarily enhance language modeling on the high-quality dataset. We explain this paradox by the fact that CQF implicitly filters the high-quality dataset as well. We further compare the behavior of models trained with CQF to those trained on synthetic data of increasing quality, obtained via random token permutations, and find starkly different trends. Our results challenge the view that CQF captures a meaningful notion of data quality.

‡ Work done while at Apple
§ University of Oxford

Figure 1: Classifier-based Quality Filtering (CQF) pipeline. A document embedding model (e.g. sBert, Arctic-Embed, or FastText) embeds documents from a high-quality (HQ) dataset and from the pretraining set. A binary classifier is trained on these embeddings to distinguish the HQ set from the pretraining set. Scores assigned by the classifier are used to rank documents from the pretraining set. The top k fraction of those documents constitutes the new filtered CQF dataset.
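The pipeline in Figure 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the hashed bag-of-words `embed` function is a toy stand-in for a real embedding model such as sBert or FastText, the classifier is assumed to be a plain logistic regression (the source only says "binary classifier"), and all function names (`embed`, `train_classifier`, `score`, `cqf_filter`) are hypothetical.

```python
import math
import random
from collections import Counter

DIM = 64  # size of the toy embedding; an arbitrary choice for this sketch


def embed(text):
    """Toy stand-in for a document embedding model: hashed bag of words, L2-normalised."""
    vec = [0.0] * DIM
    for word, count in Counter(text.lower().split()).items():
        # Deterministic hash (Python's built-in hash() is salted per process).
        idx = sum(ord(c) * (i + 1) for i, c in enumerate(word)) % DIM
        vec[idx] += count
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_classifier(hq, pretrain, epochs=200, lr=0.5):
    """Assumed classifier: logistic regression trained by SGD.
    Label 1 = document from the HQ set, label 0 = document from the pretraining set."""
    data = [(embed(d), 1.0) for d in hq] + [(embed(d), 0.0) for d in pretrain]
    w, b = [0.0] * DIM, 0.0
    rng = random.Random(0)  # fixed seed so the sketch is reproducible
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            p = _sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - y  # gradient of the log loss with respect to the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b


def score(model, text):
    """Quality score of a document = classifier probability of belonging to the HQ set."""
    w, b = model
    return _sigmoid(sum(wi * xi for wi, xi in zip(w, embed(text))) + b)


def cqf_filter(hq, pretrain, k):
    """Rank pretraining documents by score and keep the top-k fraction."""
    model = train_classifier(hq, pretrain)
    ranked = sorted(pretrain, key=lambda d: score(model, d), reverse=True)
    n_keep = max(1, int(round(k * len(ranked))))
    return ranked[:n_keep]
```

Calling `cqf_filter(hq_docs, pretrain_docs, k=0.1)` would return the 10% of `pretrain_docs` that the classifier scores highest. Note that the classifier is trained with the pretraining documents themselves as negatives, so the scores are only meaningful relative to one another, which is why the pipeline ranks documents and keeps a top fraction instead of applying a fixed threshold.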
