Recent research demonstrated that training large language models involves memorization of a significant fraction of the training data. Such memorization can lead to privacy violations when training on sensitive user data, and thus motivates the study of data memorization's role in learning.
In this work, we develop a general approach for proving lower bounds on excess data memorization, which relies on a new connection between strong data processing inequalities and data memorization. We then demonstrate that several simple and natural binary classification problems exhibit a trade-off between the number of samples available to a learning algorithm and the amount of information about the training data that a learning algorithm needs to memorize to be accurate. In particular, $\Omega(d)$ bits of information about the training data need to be memorized when $O(1)$ $d$-dimensional examples are available, a requirement that then decays as the number of examples grows, at a problem-specific rate. Further, our lower bounds are generally matched (up to logarithmic factors) by simple learning algorithms. We also extend our lower bounds to more general mixture-of-clusters models. Our definitions and results build on the work of Brown et al. (2021) and address several limitations of the lower bounds in their work.
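To make the trade-off concrete, here is a schematic restatement (not verbatim from the abstract): following Brown et al. (2021), memorization can be quantified by the mutual information $I(\mathcal{A}(S); S)$ between the output of a learning algorithm $\mathcal{A}$ and its training sample $S$ of $n$ examples; the rate function $f$ below is illustrative, since the abstract only states that the decay rate is problem-specific.

\[
I\bigl(\mathcal{A}(S);\, S\bigr) \;=\; \Omega(d) \quad \text{when } n = O(1),
\qquad
I\bigl(\mathcal{A}(S);\, S\bigr) \;=\; \Omega\!\left(\frac{d}{f(n)}\right) \quad \text{as } n \text{ grows},
\]

where $f$ is an increasing, problem-specific function.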
- * Work done while at Apple
- † Weizmann Institute of Science
- ‡ UC Berkeley