Policy gradient algorithms have driven many recent advances in language model reasoning. An appealing property is their ability to learn from exploration of their own trajectories, a process essential for fostering diverse and creative solutions. As we show in this paper, many policy gradient algorithms naturally reduce entropy (and thus the diversity of explored trajectories) over the course of training, yielding a policy increasingly limited in its ability to explore. In this paper, we argue that entropy should be actively monitored and controlled throughout training. We formally analyze the contributions of major policy gradient objectives to entropy dynamics, identify empirical factors (such as numerical precision) that significantly impact entropy behavior, and propose explicit mechanisms for entropy control. These include REPO, a family of algorithms that modify the advantage function to regulate entropy, and ADAPO, an adaptive asymmetric clipping technique. Models trained with our entropy-preserving methods maintain diversity throughout training, yielding final policies that are more performant and retain their trainability for sequential learning in new environments.
- † MIT
- ‡ Equal contribution
- ** Work done while at Apple
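To make the asymmetric-clipping idea concrete, the sketch below shows a PPO-style surrogate objective with independent lower and upper clip ranges, plus a simple entropy-driven rule for widening the upper range. This is an illustrative sketch under stated assumptions: the function names, the specific adaptation rule, and its hyperparameters are hypothetical and not the paper's exact ADAPO formulation.

```python
import numpy as np

def asymmetric_clip_objective(ratio, advantage, eps_low, eps_high):
    """PPO-style surrogate with separate lower/upper clip ranges.

    A larger eps_high than eps_low lets probability mass grow more freely
    on positively-advantaged tokens, which can counteract entropy collapse.
    (Illustrative sketch, not the paper's exact ADAPO objective.)
    """
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high)
    # Standard pessimistic min over the unclipped and clipped surrogates.
    return np.minimum(ratio * advantage, clipped * advantage)

def adapt_eps_high(entropy, target_entropy, base_eps=0.2, gain=0.5):
    """Hypothetical adaptation rule: widen the upper clip range when the
    policy's entropy falls below a target, otherwise leave it at base_eps."""
    return base_eps * (1.0 + gain * max(0.0, target_entropy - entropy))

# Example: a token whose probability ratio grew to 1.5 under a positive
# advantage is clipped at 1 + eps_high rather than the symmetric 1 + eps_low.
obj = asymmetric_clip_objective(np.array([1.5]), np.array([1.0]),
                                eps_low=0.2, eps_high=0.3)
```

A symmetric clip (`eps_low == eps_high`) recovers the standard PPO objective; the asymmetry only changes how far updates may push probability mass upward versus downward.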

