This paper was accepted on the Workshop on Regulatable ML (ReML) at NeurIPS 2025.
Current developments in AI governance and security analysis have known as for red-teaming strategies that may successfully floor potential dangers posed by AI fashions. Many of those calls have emphasised how the identities and backgrounds of red-teamers can form their red-teaming methods, and thus the sorts of dangers they’re prone to uncover. Whereas automated red-teaming approaches promise to enrich human red-teaming by enabling larger-scale exploration of mannequin conduct, present approaches don’t take into account the position of identification. As an preliminary step in direction of incorporating individuals’s background and identities in automated red-teaming, we develop and consider a novel technique, PersonaTeaming, that introduces personas within the adversarial immediate technology course of to discover a wider spectrum of adversarial methods. Specifically, we first introduce a technique for mutating prompts based mostly on both “red-teaming professional” personas or “common AI consumer” personas. We then develop a dynamic persona-generating algorithm that robotically generates varied persona varieties adaptive to completely different seed prompts. As well as, we develop a set of latest metrics to explicitly measure the “mutation distance” to enrich current range measurements of adversarial prompts. Our experiments present promising enhancements (as much as 144.1%) within the assault success charges of adversarial prompts by persona mutation, whereas sustaining immediate range, in comparison with RainbowPlus, a state-of-the-art automated red-teaming technique. We talk about the strengths and limitations of various persona varieties and mutation strategies, shedding mild on future alternatives to discover complementarities between automated and human red-teaming approaches.
- † Carnegie Mellon College
- ‡ Impartial Researcher
- ** Work carried out whereas at Apple