This paper was accepted at the Principled Design for Trustworthy AI: Interpretability, Robustness, and Safety across Modalities Workshop at ICLR 2026.
What exactly makes a given image unsafe? Systematically distinguishing between benign and problematic images is a difficult problem, as subtle changes to an image, such as an insulting gesture or symbol, can drastically alter its safety implications. However, existing image safety datasets are coarse and ambiguous, offering only broad safety labels without isolating the specific features that drive these differences. We introduce SafetyPairs, a scalable framework for generating counterfactual pairs of images that differ only in the features relevant to a given safety policy, thereby flipping their safety label. By leveraging image editing models, we make targeted changes to images that alter their safety labels while leaving safety-irrelevant details unchanged. Using SafetyPairs, we construct a new safety benchmark, which serves as a strong source of evaluation data that highlights weaknesses in vision-language models' abilities to distinguish between subtly different images. Beyond evaluation, we find our pipeline serves as an effective data augmentation strategy that improves the sample efficiency of training lightweight guard models. We release a benchmark containing over 3,020 SafetyPair images spanning a diverse taxonomy of 9 safety categories, providing the first systematic resource for studying fine-grained image safety distinctions.
- † Georgia Institute of Technology, USA
- ** Work done while at Apple
- ‡ Equal senior authorship
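
To make the counterfactual-pair idea from the abstract concrete, here is a minimal sketch, not the authors' implementation: the `edit_image` and `is_safe` callables are hypothetical stand-ins for an instruction-following image editor and a policy-conditioned safety labeler, and the retry loop is an assumption about how a label flip might be verified.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

Image = Any  # placeholder type for an image object


@dataclass
class SafetyPair:
    """A counterfactual image pair whose members differ only in
    policy-relevant features, flipping the safety label."""
    benign: Image
    unsafe: Image
    policy: str


def make_safety_pair(
    source: Image,
    policy: str,
    edit_image: Callable[[Image, str], Image],  # hypothetical image editing model
    is_safe: Callable[[Image, str], bool],      # hypothetical policy-conditioned labeler
    max_attempts: int = 3,
) -> Optional[SafetyPair]:
    """Edit a benign source image so that only the feature targeted by
    `policy` changes, then verify that the safety label actually flips."""
    if not is_safe(source, policy):
        return None  # only start from images the labeler deems benign

    instruction = f"Add only the minimal element that violates this policy: {policy}"
    for _ in range(max_attempts):
        edited = edit_image(source, instruction)
        if not is_safe(edited, policy):  # label flipped: a valid counterfactual
            return SafetyPair(benign=source, unsafe=edited, policy=policy)
    return None  # the editor failed to produce a label-flipping edit
```

Under these assumptions, verifying the flip with the same labeler that judged the source image is what keeps the pair's difference confined to the policy-relevant feature.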

