With the increased deployment of large language models (LLMs), one concern is their potential misuse for generating harmful content. Our work studies the alignment challenge, with a focus on filters to prevent the generation of unsafe information. Two natural points of intervention are the filtering of the input prompt before it reaches the model, and filtering the output after generation. Our main results demonstrate computational challenges in filtering both prompts and outputs. First, we show that there exist LLMs for which there are no efficient prompt filters: adversarial prompts that elicit harmful behavior can be easily constructed, yet are computationally indistinguishable from benign prompts for any efficient filter. Our second main result identifies a natural setting in which output filtering is computationally intractable. All of our separation results are under cryptographic hardness assumptions. In addition to these core findings, we also formalize and study relaxed mitigation approaches, demonstrating further computational limitations. We conclude that safety cannot be achieved by designing filters external to the LLM internals (architecture and weights); in particular, black-box access to the LLM will not suffice. Based on our technical results, we argue that an aligned AI system's intelligence cannot be separated from its judgment.
- † Ludwig-Maximilians-Universität München (MCML)
- ‡ University of California, Berkeley
- § JPSM, University of Maryland
- ¶ Stanford University

