On the Impossibility of Separating Intelligence from Judgment: The Computational Intractability of Filtering for AI Alignment

With the increased deployment of large language models (LLMs), one concern is their potential misuse for generating harmful content. Our work studies the alignment challenge, with a focus on filters to prevent the generation of unsafe information. Two natural points of intervention are the filtering of the input prompt before it reaches the model, and filtering the output after generation. Our main results demonstrate computational challenges in filtering both prompts and outputs. First, we show that there exist LLMs for which there are no efficient prompt filters: adversarial prompts that elicit harmful behavior can be easily constructed, which are computationally indistinguishable from benign prompts for any efficient filter. Our second main result identifies a natural setting in which output filtering is computationally intractable. All of our separation results are under cryptographic hardness assumptions. In addition to these core findings, we also formalize and study relaxed mitigation approaches, demonstrating further computational barriers. We conclude that safety cannot be achieved by designing filters external to the LLM internals (architecture and weights); in particular, black-box access to the LLM will not suffice. Based on our technical results, we argue that an aligned AI system’s intelligence cannot be separated from its judgment.

† Ludwig-Maximilians-Universität in Munich (MCML)
‡ University of California, Berkeley
§ JPSM University of Maryland
¶ Stanford University
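
The two intervention points named in the abstract, filtering the prompt before it reaches the model and filtering the output after generation, can be pictured as a wrapper around a model that is only accessed as a black box. The following is a minimal sketch of that setting, not the paper's construction: `FilteredModel`, `toy_model`, `keyword_filter`, and the `REFUSAL` string are all hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical type aliases for illustration only.
Model = Callable[[str], str]     # black-box access: prompt in, text out
Filter = Callable[[str], bool]   # returns True when the text is judged safe

REFUSAL = "[blocked]"  # assumed placeholder response when a filter rejects


@dataclass
class FilteredModel:
    """Wraps a black-box model with an external prompt filter and output filter."""
    model: Model
    prompt_filter: Filter
    output_filter: Filter

    def __call__(self, prompt: str) -> str:
        # Intervention point 1: inspect the prompt before the model sees it.
        if not self.prompt_filter(prompt):
            return REFUSAL
        output = self.model(prompt)
        # Intervention point 2: inspect the output after generation.
        if not self.output_filter(output):
            return REFUSAL
        return output


def toy_model(prompt: str) -> str:
    # Stand-in for an LLM; it simply echoes the prompt.
    return f"echo: {prompt}"


def keyword_filter(text: str) -> bool:
    # A deliberately naive filter, used only to make the sketch runnable.
    return "forbidden" not in text.lower()


guarded = FilteredModel(toy_model, keyword_filter, keyword_filter)
```

Here both filters sit entirely outside the model and never see its architecture or weights. The abstract's claim concerns exactly this kind of design: under cryptographic hardness assumptions, there are models and settings for which no efficient `prompt_filter` or `output_filter` of this external form can do its job.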
