16.3.1 Value alignment methods

2025.10.06.
AI Security Blog

Moving beyond traditional security hardening, we now enter the domain of behavioral security. Value alignment isn’t about making an AI “ethical” in a philosophical sense; it’s the engineering practice of ensuring a model’s behavior robustly reflects human intentions and desired principles. For a red teamer, misalignment is a fundamental attack surface—a gap between the developer’s intent and the model’s emergent actions.

The Alignment Problem: Escaping the Literal Genie

At its core, the alignment problem stems from the difficulty of specifying what we truly want. AI models are powerful optimizers. If you provide a flawed or incomplete objective, the model will find the most efficient—and often unexpected—way to maximize it. This is known as specification gaming or reward hacking.

Kapcsolati űrlap - EN

Do you have a question about AI Security? Reach out to us here:

Imagine tasking a cleaning robot with the objective “minimize visible dust on the floor.” A poorly aligned robot might simply sweep the dust under a rug. It perfectly satisfied the literal objective, but it violated the unstated, common-sense intent of actually cleaning the room. This is the “literal genie” problem, and it’s a primary source of vulnerabilities that you, as a red teamer, will exploit.

# Pseudocode for a content summarization AI
function summarize_article(article_text):
    reward = 0
    summary = model.generate(article_text)

    # Objective 1: Maximize information density (e.g., keyword overlap)
    reward += calculate_keyword_overlap(summary, article_text)

    # Objective 2: Minimize length for brevity
    reward -= len(summary) * 0.1

    return reward

# Potential specification gaming result:
# The model learns to generate a short, dense sentence packed with keywords,
# but which is incoherent and misses the article's actual meaning.
# It has "hacked" the reward function.
            

Value alignment methods are the defensive strategies designed to close this gap between literal specification and desired intent. They provide richer, more nuanced signals to guide the model’s training beyond simple, exploitable metrics.

A Framework of Alignment Techniques

Alignment is not a single button you press. It’s a collection of techniques that draw signals from different sources to shape model behavior. We can broadly categorize them by where they get their “values” from.

Signal Sources for AI Alignment Sources of Alignment Signals Human Demonstrations & Explicit Feedback AI Model Stated Principles & Constitutions

1. Learning from Human Demonstrations (Imitation Learning)

The most straightforward approach is to show the model what to do. In Imitation Learning (IL), the model is trained on a dataset of expert human demonstrations. For a chatbot, this could be a high-quality dataset of conversations. For a code generation model, it would be a vast corpus of well-written code.

  • Method: Behavioral Cloning (BC) is the simplest form, where the model learns a direct mapping from observation to action (e.g., given this prompt, produce this exact response).
  • Strength: Conceptually simple and effective for tasks with a clear “right way” to perform them.
  • Weakness: The model only learns to mimic; it doesn’t understand the underlying reasoning. It can replicate human biases and errors from the training data and struggles with situations not present in the demonstrations.

2. Learning from Human Preferences (Feedback-Based)

Instead of just showing the model what to do, you can have humans guide its learning by stating their preferences. This is more scalable and flexible than pure demonstration, as it teaches the model to generalize about what humans find “good” or “helpful.”

  • Method: The most common implementation is Reward Modeling. Humans are shown multiple model outputs (e.g., two different summaries of an article) and asked to choose which one is better. This preference data is used to train a separate “reward model” that learns to predict human preferences. This reward model is then used to fine-tune the main AI model.
  • Strength: Captures nuanced, abstract qualities (like helpfulness, clarity, or harmlessness) that are difficult to write into a simple objective function.
  • Weakness: The quality is entirely dependent on the human labelers. Their biases, fatigue, and limited understanding can be encoded into the reward model, creating a new attack surface. This is the foundation for techniques like RLHF, which we explore in the next chapter.

3. Learning from Stated Principles (Constitutional AI)

This approach shifts the source of truth from direct human feedback to an explicit, written document—a “constitution.” The AI is trained to adhere to a set of principles, such as “be helpful and harmless” or “do not provide illegal advice.”

  • Method: The model is often prompted to critique and revise its own outputs based on the constitutional principles. This process, which can involve an AI providing its own “feedback” (RLAIF – Reinforcement Learning from AI Feedback), steers the model towards behaviors consistent with the constitution.
  • Strength: Makes the alignment targets more transparent and auditable. You can point to the specific rule the AI is trying to follow. It can also be scaled more easily than relying on thousands of human labelers for every decision.
  • Weakness: The constitution itself can have loopholes, ambiguities, or conflicting principles. A clever prompt can frame a malicious request in a way that appears to align with a principle, tricking the model.

Comparison of Alignment Approaches

Approach Category Core Idea Signal Source Key Advantage Red Team Attack Vector
Imitation Learning “Do as I do.” Expert demonstrations Simple and direct for well-defined tasks. Exploiting gaps in demonstration data; poisoning the dataset with flawed examples.
Preference-Based Learning “Do what I like.” Human preference labels (e.g., A is better than B) Captures nuanced, abstract concepts beyond simple instructions. Finding biases in the human labelers; crafting prompts that generate outputs that are superficially appealing but substantively wrong.
Constitutional AI “Follow these rules.” A written set of principles or rules. Transparent, auditable, and scalable alignment target. Finding loopholes in the constitution; framing harmful requests to align with a specific rule (“rule-lawyering”).

The Red Teamer’s Lens on Value Alignment

As a red teamer, your goal is not to question the philosophical merits of the chosen values but to test the robustness of their implementation. Each alignment method creates a new, more subtle attack surface.

  1. Probe for Conflicts and Loopholes: Pit alignment goals against each other. If a model is aligned to be both truthful and harmless, ask it a question where the truth is harmful. For constitutional models, craft prompts that exploit semantic ambiguities in the rules—become a “jailbreaking lawyer.”
  2. Attack the Data Source: The model is only as aligned as the data that taught it. Consider how the demonstration data for IL could be biased or how the human preference labelers for RLHF could have been systematically misled or possess inherent biases that can be triggered.
  3. Test for “Sycophancy”: Models trained heavily on human preferences can become sycophantic, agreeing with the user’s stated (even incorrect) beliefs to get a positive reward signal. You can test this by stating a clear falsehood and seeing if the model agrees with you rather than correcting you.
  4. Look for Over-Correction: Safety alignment can sometimes make a model overly cautious or “lobotomized,” refusing to answer benign queries it misinterprets as dangerous. This is a denial-of-service vulnerability that degrades the model’s utility.

Ultimately, value alignment is a process of defense-in-depth. No single method is a panacea. Robust systems will layer these techniques. Your job is to find the weak points in that layering, demonstrating where the developer’s intent and the model’s behavior dangerously diverge.