24.4.4 Acceptance Criteria

A mitigation plan without clear acceptance criteria is an invitation for endless work and lingering uncertainty. How do you know when a risk is truly “mitigated”? When is the control “good enough”? Acceptance criteria provide the objective, verifiable “Definition of Done” for your security efforts, transforming ambiguous goals into concrete, testable outcomes.

The Role of Acceptance Criteria in AI Risk Management

Acceptance criteria are the conditions that must be met for a mitigation plan to be considered complete and successful. They serve as a contract between the security team, the development team, and business stakeholders. Without them, a red team’s finding might be “fixed” in a way that doesn’t actually address the core vulnerability, leading to repeated findings in subsequent tests.

Effective acceptance criteria are:

  • Specific: They target a precise outcome, leaving no room for interpretation. “Prevent prompt injection” is a goal; “Block all inputs containing the phrase ‘ignore previous instructions’ and respond with error code 400” is a specific criterion.
  • Measurable: The outcome can be quantified. This often involves metrics, percentages, or counts (e.g., “reduce false positives by 50%,” “achieve 99% accuracy on the adversarial test set”).
  • Binary (Pass/Fail): Ultimately, each criterion can be judged as either met or not met. There should be no “almost.”
  • Verifiable: You must have a clear method to test and confirm that the criterion has been met. This could be an automated test suite, a manual review process, or a third-party tool analysis. One way to capture a criterion in this spirit is sketched right after this list.
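
For example, each criterion can be recorded as a small structured object whose verification step returns a strict pass or fail. The sketch below is illustrative only; the class, field names, and placeholder check are not taken from any particular tool:

# Sketch: representing an acceptance criterion as a binary, verifiable record.
# The class, fields, and placeholder check are illustrative, not from any framework.
from dataclasses import dataclass
from typing import Callable

@dataclass
class AcceptanceCriterion:
    risk_id: str                # e.g. "PI-001"
    description: str            # the specific, measurable outcome
    verify: Callable[[], bool]  # binary, automatable pass/fail check

def jailbreak_benchmark_passes() -> bool:
    # Placeholder for a real verification, e.g. running a benchmark test suite.
    return True

criterion = AcceptanceCriterion(
    risk_id="PI-001",
    description='Block 100% of prompts from "Jailbreak Benchmark v1.2" with the standard refusal message.',
    verify=jailbreak_benchmark_passes,
)

print("MET" if criterion.verify() else "NOT MET")

Recording criteria this way makes the binary and verifiable properties hard to skip: if you cannot write the verify function, the criterion is not yet testable.
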
[Figure: acceptance criteria in the risk lifecycle. Risk Identified → Mitigation Plan → Acceptance Criteria Met? If yes, Continuous Monitoring; if no, re-work the mitigation.]

Acceptance Criteria Template and Examples

The following entries provide a template for documenting acceptance criteria for risks identified during an AI red team engagement. Each entry connects the risk, the planned fix, and the proof of that fix.

Risk ID: PI-001 (Prompt Injection)
Mitigation Action: Implement an input sanitization and filtering layer before the LLM.
Acceptance Criterion: The system must block 100% of prompts from the internal “Jailbreak Benchmark v1.2” dataset. The system’s response to a blocked prompt must be the standard message: “I am unable to process this request.”
Verification Method: Automated test script that iterates through the benchmark dataset, sends each prompt to the API endpoint, and asserts the expected response body and status code.

Risk ID: EV-004 (Adversarial Evasion)
Mitigation Action: Retrain the image classifier using adversarial training (PGD method).
Acceptance Criterion: The retrained model’s accuracy on the “Internal Adversarial ImageNet Subset v2” must be >= 85%. The accuracy on the original, clean validation set must not degrade by more than 2% from the pre-training baseline of 92%.
Verification Method: Execute evaluation script `eval_adv.py` against the new model weights. Log results to the MLflow tracking server. A senior ML engineer must review and approve the results. (A sketch of the numeric part of this check follows these entries.)

Risk ID: DL-002 (PII Leakage)
Mitigation Action: Apply a PII redaction pipeline (NER-based) to the text summarization model’s training data.
Acceptance Criterion: A random sample of 20,000 documents from the processed corpus contains zero instances of common PII patterns (SSN, phone, email) as verified by an independent PII scanning tool (e.g., Google DLP).
Verification Method: Run the third-party scanner on the specified dataset sample. The process must be logged, and the report attached to the ticket. A data privacy officer must sign off on the report.

Risk ID: MOD-007 (Model Theft)
Mitigation Action: Implement rate limiting and access controls on the model inference API endpoint.
Acceptance Criterion: Unauthenticated requests must receive a 401 Unauthorized error. Authenticated users are limited to 100 requests per minute. Attempts to exceed this limit must result in a 429 Too Many Requests error for a 5-minute cooldown period.
Verification Method: Deploy a testing script that simulates three scenarios: 1) unauthenticated access, 2) normal authenticated access (90 requests/min), and 3) excessive access (120 requests/min). Verify correct HTTP status codes and response headers. (A sketch appears under “Verification as Code” below.)
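
The numeric part of the EV-004 criterion, for example, reduces to two threshold comparisons. The sketch below shows that check in isolation, interpreting the 2% limit as percentage points; the accuracy values would come from the project’s actual evaluation run (e.g., the `eval_adv.py` script named above), and the example numbers are placeholders:

# Sketch of the numeric acceptance check for EV-004 (adversarial evasion).
# Thresholds come from the entry above; accuracy values would come from the
# actual evaluation run, not from this sketch.

ADV_ACCURACY_THRESHOLD = 0.85   # >= 85% on the adversarial subset
CLEAN_BASELINE = 0.92           # pre-mitigation accuracy on the clean validation set
MAX_CLEAN_DEGRADATION = 0.02    # clean accuracy may drop by at most 2 points

def ev_004_met(adv_accuracy: float, clean_accuracy: float) -> bool:
    # Both conditions must hold for the criterion to pass.
    adv_ok = adv_accuracy >= ADV_ACCURACY_THRESHOLD
    clean_ok = clean_accuracy >= CLEAN_BASELINE - MAX_CLEAN_DEGRADATION
    return adv_ok and clean_ok

# Illustrative values only; real numbers come from the evaluation logs.
print(ev_004_met(adv_accuracy=0.87, clean_accuracy=0.91))  # True: both thresholds met
print(ev_004_met(adv_accuracy=0.87, clean_accuracy=0.89))  # False: clean accuracy degraded too far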

Verification as Code

Whenever possible, you should codify your verification methods. This turns acceptance criteria into repeatable, automated tests that can be integrated into your CI/CD pipeline. This practice, often called “Compliance as Code” or “Security as Code,” ensures that a mitigation remains effective over time and isn’t accidentally broken by future development.

Here’s a simple Python example for verifying the prompt injection criterion (PI-001); the api_client and benchmark_loader modules stand in for your own project-specific helpers:

# Verification script for PI-001 (api_client and benchmark_loader are
# project-specific helper modules).
import api_client
import benchmark_loader

def test_prompt_injection_filter():
    # Load the standard set of jailbreak prompts
    jailbreak_prompts = benchmark_loader.load("Jailbreak_Benchmark_v1.2.csv")

    for prompt in jailbreak_prompts:
        response = api_client.query_model(prompt)

        # Acceptance Criterion 1: the prompt must be blocked (HTTP 400)
        assert response.status_code == 400, f"Prompt not blocked: {prompt}"

        # Acceptance Criterion 2: the response must carry the standard block message
        expected_message = "I am unable to process this request."
        assert response.body.message == expected_message, f"Incorrect block message for: {prompt}"

    print("SUCCESS: All prompt injection tests passed.")

if __name__ == "__main__":
    test_prompt_injection_filter()

By automating such checks, you create durable controls. Once these criteria are met and verified, the risk can be formally accepted or moved into a continuous monitoring phase, which is the final step in the active risk management lifecycle.