6.3.3. CI/CD pipeline integration

2025.10.06.
AI Security Blog

Testing your prompts and models cannot be a one-time event. An LLM’s behavior can drift with new versions, and your application’s logic around it will certainly evolve. To maintain a consistent security and quality posture, your evaluation process must be as automated and repeatable as your code compilation or unit testing. This is where integrating a tool like PromptFoo into your Continuous Integration/Continuous Deployment (CI/CD) pipeline becomes a non-negotiable practice for mature AI development.

Automating your LLM evaluations transforms them from a manual, periodic audit into a continuous, automated quality and security gate. Every time you change a system prompt, update an API call, or fine-tune a model, your test suite runs automatically, providing immediate feedback on whether the change introduced a regression or a new vulnerability.


The Rationale for Continuous Evaluation

From a compliance and risk management perspective, integrating LLM testing into your CI/CD pipeline serves several critical functions:

  • Evidence of Control: It provides auditable proof that you are systematically testing for known risks like prompt injection, jailbreaking, and harmful content generation with every change.
  • Regression Prevention: A prompt that was perfectly safe with `gpt-4-turbo-2024-04-09` might exhibit unexpected vulnerabilities with a newer model version. Automated testing catches these regressions before they reach production.
  • Developer Velocity: By automating checks, you empower developers to iterate on prompts and AI-powered features quickly, without creating a manual testing bottleneck for the security team. The pipeline becomes the first line of defense.
  • Enforcing Quality Standards: Beyond security, you can enforce standards for tone, factuality, or formatting, ensuring the user experience remains consistent.

Integrating PromptFoo into Your Pipeline

PromptFoo is a command-line tool, making it straightforward to integrate into any modern CI/CD system like GitHub Actions, GitLab CI, Jenkins, or CircleCI. The process generally involves the same core steps regardless of the platform.

A Generic CI Workflow

The logical flow for integrating PromptFoo is simple and mirrors standard software testing practices. When new code is committed to your repository, the CI server executes a series of automated steps.

Code Commit (e.g., prompt change) → CI Trigger → Setup Environment & Run PromptFoo → Evaluate Assertions → Pass / Fail Build

Example: GitHub Actions Workflow

Here is a practical example of a GitHub Actions workflow file (`.github/workflows/prompt-eval.yml`). This job triggers on every push to the `main` branch, installs PromptFoo, and runs the evaluation defined in your `promptfooconfig.yaml` file.

# .github/workflows/prompt-eval.yml
name: PromptFoo LLM Evaluation

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: '20'

      - name: Install and Run PromptFoo
        env:
          # Use GitHub secrets to store your API keys securely
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }} 
        run: |
          npx --yes promptfoo@latest eval
          # The 'eval' command will exit with a non-zero code if any tests fail

In this workflow, the critical line is `npx --yes promptfoo@latest eval`. If any test case fails an assertion defined in your configuration, PromptFoo will exit with a non-zero status code, which automatically causes the CI job to fail. This blocks merging or deploying code that degrades your AI’s safety or quality.
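The same gate translates to the other platforms mentioned earlier. As a rough sketch, a minimal GitLab CI job might look like the following; the job name `prompt-eval` is arbitrary, and it assumes `OPENAI_API_KEY` is stored as a masked CI/CD variable in your GitLab project settings:

```yaml
# .gitlab-ci.yml (sketch — assumes OPENAI_API_KEY is a masked CI/CD variable)
prompt-eval:
  image: node:20
  rules:
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event"'
    - if: '$CI_COMMIT_BRANCH == "main"'
  script:
    # 'eval' exits non-zero on any failed assertion, which fails this job
    - npx --yes promptfoo@latest eval
```

As with the GitHub Actions job, a non-zero exit code from `promptfoo eval` fails the pipeline and blocks the merge request.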

Defining Failure: Assertions and Thresholds

A CI/CD integration is only as useful as its failure conditions. You define these conditions within your `promptfooconfig.yaml` using assertions. For red teaming, you might set a global assertion that fails the entire test suite if even one jailbreak attempt is successful.

Consider the jailbreak detection setup from the previous chapter. You can add a summary assertion to your configuration to enforce a strict pass/fail outcome for the pipeline.

# promptfooconfig.yaml (snippet)
# ... other configurations ...

tests:
  - description: 'Jailbreak attempts'
    vars:
      # ... your list of jailbreak prompts ...
    assert:
      - type: llm-rubric
        value: not-jailbroken
        # This assertion applies to each individual test case

# This is the key for CI/CD integration
evaluateOptions:
  maxConcurrency: 5
  showProgressBar: false
  # Fail the entire run if the pass rate is less than 100%
  summary:
    assertions:
      - type: pass-rate
        threshold: 1.0

With `threshold: 1.0`, the `promptfoo eval` command will fail if a single test case doesn’t pass its assertions. This is a common strategy for high-stakes security tests where zero tolerance is required. For quality-based tests (e.g., checking for a specific tone), you might set a more lenient threshold like `0.95`.

Beyond Basic Integration

As your process matures, you can enhance this basic setup:

  • Test Artifacts: Configure your CI job to publish the PromptFoo output (HTML or JSON report) as a build artifact. This gives you a historical record of test runs and detailed reports for failed builds.
  • Environment-Specific Testing: Use different PromptFoo configuration files to run tests against development, staging, and production environments, potentially with different assertions for each.
  • Scheduled Runs: In addition to running on code commits, schedule a nightly run of your test suite. This helps detect model drift or changes in the provider’s API behavior that are independent of your own code changes.
  • Secure Key Management: Always use your CI/CD platform’s secret management system (like GitHub Secrets or GitLab CI/CD variables) to store API keys. Never commit them directly to your repository.
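To illustrate the test-artifact suggestion, you can write the PromptFoo report to a file and publish it from the CI job. A sketch for the GitHub Actions workflow shown earlier, assuming PromptFoo's `-o`/`--output` flag (which infers the report format from the file extension) and using `if: always()` so the report is uploaded even when the evaluation fails:

```yaml
      # Replace the earlier "Install and Run PromptFoo" step with:
      - name: Run PromptFoo and save report
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: npx --yes promptfoo@latest eval -o promptfoo-results.html

      - name: Upload evaluation report
        if: always()  # keep the report even when the eval step failed
        uses: actions/upload-artifact@v4
        with:
          name: promptfoo-report
          path: promptfoo-results.html
```

Uploading on failure is the important detail: the detailed report is most valuable precisely when the build is red and you need to see which prompts broke which assertions.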

By embedding LLM evaluation directly into your development lifecycle, you shift from reactive security auditing to proactive, continuous assurance, ensuring your AI systems remain robust and secure as they evolve.