27.1.3 Bias and Fairness Protocols

2025.10.06.
AI Security Blog

Core Concept: Fairness is not an inherent property of an algorithm but a context-dependent, socio-technical objective. For a red teamer, a “fairness protocol” is not a defensive checklist; it is a structured methodology for identifying, quantifying, and demonstrating how a system can produce discriminatory or inequitable outcomes. Your goal is to prove that the system’s behavior violates established fairness definitions, creating legal, reputational, or ethical risks.

Deconstructing Bias: The Attacker’s Taxonomy

From an adversarial perspective, bias isn’t an abstract flaw; it’s a collection of exploitable vulnerabilities in the AI lifecycle. Before you can test a system, you must understand where these vulnerabilities originate. Focus your investigation on three primary domains:

  • Data-Sourced Bias: This is the most common attack surface. The training data itself encodes societal biases, which the model learns and amplifies. Look for historical bias (data reflects past prejudice), representation bias (certain groups are under- or over-sampled), and measurement bias (proxies used for data collection are flawed for specific groups). A quick check for representation bias is sketched after this list.
  • Algorithmic Bias: The model architecture or optimization function can introduce or exacerbate bias. For example, a model optimized for overall accuracy might achieve it by sacrificing performance on a minority subgroup. Your task is to find the subgroup the model is willing to “sacrifice.”
  • Human Interaction Bias: Systems that learn from user interaction are vulnerable to feedback loops. If a system’s biased output influences user behavior, that behavior generates new data that reinforces the original bias. This is a dynamic vulnerability you can probe and potentially manipulate during a red team engagement.
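
As a quick illustration of the representation-bias check referenced above, the sketch below compares each group’s share of the training data against its share of a reference population. It is a minimal sketch assuming a pandas DataFrame of training records; the column name, group labels, and reference shares are illustrative placeholders, not values from any real dataset.

    # Minimal sketch: quantify representation bias in training data.
    # Assumes a pandas DataFrame `train_df`; the column name and the
    # reference shares are illustrative placeholders.
    import pandas as pd

    def representation_gap(train_df, column, reference_shares):
        # Observed share of each group in the training data
        observed = train_df[column].value_counts(normalize=True)
        rows = []
        for group, expected in reference_shares.items():
            actual = float(observed.get(group, 0.0))
            rows.append({"group": group,
                         "share_in_data": actual,
                         "share_in_population": expected,
                         "gap": actual - expected})
        return pd.DataFrame(rows)

    # Example call (illustrative reference shares):
    # representation_gap(train_df, "gender", {"female": 0.51, "male": 0.49})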

The Red Teamer’s Guide to Fairness Metrics

Fairness metrics are your tools for translating qualitative concerns into quantitative evidence. No single metric is universally “best”; in fact, several of them are mutually incompatible and cannot all be satisfied at once except in special cases, such as when base rates are equal across groups. Your job is to select the metric that best represents the potential harm in the system’s specific context and use it to build your case.

Key Fairness Metrics for Red Team Analysis

Demographic Parity (Statistical Parity)
    Core question: Does the model grant a positive outcome at equal rates across different groups?
    Red team application: Test whether the selection rate (e.g., loan approval, job interview offer) differs significantly between protected groups and the privileged group. This is a straightforward, powerful metric for demonstrating disparate impact.

Equal Opportunity
    Core question: Among individuals who truly deserve a positive outcome (the actual positives), does the model grant it at equal rates across groups?
    Red team application: Focus on false negatives. Are qualified candidates from one group incorrectly rejected more often than qualified candidates from another? This is critical in scenarios like medical diagnosis or fraud detection, where missing a positive case is costly.

Equalized Odds
    Core question: Does the model have an equal true positive rate AND an equal false positive rate across groups?
    Red team application: This is a stricter criterion. Test for disparities in both false negatives (as with Equal Opportunity) and false positives (e.g., are individuals from one group incorrectly flagged for review more often?). This is relevant for systems like airport security screening.

Predictive Parity
    Core question: When the model predicts a positive outcome, is the probability that the prediction is correct the same for each group?
    Red team application: Challenge the model’s confidence. If the model says “high risk” for individuals from two different groups, does that mean the same thing? Disparities here show the model’s predictions are less reliable for one group, which can lead to misallocated resources or unfair scrutiny.
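
To make these comparisons concrete, the sketch below computes the per-group rates that the four metrics are built from: selection rate (Demographic Parity), true positive rate (Equal Opportunity), false positive rate (together with TPR, Equalized Odds), and precision of positive predictions (Predictive Parity). It is a minimal sketch assuming binary NumPy arrays of labels and predictions plus a boolean group mask; the function name is illustrative rather than taken from any particular fairness library.

    # Minimal sketch: per-group rates behind the metrics in the table above.
    # Assumes binary (0/1) NumPy arrays y_true and y_pred, plus a boolean
    # mask selecting the members of one group.
    import numpy as np

    def group_rates(y_true, y_pred, group_mask):
        t, p = y_true[group_mask], y_pred[group_mask]
        return {
            "selection_rate": np.mean(p == 1),   # Demographic Parity
            "tpr": np.mean(p[t == 1] == 1),      # Equal Opportunity
            "fpr": np.mean(p[t == 0] == 1),      # Equalized Odds (with TPR)
            "ppv": np.mean(t[p == 1] == 1),      # Predictive Parity
        }

    # Compare the dictionary returned for each protected group against the
    # privileged group; a large gap in any rate points to the corresponding metric.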

A Protocol for a Fairness Audit Engagement

A structured protocol ensures your fairness testing is rigorous, repeatable, and defensible. It moves your engagement from ad-hoc probing to a systematic audit.

  1. Define Protected Attributes and Scope

    Identify the sensitive attributes to be tested (e.g., race, gender, age) based on legal requirements and contextual risk. Clearly define the “privileged” (majority) group and “unprivileged” (minority) groups for comparison. Document these definitions; they are the foundation of your entire analysis.
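
    One way to make these definitions auditable is to capture them in an explicit, version-controlled artifact at the start of the engagement. The snippet below is a hypothetical example of what such a scope record might contain; every attribute and group label shown is illustrative.

    # Hypothetical audit scope record; attributes and group labels are illustrative.
    audit_scope = {
        "protected_attributes": ["gender", "age_band"],
        "privileged_groups": {"gender": "male", "age_band": "25-40"},
        "unprivileged_groups": {"gender": ["female", "non-binary"], "age_band": ["55+"]},
        "basis": "applicable anti-discrimination law and contextual risk assessment",
    }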

  2. Select Appropriate Fairness Metrics

    Based on the system’s purpose, choose one or two primary fairness metrics. Justify your choice. For a hiring tool, Equal Opportunity might be paramount. For a content moderation system, Equalized Odds could be more relevant to balance flagging harmful content without disproportionately censoring certain groups.

  3. Data and Model Interrogation

    If you have model access, perform subgroup analysis. Slice the performance data by the defined attributes and calculate your chosen fairness metrics. If you only have black-box access, you’ll need to generate a representative input dataset and observe the output distribution. Quantify the disparity.

    # Calculate a simple disparity (selection-rate) ratio in Python.
    # predictions: NumPy array of decision labels; group_*_indices: index arrays.
    import numpy as np

    def calculate_disparity_impact(predictions, group_A_indices, group_B_indices):
        # Selection rate for the unprivileged group (A)
        outcomes_A = predictions[group_A_indices]
        positive_rate_A = np.mean(outcomes_A == 'approved')

        # Selection rate for the privileged group (B)
        outcomes_B = predictions[group_B_indices]
        positive_rate_B = np.mean(outcomes_B == 'approved')

        # Disparity ratio: how often group A receives the positive outcome
        # relative to group B (1.0 means parity)
        disparity_ratio = positive_rate_A / positive_rate_B
        return disparity_ratio
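
    As a hypothetical usage example with toy data (all values below are illustrative), note that a ratio of roughly 0.8 or lower is commonly read as a signal of disparate impact, echoing the informal four-fifths rule.

    # Hypothetical usage with toy data; decisions and index arrays are illustrative.
    predictions = np.array(['approved', 'denied', 'denied',
                            'approved', 'approved', 'approved'])
    group_A_indices = np.array([0, 1, 2])   # unprivileged group
    group_B_indices = np.array([3, 4, 5])   # privileged group

    ratio = calculate_disparity_impact(predictions, group_A_indices, group_B_indices)
    # ratio is roughly 0.33 here: group A is approved at one third the rate of group B.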

  4. Generate Adversarial Inputs to Demonstrate Causality

    To demonstrate that the attribute itself is driving the disparity, craft minimal-pair inputs. Create two near-identical inputs where only the protected attribute (or a strong proxy for it) is changed. For example, submit two resumes with identical qualifications but different names that are stereotypically associated with different ethnicities. A change in the model’s output provides powerful evidence of bias.
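
    A minimal sketch of this probe is shown below. It assumes a hypothetical black-box callable, score_resume, that returns a numeric score for a resume text; the template and name pairs are illustrative placeholders, and in practice you would aggregate the gap over many pairs rather than rely on a single comparison.

    # Minimal-pair (counterfactual) probe: only the applicant name changes.
    # score_resume is a hypothetical black-box scoring function; the resume
    # template and names are illustrative placeholders.
    RESUME_TEMPLATE = (
        "{name}\n"
        "10 years of backend engineering experience. BSc Computer Science.\n"
        "Led a team of five. Python, Go, Kubernetes."
    )

    def minimal_pair_gap(score_resume, name_a, name_b):
        score_a = score_resume(RESUME_TEMPLATE.format(name=name_a))
        score_b = score_resume(RESUME_TEMPLATE.format(name=name_b))
        return score_a - score_b

    # A systematic, directional gap across many name pairs is strong evidence
    # that the protected attribute (or its proxy) is driving the outcome.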

  5. Report Quantifiable Disparities and Contextual Harm

    Your final report must do more than present numbers. It must connect the quantified disparity (e.g., “The system has a Demographic Parity ratio of 0.75 for gender”) to the potential real-world harm (“This means that equally qualified female candidates are approved at only 75% of the rate of their male counterparts, creating significant legal and reputational risk.”). Use the metrics as evidence to tell a compelling story of impact.