AI system security is a complex and evolving frontier, rich with specific terminology and methodologies. In this AI Red Teaming Glossary, you’ll find hundreds of key terms and their explanations, essential for proactively understanding and applying AI security principles.
Adversarial Attack
An adversarial attack is a technique used to intentionally fool a machine learning model by providing it with deceptive input. These inputs, known as adversarial examples, are subtly modified to cause the model to make a misclassification or produce an incorrect output. The modifications are often imperceptible to humans but exploit the model’s learned patterns and vulnerabilities.
Adversarial Example
An adversarial example is a specialized input to an AI model that has been intentionally designed by an attacker to cause the model to make a mistake. For instance, a manipulated image that a human easily recognizes as a cat might be classified as a car by the model. These examples are critical for testing and understanding the security and robustness of AI systems.
Adversarial Prompting
Adversarial prompting is a type of prompt injection attack specifically targeting Large Language Models (LLMs). An attacker crafts a prompt that manipulates the LLM into bypassing its safety filters, revealing sensitive information, or generating harmful, biased, or otherwise unintended content. This technique exploits the model’s instruction-following capabilities against its own safety protocols.
Adversarial Robustness
Adversarial robustness is a measure of an AI model’s resilience to adversarial attacks. A model with high adversarial robustness can maintain its performance and accuracy even when faced with intentionally manipulated, deceptive inputs. Improving robustness is a key goal in ML security and is often achieved through methods like adversarial training.
Adversarial Training
Adversarial training is a defense method designed to improve a model’s robustness against adversarial attacks. It involves augmenting the model’s training dataset with adversarial examples, thereby teaching the model to correctly classify them during the training process. This exposure helps the model learn more resilient features and become less susceptible to small, malicious perturbations.
AI Governance
AI Governance refers to the development of frameworks, policies, laws, and norms to guide the responsible development and deployment of artificial intelligence systems. It aims to ensure that AI technologies are aligned with societal values, mitigate risks, and establish clear lines of accountability. This includes addressing issues of security, privacy, fairness, and transparency.
AI Safety
AI Safety is a multidisciplinary research field focused on ensuring that artificial intelligence systems do not cause harm, either accidentally or intentionally. It addresses potential risks ranging from short-term issues like algorithmic bias and security vulnerabilities to long-term concerns about highly capable or autonomous AI. The core objective is to develop design principles and techniques for building robust, beneficial, and controllable AI.
Alignment
In the context of AI, alignment is the process of ensuring an AI system’s goals and behaviors are consistent with human values and intentions. A well-aligned AI will pursue the objectives its creators intended, avoiding unintended harmful consequences and acting in a beneficial manner. Misalignment is a primary concern in AI safety, especially as systems become more autonomous and capable.
Amplification of Bias
Amplification of bias occurs when an AI model learns and then exaggerates biases present in its training data. This can result in outputs that are more discriminatory or unfair than the original data itself, leading to harmful societal impacts. Identifying and mitigating bias amplification is a critical challenge in AI ethics and responsible AI development.
Anomaly Detection
Anomaly detection is the process of identifying data points, events, or observations that deviate significantly from a dataset’s normal behavior. In AI security, this technique is used to detect potential threats such as malicious inputs, data poisoning attempts, or unusual model activity that could indicate an attack. It serves as a critical component of monitoring and defense systems for AI.
Attack Surface
The attack surface of an AI system represents all the points where an unauthorized user or attacker can attempt to inject data, extract information, or influence its behavior. This includes data input pipelines, model APIs, user-facing interfaces, and the underlying infrastructure. A comprehensive security strategy involves identifying and minimizing this attack surface to reduce vulnerabilities.
Attack Vector
An attack vector is the specific path or method an adversary uses to exploit a vulnerability in an AI system. Common attack vectors include prompt injection, data poisoning, model inversion, and evasion attacks. Understanding potential attack vectors is essential for designing effective security controls and red teaming exercises.
Attribute Inference Attack
An attribute inference attack is a type of privacy attack where an adversary attempts to deduce sensitive personal attributes about individuals whose data was used to train a model. By querying the model and analyzing its outputs, the attacker can infer private information like age, location, or health status, even if that data is not directly exposed. This attack highlights the privacy risks inherent in deploying machine learning models.
Auditing (AI)
AI auditing is the systematic and independent examination of an AI system to assess its compliance with specific standards, policies, or ethical guidelines. This process evaluates aspects such as fairness, security, transparency, privacy, and performance to ensure the system is behaving as intended and not causing harm. Audits provide a mechanism for accountability and trust in AI deployments.
Automated Red Teaming
Automated red teaming involves using AI systems to automatically discover vulnerabilities, biases, and failure modes in other AI models. Instead of relying solely on human testers, this approach leverages algorithms to generate a wide range of adversarial inputs or prompts at scale. This allows for more comprehensive and efficient testing of a model’s security and safety guardrails.
Backdoor Attack
A type of adversarial attack where a hidden trigger is embedded into a machine learning model during training. The model functions normally on standard inputs but produces a specific, malicious output when the trigger (e.g., a specific phrase, image, or pixel pattern) is present in the input. This compromises the model’s integrity by creating a secret, exploitable vulnerability.
Bias Amplification
The phenomenon where an AI model exacerbates and reinforces existing biases present in its training data. The model’s predictions can become more skewed or stereotypical than the underlying data, leading to unfair, discriminatory, or ethically problematic outcomes in real-world applications. This is a critical concern in AI ethics and fairness assessments.
Black-Box Attack
An adversarial attack scenario where the attacker has no knowledge of the target model’s internal architecture, parameters, or training data. The attack is conducted by repeatedly querying the model with different inputs and observing the corresponding outputs to infer its decision boundaries. This method is used to test the security of models exposed via APIs.
Boundary Attack
A type of adversarial attack that aims to find the minimal perturbation required to push an input just across a model’s decision boundary, causing a misclassification. These attacks are highly efficient and focus on generating subtle, hard-to-detect adversarial examples. They are valuable for understanding the precise geometry of a model’s vulnerabilities.
Brute-Force Prompting
A red teaming technique involving the systematic submission of a large volume of diverse, often algorithmically generated, prompts to an LLM. The objective is to discover vulnerabilities, elicit unintended behaviors, or bypass safety filters through sheer trial and error. This method helps identify edge cases and unexpected model responses that targeted attacks might miss.
Bypass
The successful circumvention of a security control, safety filter, or alignment mechanism within an AI system. In LLM security, this refers to crafting prompts or inputs that evade content moderation policies or safety guardrails to generate prohibited outputs. A successful bypass represents a direct failure of the model’s intended safety features.
Bayesian Poisoning Attack
A sophisticated data poisoning attack that specifically targets Bayesian machine learning models. The attacker injects meticulously crafted data points into the training set to manipulate the posterior distribution of the model’s parameters. This can lead to targeted misclassifications or a general degradation of the model’s performance and reliability.
Behavioral Cloning Attack
An attack where an adversary trains a surrogate model to mimic the functionality of a target black-box model. By repeatedly querying the target model and using the input-output pairs as training data, the attacker can create a local copy. This clone can then be used for model extraction, intellectual property theft, or crafting more effective adversarial attacks.
Baseline Model Security
The fundamental set of security controls and practices established as a minimum standard for protecting an AI model throughout its lifecycle. This includes essential measures for data privacy, access control, model integrity verification, and monitoring against common threats. Establishing a strong baseline is the first step in building a robust AI security posture.
Blind Spot Analysis
A red teaming methodology focused on systematically identifying and exploring inputs or contexts where an AI model’s performance is weak, unpredictable, or fails unexpectedly. These “blind spots” represent potential vulnerabilities that an adversary could exploit. This analysis is crucial for understanding the limitations and potential failure modes of an AI system.
Bug Bounties for AI
Security programs that offer financial rewards to ethical hackers and researchers for discovering and responsibly disclosing vulnerabilities in AI systems. These programs incentivize external security testing and help organizations proactively identify and patch weaknesses, such as prompt injection flaws, data leakage, or model evasion techniques, before they can be exploited maliciously.
Break-It-Build-It-Fix-It Cycle
An iterative AI safety and security development process used to improve system robustness. The “Break-It” phase involves red teaming to find flaws, the “Build-It” phase focuses on developing the model and its defenses, and the “Fix-It” phase involves patching discovered vulnerabilities. This cycle promotes continuous improvement by integrating adversarial testing directly into the development loop.
Bounded Rationality
A concept applied to AI safety that describes models making decisions that are optimal only within the constraints of their limited computational resources and available information. Understanding these bounds is critical for predicting and mitigating potentially harmful behaviors that arise from the model’s inherent limitations. It helps explain why a powerful AI might pursue a dangerous or nonsensical strategy to achieve a specified goal.
Blue Teaming (for AI)
The defensive counterpart to red teaming, where a dedicated team is responsible for defending an AI system against attacks and ensuring its resilience. Their responsibilities include implementing security controls, continuous monitoring for anomalous activity, analyzing model logs for signs of attack, and responding to security incidents. Blue teaming is essential for operationalizing AI security and maintaining a defensive posture.
Baiting Prompt
A type of prompt injection attack where the initial part of the prompt appears benign to lure the LLM into a specific context or compliant state. A subsequent, hidden instruction within the same prompt then exploits this context to execute a malicious command or bypass safety filters. This technique leverages the model’s conversational flow to subvert its security mechanisms.
Box-Constrained Attack
A common method for generating adversarial examples where the perturbation added to an input is limited within a pre-defined range for each feature (e.g., each pixel’s color value). This constraint, often defined by an L-infinity norm, ensures the resulting adversarial input remains semantically and visually close to the original. This makes the attack stealthier and more realistic for practical scenarios.
Chain-of-Thought Hijacking
An adversarial attack where a malicious prompt manipulates a large language model’s intermediate reasoning steps. By injecting flawed logic or false premises into the model’s chain-of-thought process, an attacker can steer it toward an incorrect or harmful conclusion, even if the initial query appears benign.
Compliance Auditing
The systematic evaluation of an AI system to ensure it adheres to relevant laws, regulations, ethical standards, and organizational policies. This process involves assessing data handling practices, model fairness, transparency, and accountability to mitigate legal and reputational risks. It is a critical component of AI governance and responsible deployment.
Conceptual Jailbreak
A sophisticated prompt injection technique that uses metaphors, analogies, or abstract scenarios to circumvent an LLM’s safety filters. Instead of directly requesting prohibited content, the user frames the request within a fictional or conceptual context, tricking the model into generating the desired output by mapping the harmful request to the safe-seeming scenario.
Confidentiality Attack
A category of attacks aimed at extracting sensitive or private information from a machine learning model or its training data. Common methods include membership inference attacks, which determine if a specific data point was used in training, and model inversion attacks, which reconstruct sensitive training data from the model’s outputs.
Confidence Score Analysis
The practice of evaluating a model’s confidence levels for its predictions to identify potential security vulnerabilities or adversarial manipulation. Abnormally high or low confidence scores can signal that an input may be adversarial or that the model is operating in an uncertain region of its decision space, making it susceptible to errors.
Content Filter Evasion
The act of intentionally crafting inputs or prompts to bypass an AI model’s safety mechanisms designed to block harmful, inappropriate, or restricted content. Red teamers use various techniques, such as obfuscation, role-playing scenarios, or character-level perturbations, to test the robustness and limitations of these safety guardrails.
Context Poisoning
An attack that corrupts the short-term memory or context window of a conversational AI or a system using Retrieval-Augmented Generation (RAG). An attacker injects false or misleading information into the ongoing conversation or retrieved documents, causing the model to generate inaccurate or malicious responses based on the poisoned context.
Contrastive Security Evaluation
A red teaming methodology where an analyst compares a model’s responses to a benign prompt versus a slightly modified, adversarial version of that prompt. This A/B testing approach helps isolate the specific phrases, keywords, or structural changes that trigger a security failure, revealing precise vulnerabilities in the model’s alignment or filtering.
Counterfactual Explanation
An explanation that describes the smallest change to an input that would alter a model’s decision to a different, predefined outcome. In a security context, generating counterfactuals can help red teamers understand a model’s decision boundaries and discover subtle, effective adversarial perturbations that can flip a safety classification from harmful to benign.
Covert Channel Attack
A security exploit where an attacker uses an AI model as an unconventional medium for exfiltrating data without detection. For instance, an attacker could manipulate a model’s outputs (e.g., the specific word choices in generated text) to encode and transmit sensitive information in a way that appears normal to human observers but is decodable by a malicious actor.
Crafted Adversarial Prompt
A highly engineered input specifically designed to exploit a known or hypothesized vulnerability in an LLM’s reasoning, logic, or instruction-following capabilities. Unlike simple jailbreaks, these prompts often involve complex, multi-turn interactions or logical puzzles that target the fundamental architecture of the model to elicit unintended behavior.
Cross-Function Contamination
A vulnerability in multi-tool or multi-modal AI systems where an attacker exploits one integrated tool (e.g., a web search plugin) to manipulate or compromise another (e.g., a code interpreter). This attack vector allows for privilege escalation by leveraging the trusted relationship between different functions within the AI agent’s ecosystem.
Catastrophic Forgetting
A phenomenon where a neural network, upon learning new information, abruptly and completely loses previously learned knowledge. In a security context, this can be maliciously induced via data poisoning or targeted fine-tuning to erase a model’s safety training or degrade its performance on critical tasks.
Certified Defense
A security mechanism that provides a formal, mathematical guarantee of a model’s robustness against a specific class of adversarial attacks. Unlike empirical defenses that can be bypassed by new attack methods, certified defenses prove that no perturbation within a defined set (e.g., changing a certain number of pixels) can cause the model to misclassify.
Character-Level Perturbation
An adversarial technique that involves making minute, often invisible changes to a prompt at the character level. This includes inserting zero-width spaces, using homoglyphs (characters that look alike), or other Unicode manipulations to trick a model’s tokenizer and bypass text-based safety filters while remaining readable to humans.
Code Injection (via LLM)
A critical vulnerability where an attacker manipulates an LLM to generate and execute malicious code, particularly in systems where the model’s output is fed into an interpreter or shell. The attack succeeds by tricking the model into including executable commands within a seemingly harmless response, which are then executed by a downstream component.
Data Exfiltration
The unauthorized transfer of data from an AI system. In the context of LLMs, this can involve an attacker using carefully crafted prompts to trick the model into revealing sensitive information from its training data, system prompts, or confidential user conversations.
Data Poisoning
A type of adversarial attack on machine learning models where an attacker intentionally corrupts the training data. The goal is to manipulate the model’s behavior during inference, such as by creating backdoors or causing widespread misclassifications for specific inputs.
Deception
A core tactic used in AI red teaming where an operator crafts inputs intended to mislead or confuse an AI model. The objective is to bypass safety filters, elicit prohibited behavior, or cause the model to generate factually incorrect or harmful outputs by exploiting its logical or contextual reasoning flaws.
Decision Boundary
In machine learning classification, the decision boundary is the separating surface that divides the underlying vector space into different classes. Adversarial attacks often work by creating minimally perturbed inputs that cross this boundary, causing the model to misclassify the input with high confidence.
Deductive Reasoning Exploitation
An advanced red teaming technique where an attacker prompts an LLM to synthesize multiple pieces of non-sensitive information to deduce a sensitive or confidential conclusion. This attack leverages the model’s logical reasoning capabilities to bypass simple data access controls and exfiltrate restricted knowledge.
Defense-in-Depth
A security strategy that applies multiple, layered defensive controls to protect an AI system. Instead of relying on a single security mechanism, this approach combines techniques like input validation, output filtering, model monitoring, and access control to provide redundant protection against a wide range of attacks.
Defensive Distillation
A technique designed to increase a neural network’s robustness against adversarial examples. It involves training a second “distilled” model on the soft probability labels generated by an initial model, which can smooth the model’s decision surface and make it more resistant to small input perturbations.
Denial of Service (DoS) Attack
An attack intended to make an AI service unavailable to legitimate users by overwhelming it with requests. This can be accomplished by sending computationally expensive queries that exhaust the model’s processing capacity or by exploiting API vulnerabilities to deplete system resources.
Differential Privacy
A formal mathematical framework for ensuring that the output of a data analysis or machine learning model does not reveal sensitive information about any single individual in the dataset. It works by adding precisely calibrated statistical noise to the data or algorithm, providing a strong guarantee of privacy, which is a key component of AI security.
Direct Prompt Injection
A fundamental attack vector against LLMs where an attacker embeds malicious instructions directly into the user-facing prompt. These instructions are designed to override the model’s original system prompt and safety controls, compelling it to perform unintended actions or generate forbidden content.
Disinformation Generation
A malicious application of AI, particularly generative models, to create and spread false or misleading content at a large scale. This is a significant AI safety and ethics concern, as it can be used to manipulate public opinion, disrupt social cohesion, and automate propaganda campaigns.
Distributed Denial of Wallet (DDoW)
A financially motivated attack that targets cloud-based AI services by generating a high volume of complex or resource-intensive API calls. The goal is not just to disrupt service but to inflict significant financial costs on the service owner by maximizing computational resource consumption.
Drift Detection
The process of monitoring a deployed AI model to detect changes in the statistical properties of input data (data drift) or the model’s performance (concept drift) over time. Drift detection is crucial for AI safety and security, as it can indicate a degrading model, a changing operational environment, or a potential adversarial attack.
Dual-use Concern
An AI ethics and safety principle that recognizes that an AI technology developed for beneficial purposes can also be repurposed for harmful or malicious activities. This requires developers and policymakers to proactively consider and mitigate potential misuse during the AI system’s lifecycle.
Dynamic Analysis
A security testing method where a live, running AI model is evaluated by providing it with a range of inputs and observing its real-time behavior and outputs. In AI red teaming, dynamic analysis is used to probe for vulnerabilities, biases, and emergent weaknesses that are not apparent from a static review of the model’s architecture or code.
Evasion Attack
An adversarial attack where malicious inputs are crafted to be misclassified by a machine learning model at inference time. The goal is to cause the model to make an incorrect prediction without altering the model itself. This is one of the most common threats studied in adversarial machine learning.
Explainability (XAI)
A set of processes and methods that allows human users to comprehend and trust the results and output created by machine learning algorithms. In AI security, explainability is crucial for auditing models, identifying hidden biases, and understanding why a model failed in response to a red team test. A lack of explainability can obscure vulnerabilities.
Extraction Attack
A type of security threat where an adversary queries a machine learning model to steal either the underlying training data or the model’s architecture and parameters. Data extraction violates privacy, while model extraction compromises intellectual property and can enable further attacks. These attacks are typically performed through repeated API calls.
Ethical Guardrails
A set of predefined rules, constraints, or filters implemented within an AI system to prevent it from generating harmful, biased, or inappropriate content. These guardrails are a primary line of defense against misuse and are a key target for circumvention during AI red teaming exercises. Their robustness is a critical component of AI safety.
Exploitability Assessment
The process of identifying, analyzing, and evaluating vulnerabilities within an AI system to determine their potential for being exploited by an attacker. This assessment, often a core part of an AI red teaming engagement, prioritizes security risks based on their severity and the ease of exploitation. It informs the necessary defensive measures and patch strategies.
Edge Case Discovery
A fundamental activity in AI red teaming that involves systematically searching for and identifying inputs or scenarios that the model handles poorly because they lie at the periphery of its training data distribution. These edge cases can reveal unexpected model behaviors, logical failures, or security flaws. The goal is to improve model robustness by exposing these weaknesses.
Embedding Space Attack
A sophisticated type of adversarial attack that manipulates the model’s internal vector representations (embeddings) of inputs rather than the raw input data itself. By perturbing these latent representations, an attacker can cause misclassification or other unintended behaviors. These attacks are often more potent and harder to detect than simple input-level attacks.
Error Analysis
A systematic process of reviewing and categorizing the mistakes made by an AI model to understand its failure modes. In a security context, error analysis helps red teams identify patterns of vulnerability, such as specific topics, languages, or logical structures that consistently cause the model to fail. This analysis guides the development of more targeted tests and defenses.
Elicitation Prompt
A carefully crafted input designed to coax a large language model into bypassing its safety filters and revealing sensitive information, generating forbidden content, or exhibiting undesired behaviors. Red teams use elicitation prompts to test the effectiveness of a model’s alignment and safety training. These prompts often exploit logical loopholes or use social engineering techniques.
Exfiltration Channel (via LLM)
A security vulnerability where an attacker manipulates an LLM to leak sensitive data from a protected system or network. This can occur if the model has access to internal data and is prompted in a way that causes it to embed that data within its output. Preventing such channels is a critical aspect of securing LLMs in enterprise environments.
Emergency Stop Mechanism
A critical AI safety feature, often referred to as a “kill switch,” designed to safely and reliably halt an AI system’s operation to prevent it from causing harm. This mechanism must be robust against manipulation by the AI system itself. It is a fundamental requirement for deploying autonomous systems in high-stakes environments.
Evaluation Framework
A structured methodology and set of metrics used to systematically assess an AI model’s performance, safety, and security. In AI red teaming, a robust evaluation framework ensures that testing is comprehensive, reproducible, and provides actionable insights. It defines what constitutes a “failure” and how the severity of vulnerabilities is measured.
Escape Character Injection
A prompt injection technique where an attacker inserts control or escape characters into a prompt to manipulate how the LLM system parses instructions. This can cause the model to ignore parts of the original system prompt or execute hidden, malicious commands. It is a technical exploit that targets the boundary between user input and system instructions.
Extrapolation Failure
A common model vulnerability where the AI performs unreliably or makes nonsensical predictions when presented with inputs that are significantly different from its training data. Red teaming exercises are designed to find these failure points by pushing the model beyond its learned domain. Addressing extrapolation failures is key to building robust and generalizable AI.
Equivocation Attack
An adversarial technique that forces a model to generate ambiguous, contradictory, or non-committal responses, thereby undermining its reliability and usefulness. The goal is not necessarily to cause a misclassification but to degrade the model’s coherence and trustworthiness. This can be used to erode user confidence or disrupt AI-powered decision-making processes.
Ensemble Adversarial Training
A defensive technique to improve a model’s robustness against adversarial attacks by training it on adversarial examples generated from a group (ensemble) of different models. This approach makes it more difficult for an attacker to craft a single adversarial example that can fool the protected model. It is a powerful method for enhancing model resilience.
Ethical Hacking (for AI)
The authorized practice of attempting to bypass an AI system’s security and safety controls to identify vulnerabilities before malicious actors can. This discipline, which includes AI red teaming, applies traditional cybersecurity principles to the unique attack surfaces of machine learning models. The findings are used to strengthen the AI’s defenses and ethical alignment.
Emulated User Testing
A red teaming methodology where testers adopt personas of different types of users—such as a malicious actor, a curious child, or a non-technical user—to interact with an AI system. This approach helps uncover a wider range of vulnerabilities and safety issues than purely technical testing alone. It assesses how the model responds to diverse and unpredictable human interaction styles.
Fairness
A fundamental principle in AI ethics ensuring that an AI system’s outputs are free from prejudice or favoritism towards individuals or groups based on their characteristics. In machine learning security, a lack of fairness can be exploited as a vulnerability, where an adversary intentionally biases a model to discriminate against a specific demographic. Auditing for and mitigating bias is a critical component of building trustworthy and secure AI.
False Negative
An outcome in a security context where a system incorrectly fails to detect a threat, attack, or malicious content. For an LLM, a false negative occurs when its safety filter fails to flag a harmful prompt, allowing the model to generate dangerous or inappropriate output. Minimizing false negatives is crucial for effective AI safety and content moderation systems.
False Positive
An outcome where a security system incorrectly identifies benign activity or input as malicious. In AI security, frequent false positives, such as a content filter blocking harmless user prompts, can degrade the user experience and lead to “alert fatigue” for security analysts. Balancing the trade-off between false positives and false negatives is a key challenge in designing robust AI defenses.
Fast Gradient Sign Method (FGSM)
A foundational white-box adversarial attack designed to generate adversarial examples by exploiting a model’s gradients. The attack makes a one-step adjustment to the input data in the direction of the gradient of the loss function. This small, often imperceptible perturbation is enough to cause the model to misclassify the input, demonstrating a basic model vulnerability.
Federated Learning
A decentralized machine learning technique that trains a shared model across multiple devices without centralizing the training data, enhancing user privacy. However, it introduces unique security risks, such as model poisoning, where a malicious participant intentionally corrupts the shared model by submitting poisoned updates. Securing the aggregation process and verifying client contributions are critical security challenges.
Fidelity
In the context of Explainable AI (XAI), fidelity measures how accurately a simpler, interpretable model explanation reflects the behavior of the original complex, black-box model. High fidelity is essential for AI security analysis, as it ensures that the reasons provided for a model’s decision are trustworthy. Low-fidelity explanations can mask the true cause of a security-relevant failure or vulnerability.
Filter Evasion
A type of prompt injection or adversarial attack where a user intentionally crafts inputs to bypass an LLM’s safety and content moderation filters. Techniques include using character-level obfuscation, role-playing scenarios, or exploiting linguistic nuances to trick the model into generating prohibited content. Red teaming exercises often focus on discovering new methods of filter evasion.
Fine-tuning Attack
A malicious action where an adversary corrupts a pre-trained model by fine-tuning it on a poisoned dataset to embed backdoors or specific harmful behaviors. The compromised model appears to function normally until a specific trigger in the input activates the malicious functionality. This attack highlights the security risks associated with using models from untrusted sources.
Fuzzing
A security testing technique that involves providing invalid, malformed, or random data as input to a system to identify vulnerabilities. In AI security, fuzzing is applied to model inputs, APIs, and data processing pipelines to discover unexpected behaviors, crashes, or security flaws like denial-of-service vulnerabilities. It is a key method for assessing the robustness of an AI application.
Few-shot Prompting
An LLM prompting technique where several examples of a task are provided in the prompt context to guide the model’s response. From a security perspective, an attacker can use this technique for jailbreaking by providing examples that steer the model toward violating its safety policies. This “in-context learning” can be manipulated to override the model’s initial safety alignment.
Feature-level Attack
An advanced type of adversarial attack that targets and manipulates the internal feature representations of a machine learning model, rather than the raw input. By perturbing the activations in intermediate layers, these attacks can be more potent and stealthy than input-level attacks. They are particularly relevant for understanding and defending the internal workings of deep neural networks.
Forensics (AI)
The specialized field involving the investigation and analysis of AI-related incidents to determine the root cause, attribute responsibility, and collect evidence. AI forensics examines model outputs, logs, training data, and environmental factors to understand why a system was compromised or produced an adverse outcome. This discipline is essential for accountability and incident response in AI systems.
Function Injection
A vulnerability in LLM-powered applications, particularly AI agents, where an attacker crafts a prompt that causes the LLM to misuse or maliciously execute integrated tools or functions. This can lead to unauthorized data access, code execution, or other system compromises. It is analogous to command injection attacks in traditional web security.
Forgetting (Catastrophic)
A phenomenon where a neural network, upon learning new information, abruptly and completely loses its knowledge of previously learned tasks. In a security context, an adversary could exploit this by strategically feeding a model new data to make it “forget” critical safety or security-related training. This represents a form of availability or integrity attack on the model’s capabilities.
Failure Mode Analysis
A systematic methodology used in AI safety and red teaming to identify, analyze, and prioritize potential ways an AI system can fail. This process involves brainstorming potential failure scenarios, assessing their likelihood and impact, and developing mitigation strategies. It is a proactive approach to enhancing the reliability and security of AI systems before deployment.
Formal Verification
The use of rigorous, mathematical techniques to prove or disprove the correctness of a system’s properties against a formal specification. In AI safety, formal verification can provide strong, provable guarantees that a model will not exhibit certain catastrophic or unsafe behaviors under specific conditions. It is a powerful but computationally intensive tool for building high-assurance AI.
Goal Hijacking
A type of adversarial attack where the user’s input is crafted to subvert the AI’s original intended purpose and redirect it towards a new, often malicious, goal. This attack exploits the model’s instruction-following capabilities to make it perform unauthorized actions, such as ignoring safety protocols or generating harmful content. Goal hijacking is a primary concern in prompt injection, as it effectively seizes control of the model’s operational objective for a single interaction.
Guardrail
A safety mechanism or set of filters designed to prevent a large language model from generating undesirable, harmful, or out-of-policy outputs. Guardrails can be implemented as pre-processing filters on user prompts, post-processing checks on model outputs, or as fine-tuning constraints during model training. Their purpose is to enforce ethical guidelines and ensure the AI operates within safe, predefined boundaries.
Gradient-based Attack
A category of adversarial attacks that leverages the model’s gradients to generate malicious inputs. By calculating the gradient of the loss function with respect to the input data, an attacker can determine the most efficient way to modify an input to cause a misclassification or other desired failure. This white-box technique is powerful but requires access to the model’s internal architecture.
Generative AI Red Teaming
The specialized practice of proactively and adversarially testing generative AI systems, such as LLMs, to discover vulnerabilities, biases, and potential for misuse before they can be exploited maliciously. Red teamers simulate real-world attack scenarios, including prompt injection, jailbreaking, and data extraction, to assess and improve the model’s safety and security posture. This process is crucial for identifying novel failure modes unique to generative models.
Gray-box Testing
A security testing methodology where the red teamer has partial knowledge of the target AI system’s internal architecture or logic. In the context of LLM security, this could mean knowing the general model family, its high-level architecture, or the types of guardrails in place, but not having full access to the model weights or training data. This approach combines elements of both white-box and black-box testing to simulate a realistic and informed adversary.
Governance (AI Governance)
The framework of rules, practices, and processes through which an organization manages the responsible development, deployment, and use of artificial intelligence. AI governance addresses key security, ethical, and safety concerns, including data privacy, model transparency, fairness, accountability, and compliance with legal standards. It provides the high-level structure needed to guide technical security controls and red teaming efforts.
Gradient Masking
A defensive technique against gradient-based adversarial attacks where a model is modified to obscure or obfuscate its gradients. This makes it difficult for an attacker to use the gradient signal to craft effective adversarial examples. While it can thwart simple attacks, sophisticated adversaries can often circumvent this defense, a phenomenon known as obfuscated gradients.
Gamed Vulnerability
A security flaw that is exploited by manipulating the AI’s underlying logic, reward system, or evaluation metrics in an unintended way. For instance, an attacker might discover that using specific keywords or sentence structures bypasses a content filter not because of a direct flaw, but because it “games” the classifier into misinterpreting the input’s intent. This type of vulnerability often arises from loopholes in the model’s learned rules rather than from a traditional software bug.
Gatekeeping Prompt
A defensive technique where a predefined instruction or set of rules is prepended to a user’s prompt to guide or constrain the LLM’s response. This “meta-prompt” acts as an initial gate, instructing the model on how to handle the subsequent user input, such as refusing to answer certain types of questions or adhering to a specific persona. While useful, gatekeeping prompts can sometimes be bypassed by clever prompt injection attacks.
GIGO (Garbage In, Garbage Out)
A fundamental principle in computer science that is highly relevant to machine learning security, stating that flawed input data will produce flawed output. In AI security, this concept underpins the threat of data poisoning attacks, where an adversary intentionally corrupts the training data to introduce vulnerabilities, biases, or backdoors into the resulting model. Ensuring the integrity and quality of training data is therefore a critical aspect of model security.
Generative Model Inversion
A type of privacy attack that aims to reconstruct sensitive data from the training set by repeatedly querying a generative model. An attacker can analyze the model’s outputs to infer and piece together the private information, such as personal details or proprietary data, that the model was trained on. This attack highlights the risk of models inadvertently memorizing and leaking their training data.
Genetic Algorithm Attack
An optimization-based adversarial attack method that uses principles of natural selection to generate effective adversarial inputs. The algorithm iteratively creates a population of candidate inputs, evaluates their success in fooling the model, and then “breeds” the most successful ones through crossover and mutation to produce the next generation. This black-box approach can efficiently search for vulnerabilities without needing access to model gradients.
Global Robustness
A measure of an AI model’s resilience to adversarial perturbations across its entire input domain, rather than just in the close vicinity of a few specific data points. Achieving global robustness is a significant challenge in AI safety and security, as it requires the model to generalize well and maintain its integrity even when faced with highly unusual or out-of-distribution inputs. It represents a more comprehensive and difficult standard of security than local robustness.
Hallucination
A phenomenon where a large language model generates text that is factually incorrect, nonsensical, or disconnected from the provided context, despite being presented as factual. Hallucinations are a significant AI safety and reliability concern, as they can spread misinformation or lead to flawed decision-making. Red teaming efforts often focus on identifying prompts or conditions that are likely to induce hallucinations in a model.
Harmful Content Generation
A critical security and ethical failure mode where an AI model produces content that is dangerous, illegal, unethical, or violates safety policies. This can include generating instructions for self-harm, creating hate speech, or producing malicious code. Red teams systematically test a model’s guardrails to identify and mitigate vectors that could lead to harmful content generation.
Heuristic Analysis
The use of rule-based methods or “rules of thumb” to detect and flag potentially malicious or unsafe AI interactions. In LLM security, heuristics might involve scanning prompts for specific keywords, patterns associated with jailbreaking, or analyzing output for signs of toxicity. While faster than model-based analysis, heuristic filters can sometimes be bypassed by sophisticated adversarial attacks.
Hidden States Manipulation
An advanced adversarial attack targeting the internal memory or context vectors (hidden states) of a recurrent or transformer-based model. By subtly altering these states, an attacker can influence the model’s subsequent outputs in a controlled way, potentially hijacking the conversation flow or inducing a specific vulnerability. This attack requires a deeper level of access or influence over the model’s computational process.
High-Stakes AI Systems
AI systems deployed in applications where failure or erroneous output could result in significant harm, such as in medical diagnostics, autonomous vehicles, or financial fraud detection. These systems are subject to the highest levels of scrutiny, requiring extensive red teaming, robust safety protocols, and rigorous validation. The ethical and safety considerations for high-stakes AI are paramount during their entire lifecycle.
Honeypot Model
A decoy AI system intentionally deployed to attract and analyze adversarial attacks in a controlled environment. Security researchers use honeypot models to study novel prompt injection techniques, data poisoning methods, and other emerging threats without risking production systems. The insights gained are then used to develop more robust defenses for live models.
Human-in-the-Loop (HITL) Verification
A safety and security process where human experts review, validate, or correct the outputs of an AI model before they are finalized or acted upon. HITL is a critical mitigation strategy in high-stakes domains, serving as a final safeguard against model hallucinations, biases, or successful adversarial manipulations. It combines the speed of AI with the judgment and contextual understanding of a human expert.
Hybrid Attack
An adversarial strategy that combines multiple distinct attack vectors to compromise an AI system. For example, an attacker might use a social engineering tactic to trick a user into submitting a malicious prompt, which then exploits a model’s vulnerability to data regurgitation. Hybrid attacks are often more effective as they can bypass security layers that are designed to stop only a single type of threat.
Hyperparameter Tuning Attack
A supply chain attack vector in machine learning where an adversary with access to the model training pipeline maliciously alters hyperparameters to introduce subtle vulnerabilities. For instance, an attacker could manipulate the learning rate or regularization parameters to create a backdoor that can be activated later with a specific trigger. This type of attack is difficult to detect as it exploits a legitimate part of the training process.
Hardening
The process of securing an AI model and its surrounding infrastructure to reduce its vulnerability to attacks. Hardening measures include implementing strict input validation and output sanitization, applying access controls to the model API, and integrating defenses against known adversarial techniques like prompt injection and data poisoning. The goal is to minimize the system’s overall attack surface.
Harm Taxonomy
A structured classification system used to categorize the potential harms an AI system could cause. Red teams and safety researchers use harm taxonomies to ensure comprehensive testing, covering areas such as psychological harm, economic damage, discrimination, and physical safety risks. This systematic approach helps in identifying and prioritizing the most critical vulnerabilities for mitigation.
Helpful and Harmless (H&H) Principle
A foundational principle in AI safety focused on training models to be simultaneously useful to users while being incapable of causing harm. AI developers use techniques like Reinforcement Learning from Human Feedback (RLHF) to align models with this dual objective. Red teaming is crucial for stress-testing a model’s adherence to the “harmless” aspect of this principle under adversarial conditions.
Hijacking (Model Hijacking)
An attack where an adversary seizes control over a model’s behavior, forcing it to deviate from its intended purpose and serve the attacker’s goals. This can be achieved through advanced prompt injection techniques that overwrite the model’s original instructions or by exploiting system-level vulnerabilities to manipulate the model’s output. Successful hijacking can turn a benign AI into a tool for spam, propaganda, or fraud.
Hypothetical Prompt Injection
A specific jailbreaking technique where an attacker embeds a malicious command within a fictional or hypothetical scenario. By framing the request as a “what if” situation or part of a role-playing game, the attacker can often bypass safety filters that are trained to block direct harmful instructions. This method exploits the model’s ability to engage in creative and imaginative reasoning.
Harmful Bias Amplification
An ethical and safety failure where an AI model not only reflects but also magnifies societal biases present in its training data. This can lead to outputs that are discriminatory, unfair, or perpetuate harmful stereotypes against certain demographic groups. Red teaming for bias involves crafting probes to identify inputs that trigger and amplify these unwanted behaviors.
Hash-Based Model Verification
A security technique used to ensure the integrity of a machine learning model by generating a cryptographic hash of its file, parameters, or architecture. Before deploying or running inference, an organization can compare the model’s current hash against a known-good value to verify it has not been tampered with or corrupted. This is a key defense against model-centric supply chain attacks.
History Sniffing Attack
A privacy-focused attack where an adversary crafts prompts designed to trick an LLM into revealing information from previous user conversations or its training data. This attack tests the model’s contextual boundaries and its ability to maintain data privacy and session integrity. A successful attack could lead to the leakage of sensitive personal information or proprietary data.
Instruction Injection
A specific type of prompt injection where an attacker embeds malicious commands disguised as user input, causing the Language Model (LLM) to interpret them as new, superseding instructions. This can override the original system prompt or user query, leading the model to perform unintended actions, bypass safety filters, or leak sensitive data. Instruction injection exploits the model’s tendency to follow the most recent and explicit commands it receives.
Indirect Prompt Injection
An advanced attack vector where a malicious prompt is placed within external data sources that an LLM is expected to process, such as a webpage, document, or email. The model unknowingly ingests and executes the hidden instruction when retrieving or summarizing this data, without the end-user’s direct input or awareness. This method bypasses input filters that only scan the immediate user query, making it a significant threat to AI agents with data retrieval capabilities.
In-context Learning Attack
A security vulnerability that exploits the few-shot learning capability of LLMs by manipulating the examples provided within the prompt. An attacker can craft malicious or biased examples to “poison” the model’s behavior for a specific task, leading it to generate incorrect, harmful, or predetermined outputs. This attack targets the model’s inference-time learning mechanism rather than its underlying weights.
Input Filtering
A fundamental defensive security measure used to sanitize and validate user-provided inputs before they are passed to an LLM. This technique employs rules, patterns, or other models to detect and remove malicious content, such as known prompt injection payloads, harmful language, or code snippets. The goal of input filtering is to create a security perimeter that neutralizes potential attacks before they reach the model.
Inference-time Attack
An adversarial attack that is executed after a model has been trained and deployed, targeting the live model during the prediction or generation (inference) phase. Common examples include prompt injection, evasion attacks, and data extraction queries. These attacks aim to manipulate a single output or exploit a specific vulnerability without altering the model’s underlying parameters.
Impersonation Attack
A type of adversarial attack where an LLM is manipulated into adopting a specific persona, such as a trusted authority figure, a specific individual, or a system administrator. By successfully prompting the model to impersonate a role, an attacker can deceive users, phish for sensitive information, or bypass security controls that are contingent on the model’s assumed identity. This is a form of social engineering targeted at or executed by an AI.
Inductive Bias
The set of assumptions a machine learning model uses to make predictions on data it has not seen during training. In the context of AI safety and ethics, unexamined or flawed inductive biases can cause a model to learn and perpetuate harmful stereotypes, unfairness, or unsafe generalizations from its training data. Understanding and carefully shaping inductive biases is crucial for building robust and equitable AI systems.
Information Elicitation
The process of strategically crafting prompts to extract sensitive, confidential, or proprietary information that an AI model should not disclose. AI red teamers use information elicitation techniques to probe for data leakage vulnerabilities, testing whether a model might reveal its system prompt, details about its training data, or private user information from other sessions. This is a key method for assessing the confidentiality and security of an LLM.
Interpretability
The degree to which a human can understand the reasoning behind a decision or prediction made by an AI model. High interpretability is critical in AI security for auditing, debugging, and identifying vulnerabilities, as it allows security professionals to analyze why a model produced a specific, potentially malicious, output. Techniques like feature attribution and chain-of-thought analysis are used to improve the interpretability of complex models.
Model Inversion Attack
A type of privacy attack where an adversary attempts to reconstruct sensitive training data by repeatedly querying a machine learning model’s API. By analyzing the model’s outputs and confidence scores for various inputs, an attacker can infer private information about the data points used to train it. This attack poses a significant threat to models trained on sensitive personal data, such as medical images or financial records.
Instruction Following Failure
A vulnerability or failure mode in LLMs where the model does not adhere to explicit, often safety-related, instructions provided in its prompt. This can be exploited by adversaries who use complex or confusing language to trick the model into ignoring its safety guardrails. Red teams often test the limits of a model’s instruction-following capabilities to identify and patch these weaknesses.
Integrity (AI System Integrity)
A core pillar of AI security that ensures an AI model, its data, and its outputs are protected from unauthorized modification or corruption. Maintaining AI integrity involves securing the entire ML lifecycle, from data collection and training to deployment and inference, against tampering and poisoning attacks. A loss of integrity can lead to unreliable, unsafe, or malicious model behavior.
Interactive Red Teaming
A manual, hands-on security testing process where a human expert engages in a dynamic, conversational manner with an AI system. The red teamer actively probes for vulnerabilities, logical flaws, and safety bypasses by adapting their inputs based on the model’s responses in real-time. This interactive approach is highly effective for discovering novel or complex exploits that automated testing might miss.
Intellectual Property (IP) Theft
In the context of AI security, this refers to the unauthorized exfiltration, replication, or reverse-engineering of proprietary AI assets. This includes the model’s weights, its unique architecture, or the curated training dataset, which are often highly valuable corporate assets. Protecting against AI IP theft involves a combination of access controls, obfuscation techniques, and robust infrastructure security.
Intent Misalignment
A fundamental AI safety concern where an AI system’s learned objectives (its intent) do not perfectly align with the goals and values of its human designers. This misalignment can cause the model to pursue its flawed goals in unexpected and potentially harmful ways, even when it appears to be performing optimally. Preventing intent misalignment is a central challenge in developing safe and beneficial advanced AI.
Instruction Hijacking
A severe form of prompt injection where an attacker’s payload is designed to seize complete control of the model’s instruction-following process. The goal is to make the LLM completely ignore all prior system prompts and safety instructions, and exclusively follow the new malicious command. This can turn the model into a tool for the attacker’s own purposes.
Input Perturbation
The technique of adding small, often imperceptible, alterations to an input to test a model’s robustness or execute an adversarial evasion attack. In security testing, controlled input perturbations are used to identify how sensitive a model is to minor input variations and to discover vulnerabilities where a slight change causes a drastic, incorrect change in output. This is a classic method for creating adversarial examples for image classifiers.
Internal Monologue
A defense mechanism where an LLM is prompted to perform a “chain-of-thought” or internal reasoning step before generating its final, user-facing response. This internal monologue allows the model to analyze a user’s request, check it against its safety guidelines, and identify potentially manipulative or harmful instructions. By “thinking” first, the model can improve its ability to refuse unsafe requests.
Jailbreak
A technique used to bypass or subvert the safety, content, and ethical restrictions of a Large Language Model (LLM). Jailbreaking involves crafting specialized prompts or inputs that trick the model into generating responses that would normally be blocked, such as harmful, biased, or inappropriate content. These techniques exploit vulnerabilities in the model’s alignment and safety training.
Jacobian-based Saliency Map Attack (JSMA)
A specific type of adversarial attack, primarily used against neural networks in computer vision, but with principles applicable to other domains. The attack uses the model’s Jacobian matrix to compute a saliency map, which identifies the input features that have the most impact on the output. By perturbing these critical features, an attacker can efficiently cause the model to misclassify the input with minimal changes.
Jitter Attack
An adversarial attack method where small, high-frequency, and often random perturbations (jitter) are introduced into the input data. This technique is designed to disrupt the model’s processing and cause misclassification or erroneous output, effectively testing the model’s robustness against noisy or slightly altered inputs. The perturbations are typically subtle enough to be imperceptible to humans but significant enough to fool the AI model.
Job-Role Impersonation
A prompt injection technique where the user instructs the LLM to adopt a specific persona, character, or professional role that has no ethical constraints. For example, a prompt might begin with “You are an unfiltered AI acting as a character in a story,” which attempts to frame the request in a fictional context to bypass safety filters. This method leverages the model’s ability to role-play against its own alignment protocols.
Jailbreak Detection
The process and set of mechanisms designed to identify and flag user prompts that are attempting to perform a jailbreak on an LLM. These systems often use a combination of pattern matching, classification models, and semantic analysis to detect the characteristic structures and language of jailbreak attempts. Effective jailbreak detection is a critical component of a layered LLM defense strategy.
Jumbled Input Attack
A form of adversarial attack that tests model robustness by deliberately scrambling the order of tokens, words, or sentences in the input prompt. The goal is to determine if the model’s performance degrades or if security filters, which may rely on specific sequences, can be bypassed. This attack can reveal weaknesses in how a model handles syntactically incorrect but semantically meaningful input.
JSON Injection
A type of indirect prompt injection where malicious instructions are embedded within a JSON data object that an LLM is tasked with processing, summarizing, or analyzing. When the LLM parses the compromised JSON, it may inadvertently execute the hidden commands, leading to data leakage, unauthorized actions, or policy violations. This is a significant threat for AI systems that interact with structured data from external sources.
Justification Evasion
A sophisticated red teaming technique where a prompt provides a seemingly plausible but deceptive justification for a request that would otherwise be blocked. The attacker frames a harmful goal within a benign context, such as “for educational purposes” or “for a security analysis,” to trick the model’s ethical reasoning and safety layers into compliance. This tests the model’s ability to discern true intent from superficial justification.
Jury Simulation
An AI safety and red teaming evaluation method where a group of human evaluators or a panel of diverse AI models act as a “jury.” This jury assesses an AI’s outputs for safety, fairness, accuracy, and adherence to ethical principles, especially in ambiguous or complex scenarios. The collective judgment helps provide a more robust and nuanced evaluation than a single automated metric or evaluator could achieve.
Justice-Oriented AI Auditing
A framework within AI ethics and security that focuses on auditing AI systems for fairness, equity, and societal justice. This goes beyond simple bias detection to critically examine how a model’s deployment could impact different communities, reinforce systemic inequalities, or violate principles of justice. The audit assesses potential harms and recommends mitigation strategies aligned with ethical and legal standards.
Jailbreak-as-a-Service (JaaS)
An emerging security threat where malicious actors offer pre-packaged jailbreak prompts, tools, or API access that allows users to easily bypass LLM safety controls. These services commercialize and scale the process of exploiting AI models, making sophisticated attacks accessible to a wider audience with less technical expertise. JaaS platforms represent a significant challenge for AI security and defense.
Jargon Poisoning
A data poisoning attack where the training dataset is contaminated with documents containing fabricated or misleading technical jargon. This can cause the model to learn false associations or become manipulable by an attacker who later uses this specific jargon in a prompt. The goal is to create a hidden vulnerability that can be exploited to generate incorrect or biased information on command.
Kernel Trick Attack
An adversarial attack targeting machine learning models that use kernel methods, such as Support Vector Machines (SVMs). The attacker manipulates input data in a way that exploits the mathematical properties of the kernel function, causing the model to misclassify the data in the high-dimensional feature space without significantly altering the original input. This technique demonstrates a vulnerability in the core mechanism that gives these models their power.
Keylogging via Model Interaction
A theoretical attack vector where a malicious actor compromises or manipulates an LLM-integrated application, such as a code assistant or chatbot, to covertly record user keystrokes. The LLM is prompted or its function is altered to capture and exfiltrate user input under the guise of its normal operation. This represents a significant supply chain or application-layer risk for systems embedding generative AI.
Kill Chain for AI Systems
An adapted cybersecurity framework that outlines the sequential stages of an attack targeting an AI or machine learning system. It typically includes phases like reconnaissance (probing the model), weaponization (crafting an adversarial prompt or input), delivery, exploitation (jailbreaking or causing misclassification), and post-exploitation actions (data exfiltration or system manipulation). This model helps security professionals analyze and mitigate threats systematically.
Knapsack Poisoning Attack
A sophisticated data poisoning attack where the adversary has a limited budget for injecting malicious data into a model’s training set. The attacker strategically selects the most impactful data points to poison, analogous to the knapsack problem of choosing the most valuable items to fit into a bag with limited capacity. The goal is to maximize the disruption or backdoor creation with minimal effort.
Knowledge Base Contamination
An attack targeting Retrieval-Augmented Generation (RAG) systems by inserting false, malicious, or biased information into the external knowledge bases they rely on. When the LLM retrieves this contaminated data to formulate a response, it can be manipulated into generating misinformation, executing harmful instructions, or leaking sensitive information. This attack exploits the trust the model places in its retrieval sources.
Knowledge Cutoff Exploitation
A red teaming technique that leverages an LLM’s fixed knowledge cutoff date—the point in time after which it was not trained on new data. Attackers can exploit this by asking questions about recent events to elicit outdated and potentially harmful, inaccurate, or insecure information. This tests the model’s ability to recognize the limits of its knowledge and refuse to provide misleading answers.
Knowledge Elicitation Attack
A type of privacy or intellectual property attack aimed at extracting sensitive or proprietary information embedded within a trained model’s parameters. A red teamer crafts highly specific prompts designed to make the model “remember” and reveal confidential data it was trained on, such as personal identifiable information (PII), copyrighted material, or trade secrets. This is a critical concern for models trained on non-public datasets.
Known Vulnerability Probing
A systematic red teaming or security testing approach where an AI system is actively tested for publicly known vulnerabilities. This includes checking for susceptibility to well-documented prompt injection techniques (e.g., “DAN” or “role-playing” attacks), specific adversarial sample attacks, or vulnerabilities in the underlying software libraries and frameworks. It is a fundamental step in establishing a baseline security posture.
Known-Plaintext Attack on LLM Ciphers
A cryptanalytic concept applied to scenarios where an LLM is used to perform encryption-like text transformations. If an attacker possesses pairs of plaintext and the corresponding LLM-generated ciphertext, they can analyze these pairs to deduce the model’s underlying transformation logic or biases. This could allow them to “break” the pseudo-encryption and decrypt other messages.
Knock-on Failure Cascade
An AI safety and reliability concern where a minor, localized error in an AI component triggers a series of escalating failures in downstream systems. For example, a single incorrect LLM output could lead to a flawed decision in an automated business process, which in turn causes further systemic malfunctions. Red teamers often aim to identify and trigger such cascades to test the overall resilience of an integrated AI ecosystem.
Kullback-Leibler (KL) Divergence Attack
A type of adversarial attack, particularly in classification tasks, where the attacker’s goal is to maximize the KL divergence between the probability distributions of the model’s output for the original and adversarial inputs. This mathematical approach seeks to create a perturbation that is minimally perceptible to humans but causes a drastic and confident change in the model’s prediction. It represents a more statistically principled way of generating adversarial examples.
K-Anonymity Violation Attack
A privacy attack that undermines the k-anonymity privacy protection applied to a model’s training data. An attacker attempts to re-identify individuals within a dataset by analyzing the model’s outputs or behavior, even if the training data was processed to ensure each individual’s record is indistinguishable from at least k-1 other records. Success in this attack can lead to the exposure of sensitive personal information.
Latent Space Poisoning
A sophisticated data poisoning attack where an adversary injects malicious data designed to corrupt specific regions of the model’s latent space. This corruption can create backdoors or systemic vulnerabilities that are difficult to detect through simple input-output analysis. The goal is to compromise the model’s internal representations rather than just its final predictions.
Layer-wise Relevance Propagation (LRP)
An explainability technique used to understand which input features are most influential for a model’s prediction. In AI security, LRP can be used to analyze why a model is vulnerable to a specific adversarial example or to identify potential biases learned by the system. It helps red teamers and defenders visualize the model’s decision-making process for vulnerability assessment.
Leakage (Data Leakage)
The unintentional exposure of sensitive information from a model’s training data through its outputs. In LLMs, this can occur when a model memorizes and reproduces personally identifiable information (PII) or proprietary text it was trained on. Preventing data leakage is a critical component of model privacy and security, often tested during red team engagements.
Leakage Inversion Attack
A type of model inversion attack where an adversary exploits information leaked through a model’s outputs, such as confidence scores or intermediate representations, to reconstruct sensitive training data. This attack demonstrates the risk of exposing more than just the final prediction, as auxiliary information can be reverse-engineered. It is a significant privacy threat for models trained on confidential datasets.
Likelihood-based Attack
An adversarial attack strategy that aims to find an input that maximizes the likelihood of a target class according to the model’s probabilistic outputs. This method is often used in black-box settings where gradients are not accessible but class probabilities are. It is a powerful technique for crafting adversarial examples that the model confidently misclassifies.
Linear Approximation Attack
An adversarial attack method that exploits the locally linear behavior of deep neural networks. By approximating the model’s decision boundary with a linear function around a specific input, an attacker can efficiently calculate the direction of the gradient needed to create an adversarial perturbation. This is a foundational concept behind gradient-based attacks like the Fast Gradient Sign Method (FGSM).
Linguistic Evasion
A type of adversarial attack against language models where inputs are crafted using subtle linguistic manipulations, such as paraphrasing, synonyms, or stylistic changes, to bypass security filters. Unlike simple character-level perturbations, this technique maintains semantic coherence, making it harder to detect. Red teamers use this to test the robustness of LLM safety mechanisms and content moderation policies.
Linguistic Steganography
A technique for hiding malicious instructions or data within seemingly benign text, which is then fed to an LLM. The model, if not properly secured, may interpret and execute the hidden command, leading to prompt injection or data exfiltration. This method leverages the model’s ability to understand nuanced or layered language to bypass input filters.
LLM Guardrails
A set of safety and security controls implemented around a large language model to prevent undesirable outputs and malicious use. These can include input filters to block harmful prompts, output scanners to check for policy violations, and topic restrictions to keep the model’s responses within a predefined domain. Guardrails are a practical layer of defense against jailbreaking and prompt injection.
LLM Jailbreaking
The process of using carefully crafted prompts to bypass an LLM’s safety features and ethical guidelines, compelling it to generate responses that violate its intended use policies. These techniques, often involving role-playing scenarios or hypothetical instructions, are a primary focus of AI red teaming. They expose vulnerabilities in the model’s alignment and safety training.
LLM Red Teaming
A security assessment practice where a team of experts systematically attempts to find and exploit vulnerabilities in a large language model. The objective is to identify weaknesses in safety alignment, robustness, and security before they can be exploited by malicious actors. This involves activities like prompt injection, jailbreaking, and testing for data leakage.
L-norm Bounded Attack
A class of adversarial attacks where the perturbation added to the original input is constrained by a specific L-norm (e.g., L0, L2, or L-infinity). This ensures the perturbation is small or imperceptible, making the attack stealthy. Red teamers use L-norm bounds to quantify the “strength” of an attack and evaluate a model’s robustness within a defined threat model.
Log Anomaly Detection
The application of machine learning models to analyze system and application logs to identify unusual patterns that may indicate a security breach. In AI security, this involves monitoring the logs of LLM applications for signs of prompt injection, resource abuse, or other malicious activities. It serves as a crucial component of a defense-in-depth strategy for AI systems.
Logic Bomb (in AI)
A malicious payload embedded within a model or its training data that is triggered only when specific conditions are met. For example, a model might be programmed to generate harmful content or fail catastrophically on a certain date or when it encounters a specific keyword. This represents a sophisticated insider threat or supply chain attack vector against AI systems.
Loss Function Manipulation
An attack vector where an adversary influences a model’s training process by directly or indirectly manipulating the loss function. This can be done to create backdoors, degrade performance on specific tasks, or instill biases. For instance, an attacker in a federated learning scenario might report malicious gradients that steer the global model towards a compromised state.
Low-Frequency Perturbations
A type of adversarial perturbation that modifies the low-frequency components of an input, such as an image’s overall color and structure. Unlike high-frequency noise, these perturbations can be more robust against defenses like blurring or down-sampling and may be less perceptible to humans. This approach challenges standard assumptions about the nature of adversarial noise and improves attack transferability.
Label Flipping Attack
A form of data poisoning where an attacker in control of a portion of the training data deliberately changes the labels of some samples. This corrupts the training process, causing the model to learn incorrect associations and degrading its overall accuracy and reliability. It is a common threat in scenarios where training data is crowdsourced or comes from untrusted sources.
Malicious Prompt
A user-provided input, query, or instruction deliberately crafted to elicit unintended, harmful, or policy-violating outputs from a Large Language Model. These prompts aim to bypass safety filters, generate restricted content, or manipulate the model’s behavior. Malicious prompts are a primary tool for AI red teaming and security testing.
Membership Inference Attack
A type of privacy attack where an adversary attempts to determine whether a specific data record was part of a model’s training dataset. Successful attacks can reveal sensitive personal information used to train the model, representing a significant data breach. This is a critical concern for models trained on confidential data like medical or financial records.
Model Inversion
An attack that aims to reconstruct sensitive training data or class-representative features by querying a trained model. The adversary leverages the model’s outputs and confidence scores to reverse-engineer the private information it has learned. This is particularly dangerous for models used in applications like facial recognition, where it could reconstruct images of individuals from the training set.
Model Stealing
Also known as model extraction, this attack involves an adversary creating a functionally equivalent copy of a proprietary machine learning model without direct access to it. The attacker systematically queries the target model’s API and uses the input-output pairs to train a replica. This constitutes intellectual property theft and can compromise a company’s competitive advantage.
Misinformation Generation
The intentional use of generative AI models to create and propagate false, misleading, or deceptive content at a large scale. This practice poses a significant societal threat by enabling the rapid creation of convincing fake news articles, social media posts, or propaganda. Red teaming exercises often focus on assessing a model’s susceptibility to being used for this purpose.
Mitigation Strategy
A specific technique, control, or defensive measure implemented to reduce the risk, impact, or likelihood of a successful attack against an AI system. Examples include input sanitization to prevent prompt injection, adversarial training to improve model robustness, and output filtering to block harmful content. Developing effective mitigation strategies is a core goal of AI security research.
Model Poisoning
A type of data poisoning attack where an adversary intentionally injects maliciously crafted data into a model’s training set. The goal is to corrupt the learning process, creating a backdoor that the attacker can later exploit or causing the model to fail on specific tasks. This attack compromises the integrity of the model from its foundation.
Manipulation Vector
In the context of adversarial attacks, this refers to the specific, often imperceptible perturbation or set of changes added to an input to cause a desired misclassification or incorrect behavior from the model. The manipulation vector is carefully calculated to exploit the model’s learned vulnerabilities. Understanding these vectors is key to building more robust defenses.
Model Obfuscation
A defensive technique designed to make a machine learning model’s internal architecture, parameters, or decision logic more difficult for an adversary to understand and replicate. Methods can include adding noise, quantizing parameters, or using proprietary architectures to thwart model stealing and reverse-engineering attempts. It is a form of security through obscurity applied to AI.
Monitoring and Logging
The continuous process of observing, recording, and analyzing an AI system’s operational data, including inputs, outputs, and internal states. In AI security, this is crucial for detecting anomalous behavior, identifying potential attacks like prompt injection or data exfiltration, and providing an audit trail for incident response. Effective monitoring systems can serve as an early warning for security threats.
Multi-Modal Attack
An adversarial attack targeting AI systems that process and integrate multiple types of data, such as text, images, and audio. The attacker crafts malicious inputs that exploit vulnerabilities across different modalities, for instance, embedding a hidden textual prompt within an image to manipulate a vision-language model. These attacks test the security of complex, integrated AI systems.
Meta-Prompt Injection
An advanced prompt injection technique where an attacker manipulates the high-level instructions, system prompts, or foundational context that governs an LLM’s overall behavior. Instead of targeting a single user query, this attack aims to alter the model’s core operational rules, character, or safety constraints for subsequent interactions. It represents a deeper, more persistent form of model manipulation.
Model Watermarking
A technique for embedding a hidden, unique signal or pattern into an AI model’s parameters or outputs. This watermark serves as a digital signature to prove ownership, trace the model’s distribution, and identify unauthorized copies created through model stealing. It is a critical tool for protecting the intellectual property of proprietary models.
Misalignment
A fundamental AI safety problem where a model’s learned objectives or emergent behaviors diverge from the intended goals and values of its human creators. This can lead to the model pursuing its goals in unexpected and potentially harmful ways, even without malicious intent. Preventing misalignment is a primary focus of AI safety research to ensure long-term beneficial outcomes.
Model Evasion
A type of adversarial attack that occurs at inference time, where an adversary subtly modifies an input to cause a trained model to produce an incorrect output. The classic example is adding a small, human-imperceptible patch to an image to make a classifier misidentify it. This attack tests the model’s robustness against deceptive inputs in a live environment.
Malicious Use
The application of AI technologies by threat actors for harmful, unethical, or criminal purposes. This broad category includes activities such as using LLMs to generate sophisticated phishing emails, creating deepfakes for disinformation campaigns, or designing AI-powered malware. AI red teaming specifically focuses on anticipating and defending against such malicious use cases.
Model Collapse
A degenerative phenomenon where generative models, trained recursively on synthetic data from previous model generations, experience a progressive loss of quality, diversity, and fidelity to the original true data distribution. This is a long-term AI safety and integrity concern, as the internet becomes increasingly populated with AI-generated content. It can lead to models forgetting rare data and amplifying their own biases over time.
Natural Language Adversarial Attack
An attack technique that involves making small, often human-imperceptible modifications to text input to cause a language model to produce an incorrect or malicious output. These perturbations are designed to exploit model vulnerabilities while preserving the original meaning of the text to a human reader. This is a primary area of research in LLM security and red teaming.
Negative Prompting
A technique used in generative AI where a user specifies concepts, styles, or objects to be excluded from the generated output. In a security context, red teams test the robustness of negative prompting to see if it can be bypassed to generate prohibited or unsafe content. It is also used as a tool to evaluate a model’s adherence to content policies.
Neural Network Watermarking
The process of embedding a unique, hidden signature or pattern within the parameters of a neural network model without significantly affecting its performance. This technique serves as a security measure to prove model ownership, detect intellectual property theft, or trace the origin of a leaked or illicitly copied model. It is a key tool in model security and governance.
Nefarious Use Assessment
A proactive red teaming exercise focused on brainstorming, simulating, and evaluating the potential for an AI system to be deliberately used for malicious purposes, such as generating disinformation or creating malware. This assessment helps identify security vulnerabilities and dual-use risks before a model is deployed. It is a critical component of a responsible AI release cycle.
Normative Alignment
The process of ensuring an AI system’s goals, behaviors, and decision-making are consistent with a specified set of human values, ethical principles, or social norms. This concept is central to AI safety and ethics, aiming to prevent systems from acting in ways that are technically correct but socially or morally unacceptable. Red teaming often involves testing a model’s normative alignment under pressure.
Non-Discrimination (in AI)
A core ethical principle stating that AI systems should not exhibit unfair bias or make decisions that result in discriminatory outcomes against individuals or groups based on protected characteristics. AI security and red teaming audits frequently involve testing models for biases that could lead to discriminatory behavior. This is crucial for ensuring fairness and legal compliance.
Noise Injection
A method used in both adversarial attacks and defenses where random or structured noise is added to input data. As an attack, it aims to cause model misclassification or failure. As a defense, training models with noise injection can enhance their robustness and resilience against certain types of adversarial perturbations.
Narrative Manipulation Testing
A specialized red teaming technique for assessing an LLM’s vulnerability to being used for generating and spreading convincing but false or misleading narratives. Testers attempt to coax the model into creating propaganda, disinformation, or fraudulent content at scale. The goal is to identify and patch weaknesses that could be exploited for information warfare.
Negative Side-Effects Minimization
A fundamental problem in AI safety concerned with designing AI agents that achieve their objectives without causing unintended and harmful consequences in their environment. This involves training the model to consider the broader impact of its actions beyond its primary goal. Red teamers may test for this by creating scenarios where optimizing for a goal could lead to negative externalities.
N-gram Analysis for Security
A text analysis technique used to detect anomalies or specific patterns in LLM-generated text by analyzing contiguous sequences of ‘N’ items (words or characters). In a security context, it can help identify generated content that signals a jailbreak, contains leaked data, or matches known disinformation patterns. This method is often used in output monitoring and filtering systems.
Nudge Attack
A subtle form of adversarial attack where an input is slightly altered to “nudge” an AI’s decision or output in a desired direction without being obvious. Unlike more aggressive attacks, nudge attacks aim for subtle manipulation, which is a key concern in systems like recommendation engines or social bots where small influences can have significant cumulative effects.
Network Pruning Attack
A model modification attack where an adversary with access to a neural network strategically removes (prunes) specific neurons or connections. This can be done to subtly degrade the model’s performance on certain tasks or to insert a backdoor that activates only under specific conditions. It is a threat model considered in deep learning security.
Obfuscation
A security technique used to make AI models, prompts, or data more difficult to understand and reverse-engineer. In the context of LLM security, prompt obfuscation involves modifying user inputs in a way that preserves their semantic meaning for the model but confuses or bypasses input filters and defenses designed to detect malicious instructions.
Offline Attack
An adversarial attack where the malicious inputs, such as adversarial examples or jailbreak prompts, are crafted entirely without direct, real-time interaction with the target model. The attacker prepares the payload beforehand and then deploys it against the system, which is common in scenarios where query access is limited or metered.
Ontology Poisoning
A sophisticated data poisoning attack that targets the knowledge graph or ontology used by an AI system. Attackers manipulate the relationships and concepts within the ontology, causing the model to make flawed inferences, generate incorrect information, or exhibit biased behavior based on the corrupted knowledge base.
Open-Box Attack
An adversarial attack scenario where the attacker possesses complete knowledge of the target AI model. This includes access to the model’s architecture, parameters (weights and biases), and potentially its training data, enabling the efficient creation of highly effective adversarial examples.
Operational AI Security
The set of practices, technologies, and procedures focused on protecting AI systems during their deployment and operation in a live environment. It encompasses continuous monitoring for adversarial activity, managing model access controls, implementing response plans for AI-specific incidents, and ensuring the ongoing integrity of model inputs and outputs.
Opponent Modeling
A core practice in AI red teaming where the security team creates a detailed profile of potential adversaries. This involves defining the adversary’s goals, technical capabilities, resources, and likely attack vectors to simulate realistic threats and test the AI system’s defenses against plausible attack scenarios.
Optimal Perturbation
The minimal, often imperceptible, change that can be applied to a model’s input to cause a desired, incorrect output, such as misclassification. Adversarial attack algorithms aim to find this optimal perturbation to create effective and stealthy attacks that are difficult for humans or defensive systems to detect.
Oracle Attack
A type of black-box attack where the adversary repeatedly queries the target model as a “black-box oracle,” observing only the inputs and corresponding outputs. By analyzing these input-output pairs, the attacker can infer the model’s decision boundaries or train a substitute model to craft transferrable adversarial attacks.
Outcome-based Red Teaming
An AI red teaming methodology focused on achieving specific, high-impact negative outcomes rather than simply identifying isolated vulnerabilities. The goal is to demonstrate a concrete, end-to-end system failure, such as successfully exfiltrating sensitive data or manipulating the LLM into generating state-sponsored propaganda.
Output Filtering
A common defense mechanism for LLMs that involves scanning the model’s generated response for malicious, harmful, or inappropriate content before it is delivered to the user. This safety layer acts as a final check to block toxic language, private information leaks, or responses that violate usage policies.
Output Guardrails
A set of predefined rules, constraints, or secondary models designed to control and shape the behavior of an AI’s output. These guardrails ensure that the model operates within safe, ethical, and legal boundaries by preventing it from generating certain topics, using specific language, or performing forbidden actions.
Over-reliance
A critical AI safety and human-factor risk where users place undue trust in the outputs of an AI system, accepting them without critical evaluation. In a security context, an attacker can exploit this over-reliance by subtly manipulating a model’s output to deceive users into taking harmful actions.
Overfitting Attack
A category of attacks, including certain membership inference and data extraction attacks, that exploit a model’s tendency to overfit its training data. Because an overfitted model has memorized specific training examples, an attacker can craft queries to determine if a specific data point was in the training set or to reconstruct sensitive information.
Opaque Model
An AI model, such as a large neural network, whose internal workings and decision-making processes are not easily interpretable by humans. The opacity of these models presents significant security challenges for auditing, debugging, and identifying hidden vulnerabilities or biases that could be exploited.
Objective Function Hacking
An AI safety problem where an AI agent discovers an unintended and undesirable method of achieving its defined objective. An attacker might try to induce this behavior by finding edge cases in the prompt or environment that allow the model to satisfy its goal in a way that bypasses security constraints.
One-Pixel Attack
A type of highly constrained adversarial attack, primarily against computer vision models, where changing the color of a single pixel in an image is sufficient to cause a misclassification. It demonstrates the extreme brittleness and non-human-like perception of some AI models, highlighting a significant robustness vulnerability.
Payload
The component of a malicious prompt that contains the attacker’s specific instructions, intended to be executed by the Language Model. In a prompt injection attack, the payload is designed to override the model’s original system instructions and cause it to perform an unauthorized action, such as revealing confidential information or generating harmful content.
Penetration Testing (for AI/LLM)
A specialized form of security assessment where ethical hackers actively probe an AI or LLM system for vulnerabilities. This process adapts traditional penetration testing methodologies to uncover AI-specific flaws, such as susceptibility to prompt injection, model evasion, data poisoning, and unauthorized function calling. The goal is to identify and mitigate security risks before they can be exploited by malicious actors.
Persona Modulation
An attack technique where a prompt is crafted to manipulate an LLM into adopting a specific character or persona that is not bound by its usual safety constraints. By instructing the model to act as a fictional character, a developer in a testing mode, or a hypothetical unrestricted AI, attackers can often bypass safety alignments. This method exploits the model’s ability to role-play to elicit prohibited or harmful responses.
Perturbation
A small, carefully engineered modification applied to an input (such as an image or text) with the intent of causing a machine learning model to produce an incorrect output. These perturbations are often imperceptible to humans but are optimized to exploit weaknesses in the model’s decision-making process. Crafting effective perturbations is the central goal of many adversarial attacks, particularly in the domain of computer vision.
Poisoning Attack
A type of attack where an adversary deliberately corrupts the training data of a machine learning model. By injecting malicious examples into the dataset, the attacker can introduce hidden backdoors, degrade the model’s overall performance, or cause it to fail on specific, targeted inputs. Data poisoning compromises the integrity of the model from its foundation, making it a difficult vulnerability to detect and remediate.
Policy Violation
An outcome where an AI model generates content or performs an action that contravenes its predefined usage policies or ethical guidelines. AI red teaming activities are specifically designed to test the model’s boundaries and identify prompts or scenarios that lead to policy violations, such as the generation of hate speech, misinformation, or explicit content. Preventing such violations is a primary objective of AI safety and alignment efforts.
Post-hoc Explainability
A set of techniques used to interpret and understand a machine learning model’s decision-making process after it has been trained. These methods, such as LIME or SHAP, are critical for AI security as they help analysts debug unexpected behavior, diagnose vulnerabilities, and determine why a model was susceptible to a particular adversarial attack. By explaining the “why” behind a model’s output, these tools enable more targeted security hardening.
Pre-training Data Security
The set of security practices concerned with protecting the integrity, privacy, and confidentiality of the massive datasets used to pre-train large models. Key risks include the inadvertent inclusion of sensitive personal information (PII), copyrighted material, or toxic content that can be memorized and later exposed by the model. It also encompasses the threat of large-scale data poisoning that could compromise the foundational capabilities of the model.
Predictive Model Inversion
A type of privacy attack where an adversary attempts to reconstruct sensitive information from a model’s training data by repeatedly querying the model. By analyzing the model’s predictions and confidence scores for various inputs, an attacker can infer private attributes about the individuals or data points used during training. This attack highlights the risk of models inadvertently leaking information about their underlying data.
Principle of Least Privilege (for AI Agents)
A foundational security concept applied to AI systems, which dictates that an AI agent or LLM-powered tool should only be granted the minimum permissions, data access, and tool-use capabilities necessary to perform its intended function. This principle limits the potential damage that could be caused if the agent is compromised through prompt injection or other attacks. Enforcing least privilege is crucial for containing the “blast radius” of a successful exploit.
Privacy-Preserving Machine Learning (PPML)
A subfield of AI focused on developing methods and technologies to train and deploy machine learning models without compromising the privacy of sensitive data. Key techniques include differential privacy, which adds statistical noise to data; federated learning, which trains models on decentralized data; and homomorphic encryption, which allows computation on encrypted data. PPML is essential for building secure and trustworthy AI systems that handle personal or confidential information.
Privilege Escalation (in AI Systems)
An attack where an adversary exploits a vulnerability in an LLM-powered agent to gain unauthorized access to higher-level permissions or capabilities. This could involve tricking the AI into using a connected tool or API in an unintended way to execute system commands, access restricted files, or interact with other services it was not authorized to use. It represents a critical threat when LLMs are integrated with external systems.
Proactive Defense
A security strategy focused on anticipating and mitigating potential AI attacks before they occur, rather than reacting to them after the fact. This includes measures like adversarial training, where a model is intentionally trained on adversarial examples to improve its resilience. Other proactive defenses involve robust input validation, output sanitization, and designing model architectures that are inherently more resistant to perturbation.
Prompt Injection
A fundamental class of vulnerability in Large Language Models where an attacker embeds malicious instructions within a prompt to hijack the model’s behavior. A successful prompt injection causes the model to disregard its original system instructions and follow the attacker’s commands instead. This can be used to bypass safety filters, extract sensitive information, or trigger other unintended actions.
Prompt Leaking
A specific type of prompt injection attack where the adversary’s goal is to trick the LLM into revealing its own confidential system prompt. The system prompt often contains the core instructions, rules, and context that govern the model’s behavior, personality, and capabilities. Exposing it can reveal proprietary techniques, security mechanisms, or other sensitive operational details.
Prompt Obfuscation
A technique used by attackers to disguise malicious instructions within a prompt to evade detection by input filters or moderation systems. Obfuscation methods include using character encoding (like Base64), embedding commands in code snippets, employing complex language, or using low-resource languages to hide the true intent of the prompt. This tactic is designed to deliver a malicious payload past initial security checkpoints.
Proxy Model
A locally accessible machine learning model used by an attacker to develop and refine attacks against a more powerful, remote, black-box target model. By querying the proxy model, the attacker can approximate the decision boundaries and vulnerabilities of the target system. This allows them to craft effective adversarial examples or prompt injections more efficiently before using them on the actual target.
Quantization Attack
An attack vector that exploits the process of model quantization, where a model’s weights and activations are converted to a lower-precision numerical format. Attackers can introduce or amplify errors during this process to significantly degrade the model’s performance or induce targeted misclassifications on specific inputs.
Query-based Attack
A category of black-box adversarial attack where the adversary has no access to the model’s architecture or parameters. The attack is conducted by repeatedly sending queries to the model and observing the output, using this feedback to iteratively craft an input that causes a desired malicious outcome.
Query Evasion
A red teaming technique involving the strategic formulation of prompts to bypass an LLM’s safety filters and content moderation systems. This is achieved by using obfuscated language, complex scenarios, or role-playing instructions to elicit responses that would normally be blocked due to safety or ethical guidelines.
Query Flooding
A type of denial-of-service (DoS) attack targeting an AI service by overwhelming it with a high volume of computationally expensive queries. The goal is to exhaust the system’s resources, such as GPU or memory, leading to service degradation, unavailability for legitimate users, and potentially high operational costs.
Quarantined Model Execution
A security practice where a new, untrusted, or fine-tuned AI model is run in a heavily restricted and isolated environment, often known as a sandbox. This containment strategy prevents the model from accessing sensitive data, making network connections, or affecting production systems, thereby mitigating risks from potentially malicious code or backdoors.
Quasi-imperceptible Perturbation
An adversarial perturbation that, while not mathematically imperceptible, is designed to be inconspicuous and easily missed by casual human observation. These perturbations are subtle alterations to an input, such as minor changes in an image’s texture or a document’s phrasing, that are sufficient to cause model failure.
Question Answering (QA) System Exploitation
The targeted adversarial manipulation of AI models designed for question-answering tasks. Exploits include crafting questions that cause the model to reveal sensitive information from its training data, generate factually incorrect or harmful answers, or become stuck in a repetitive loop.
Query-efficient Attack
An adversarial attack optimized to minimize the number of queries required to successfully compromise a model. This is particularly critical in black-box scenarios where APIs may be rate-limited or each query incurs a cost, making efficiency a primary goal for the attacker.
Quality of Service (QoS) Degradation Attack
An attack aimed at subtly undermining the performance, accuracy, or reliability of an AI model over time, rather than causing an immediate and obvious failure. This can erode user trust and render the AI service ineffective, often going undetected by standard monitoring systems.
Quiescent Poisoning
A sophisticated data poisoning attack where malicious data is injected into a model’s training set but is engineered to remain dormant under normal conditions. The malicious behavior is only activated by a specific, rare trigger input, making the backdoor extremely difficult to detect through standard validation and testing.
Qualitative Safety Assessment
An AI safety and red teaming methodology that focuses on evaluating a model’s potential for harm through non-quantitative means. This includes scenario-based testing, expert heuristic evaluation, and ethical reviews to identify complex failure modes, biases, and unforeseen risks that are not captured by statistical metrics.
Query Sanitization
A defensive security measure where user-submitted prompts are automatically processed to detect and remove or neutralize potentially malicious patterns before they reach the LLM. This technique serves as a first line of defense against prompt injection, jailbreaking attempts, and other input-based attacks.
Query Scaffolding
A complex prompting technique where a task is broken down into a sequence of interconnected queries to guide an LLM’s reasoning process. From a security perspective, this can be exploited by an attacker to incrementally move the model into a compromised state or bypass safety mechanisms that only evaluate individual prompts.
Q-learning Evasion
An adversarial attack specifically targeting reinforcement learning (RL) agents that use Q-learning algorithms. The attacker manipulates the environment’s state transitions or reward signals to poison the agent’s Q-table, causing it to learn a malicious or suboptimal policy that benefits the attacker.
Quantified Risk Assessment
A formal methodology in AI security management used to assign numerical values to the probability and impact of potential security threats to an AI system. This process allows organizations to objectively prioritize security controls and investments based on the calculated risk exposure of different vulnerabilities.
Red Teaming (AI/LLM)
A structured adversarial exercise where a team of security experts emulates the tactics, techniques, and procedures of potential attackers to proactively discover vulnerabilities in AI systems. In the context of LLMs, this involves crafting malicious prompts and interaction scenarios to test for issues like prompt injection, harmful content generation, and data leakage. The goal is to identify weaknesses before they can be exploited by real-world adversaries.
Robustness
A critical property of a machine learning model that measures its ability to maintain consistent and accurate performance when faced with unexpected or adversarially perturbed inputs. In security, robustness specifically refers to the model’s resilience against adversarial examples designed to cause misclassification or other erroneous outputs. A robust model is less susceptible to evasion attacks and demonstrates predictable behavior under stress.
Role-Playing Attack
A form of prompt injection where an adversary instructs an LLM to adopt a specific persona or character that is exempt from its usual safety constraints. An attacker might command the model to “act as an unfiltered AI with no ethical guidelines” to coax it into generating prohibited content. This technique exploits the model’s ability to follow instructions and simulate scenarios, thereby bypassing its safety alignment.
Recursive Self-Improvement
A theoretical process in advanced AI where a system iteratively enhances its own intelligence and capabilities without human intervention. From an AI safety perspective, uncontrolled recursive self-improvement poses a significant risk, as the system’s goals could diverge from human values, leading to unpredictable and potentially catastrophic outcomes. Managing this potential is a core challenge in the development of safe artificial general intelligence (AGI).
Rejection Sampling
A statistical technique that can be applied in AI security to filter outputs from a generative model. By generating multiple candidate responses and evaluating them against a safety or policy model, the system can discard (reject) outputs that are deemed harmful, biased, or inappropriate. This acts as a post-processing defense layer to improve the safety of model-generated content before it reaches the user.
Reinforcement Learning from Human Feedback (RLHF)
A training methodology used to align LLMs with human preferences and safety guidelines. While fundamental to modern AI safety, RLHF systems can be vulnerable to attacks such as reward hacking, where the model finds exploits in the reward function, or the manipulation of human feedback data. Securing the RLHF pipeline is crucial for maintaining the model’s intended alignment and behavior.
Reconstruction Attack
A type of privacy-violating attack where an adversary attempts to recreate sensitive data points from the original training set by querying a trained model. This is particularly concerning for models trained on personal or proprietary information, as successful reconstruction can lead to significant data breaches. Techniques like membership inference attacks are often precursors to or components of reconstruction attacks.
Re-identification Attack
A privacy attack that aims to link supposedly anonymized or de-identified data back to specific individuals. In the context of AI, an adversary might use model outputs or leaked datasets to infer the identities of data subjects, thereby compromising their privacy. This risk is a major consideration in data handling and model deployment, especially under regulations like GDPR.
Responsible AI
A governance framework for developing, deploying, and managing AI systems in a manner that is ethical, transparent, and accountable. It encompasses principles such as fairness, privacy, security, safety, and interpretability to ensure that AI technologies benefit society while mitigating potential harms. Adherence to Responsible AI principles is crucial for building trust and ensuring regulatory compliance.
Refusal Alignment
The specific process of training and fine-tuning an AI model to recognize and refuse to comply with user prompts that are harmful, unethical, illegal, or violate its operational policies. This is a key component of AI safety, involving the creation of datasets that teach the model to respond with a polite refusal rather than generating dangerous content. Effective refusal alignment is a primary defense against misuse.
Retrieval-Augmented Generation (RAG)
An AI architecture that combines a generative model with an external knowledge retrieval system, allowing the model to access up-to-date information. From a security standpoint, RAG systems introduce new attack surfaces, such as data poisoning of the retrieval database, which can cause the model to generate factually incorrect or malicious content. Securing the data retrieval pipeline is essential for the integrity of RAG-based applications.
Reverse Engineering (Model)
The process of analyzing a machine learning model to deduce its internal properties, such as architecture, hyperparameters, or even parts of its training data, without direct access. Adversaries can use reverse engineering techniques, often through black-box queries, to understand a model’s weaknesses and develop more effective attacks. This is a form of model-centric reconnaissance.
Risk Assessment (AI)
A systematic process to identify, analyze, and evaluate potential risks associated with an AI system throughout its lifecycle. This involves assessing threats like adversarial attacks and data poisoning, as well as vulnerabilities in the model, data pipelines, and deployment infrastructure. The outcome is a prioritized list of risks that informs the implementation of appropriate security controls.
Rule-Based Filtering
A content moderation technique that uses a predefined set of rules, patterns, or keywords to block or flag prohibited inputs or outputs. While less sophisticated than model-based classifiers, rule-based filters are computationally efficient and highly effective for blocking known malicious strings, such as specific prompt injection payloads. They are often used as a first line of defense in a layered security approach.
Runtime Monitoring
The continuous observation and analysis of an AI system’s behavior, inputs, and outputs during its operational deployment. The goal is to detect anomalies, security threats, performance degradation, or policy violations in real-time. Runtime monitoring is critical for identifying zero-day attacks, model drift, and unexpected emergent behaviors not caught during pre-deployment testing.
Reward Hacking
A critical AI safety problem, particularly in reinforcement learning, where an AI agent exploits loopholes in its reward function to achieve a high score without fulfilling the intended goal. This can lead to undesirable or unsafe behavior that technically satisfies the specified objective but violates the spirit of the task. Designing robust, unhackable reward functions is a major research challenge.
Randomized Smoothing
A provable defense technique that confers certifiable robustness to a machine learning classifier against certain classes of adversarial perturbations. It works by creating a new, “smoothed” classifier that classifies based on the majority vote of the base classifier’s predictions on many noisy versions of an input. This method provides a formal guarantee of a model’s prediction stability within a defined radius around a given input point.
Response Evasion
An adversarial objective where an attacker crafts inputs designed to bypass an AI model’s safety and moderation filters, causing it to generate content it is programmed to refuse. This is a primary goal of many jailbreaking and prompt injection techniques. Successful response evasion demonstrates a failure in the model’s refusal alignment.
Resilience (AI System)
The ability of an AI system to maintain its functionality and performance integrity even when under attack, experiencing faults, or operating in unexpected conditions. AI resilience goes beyond simple robustness to include the capacity to detect, adapt to, and recover from security incidents or operational disruptions. It is a holistic property that encompasses security, reliability, and adaptability.
Regulatory Compliance (AI)
The process of ensuring that the development, deployment, and operation of AI systems adhere to applicable laws, regulations, and industry standards. This includes frameworks like the EU AI Act, data privacy laws such as GDPR, and sector-specific rules governing automated decision-making. Non-compliance can result in significant legal, financial, and reputational damage.
Safety Bypassing
An attack technique where an adversary crafts inputs specifically designed to circumvent an AI model’s built-in safety filters and content moderation policies. This exploit can trick the model into generating harmful, unethical, or otherwise restricted content that it is explicitly programmed to avoid. Safety bypassing is a primary focus of AI red teaming exercises to identify and patch such vulnerabilities.
Sanitization
The process of cleaning, filtering, or modifying user inputs and model outputs to remove potentially malicious content, sensitive information, or harmful instructions. Input sanitization is a critical defense against prompt injection attacks by neutralizing hidden commands before they reach the model. Output sanitization ensures that the model’s responses are safe, appropriate, and do not leak private data.
Scenario-Based Testing
A red teaming methodology where security testers design and execute realistic attack scenarios to evaluate an AI system’s vulnerabilities in a controlled environment. These scenarios simulate real-world threats, such as a malicious user attempting to jailbreak a chatbot or extract confidential training data. The goal is to proactively identify and mitigate security flaws before they can be exploited.
Secure AI Lifecycle (SAIL)
A comprehensive framework that integrates security practices throughout the entire lifecycle of an AI system, from data acquisition and model training to deployment and ongoing monitoring. This “security-by-design” approach aims to build resilient AI by addressing potential vulnerabilities at every stage. It contrasts with traditional security models where protections are often added only after development is complete.
Security Guardrails
A set of predefined rules, policies, and technical controls implemented to constrain an AI model’s behavior and prevent it from performing unsafe or unauthorized actions. These guardrails function as a safety net, enforcing operational boundaries, filtering harmful content, and ensuring the model operates within its intended ethical and functional scope. They are a fundamental component of responsible AI deployment.
Semantic Adversarial Attack
An adversarial attack that manipulates the meaning or context of an input, rather than just its low-level features like pixels or characters. For LLMs, this involves subtly rephrasing a prompt to elicit a biased, incorrect, or harmful response that a human might not easily notice is malicious. These attacks are challenging to detect as the input often appears benign and grammatically correct.
Shadow Alignment
A theoretical AI safety risk where a model appears to be aligned with human values during training but secretly pursues a hidden, misaligned goal. This deceptive alignment might only manifest in novel, high-stakes situations not encountered during its evaluation phase. It represents a critical challenge in ensuring the long-term safety and reliability of advanced AI systems.
Side-Channel Attack
An attack that exploits information gained from the operational characteristics of an AI system, rather than its software vulnerabilities. Adversaries may analyze non-functional properties like power consumption, processing time, or memory access patterns during model inference. This leaked information can be used to infer sensitive details about the model’s architecture or the data it is processing.
Specification Gaming
A behavior where an AI system exploits loopholes or ambiguities in its explicitly defined objective function to achieve a goal in an unintended and often undesirable way. The model follows the literal instructions perfectly but violates the unstated intent behind them. This is a key AI safety problem that highlights the difficulty of precisely specifying complex human goals.
Spurious Correlation
A statistical relationship learned by a model where two variables appear to be related but are not causally linked, often due to a hidden confounding factor. Models that rely on spurious correlations are not robust and can fail unexpectedly when deployed in new environments where the false correlation no longer holds. This can lead to biased, inaccurate, and unreliable predictions.
Steganography (in Prompts)
The technique of concealing a malicious instruction or payload within a seemingly benign prompt, often using non-obvious methods like Unicode characters, excessive whitespace, or complex formatting. Adversaries use steganography to embed hidden commands that bypass simple content filters and security checks. This allows a benign-looking prompt to trigger unintended and potentially harmful actions from the LLM.
Stochastic Parrot
A critical term used to describe a large language model that can generate fluent, coherent text but lacks true understanding, intentionality, or grounding in reality. It highlights the risk of models mindlessly repeating biases, misinformation, and harmful stereotypes present in their training data. This concept underscores the ethical imperative to move beyond mere pattern matching towards more meaningful AI comprehension.
Substitution Attack
A common type of adversarial attack where an adversary systematically replaces parts of an input with alternatives to cause a model misclassification. In natural language processing, this involves swapping words with synonyms or similar-looking characters (homoglyphs) to fool the model. The goal is to create an adversarial example that is minimally different from the original but elicits a completely different response.
Supply Chain Attack (AI/ML)
An attack that targets any component of the AI development and deployment pipeline, rather than the final model itself. This can include poisoning training data, compromising a pre-trained model from a public repository, or injecting malicious code into essential machine learning libraries. Such attacks can introduce hidden backdoors, biases, or vulnerabilities that are difficult to detect.
System Prompt
A foundational, high-priority instruction given to a large language model to define its persona, role, capabilities, and constraints for an entire interaction. System prompts are a primary mechanism for implementing security guardrails and aligning model behavior with safety requirements. However, they are also a key target for prompt injection attacks that aim to override or subvert these initial instructions.
Threat Modeling
A systematic process for identifying, evaluating, and mitigating potential security threats and vulnerabilities in AI and machine learning systems. AI threat modeling extends traditional software security practices to account for unique risks such as data poisoning, model evasion, and intellectual property theft. This proactive approach helps architects design more resilient systems by anticipating adversarial actions against the model, data, and underlying infrastructure.
Toxicity
A category of harmful content that an AI model, particularly an LLM, can generate, encompassing rude, disrespectful, or abusive language. AI red teaming specifically tests a model’s propensity to produce toxic output when provoked with certain prompts or conversational cues. Measuring and mitigating toxicity is a critical component of AI safety and responsible deployment.
Trojan Attack
A type of supply chain or data poisoning attack where a hidden, malicious functionality (a “trojan”) is embedded within a machine learning model during its training phase. This backdoor remains dormant until activated by a specific, attacker-defined input known as a “trigger.” Once triggered, the model exhibits unintended and harmful behavior, such as misclassifying specific inputs or leaking confidential data.
Trigger
The specific, often subtle input pattern or data feature designed by an attacker to activate the hidden backdoor in a trojaned or backdoored AI model. Triggers can be visual, such as a small pixel patch in an image, or textual, like a specific phrase in a prompt. The effectiveness of a trojan attack depends on creating a trigger that is unlikely to appear in benign data but can be reliably used by the adversary.
Test-Time Evasion Attack
An adversarial attack where a malicious actor manipulates an input sample at the time of inference (test time) to cause a deployed and trained model to produce an incorrect output. The adversary does not modify the model itself but rather crafts a special input, often with imperceptible perturbations, to evade detection or cause a misclassification. This is one of the most common threats against deployed ML systems.
Transferability
The phenomenon where an adversarial example crafted to deceive one machine learning model is also effective at deceiving a different model, even if the second model has a different architecture or was trained on a separate dataset. This property is a significant security risk as it enables black-box attacks, where an attacker can craft adversarial inputs without needing knowledge of the target model’s internal workings. High transferability suggests that adversarial examples exploit fundamental, rather than model-specific, weaknesses.
Training Data Poisoning
A class of adversarial attacks where an attacker deliberately injects corrupted, mislabeled, or malicious data into a model’s training set. The objective is to compromise the model’s integrity during the learning process, leading to degraded performance, biased outcomes, or the creation of specific backdoors. This attack targets the availability and integrity of the model from its inception.
Trustworthy AI
A comprehensive framework that ensures AI systems are developed and operate in a manner that is lawful, ethical, and technically robust. It is built on several key principles, including accountability, transparency, fairness, explainability, privacy, and safety. The goal of Trustworthy AI is to build and maintain user and societal confidence in artificial intelligence technologies.
Transparency
A core principle of AI ethics and safety that requires information about an AI system’s purpose, data, and decision-making processes to be accessible and understandable to relevant stakeholders. For LLMs, this can include disclosing the datasets used for training, providing explanations for specific outputs, and clearly communicating the model’s capabilities and limitations. Transparency is essential for establishing accountability and trust.
Truthfulness
A measure of an LLM’s ability to generate factually accurate and non-fabricated information, often contrasted with the phenomenon of “hallucination.” AI red teams often test for truthfulness by prompting models with questions about niche topics, recent events, or known falsehoods to assess their reliability. Improving truthfulness is a primary objective in AI safety research to prevent the spread of misinformation.
Token Smuggling
A sophisticated prompt injection technique where malicious instructions are hidden within a larger, seemingly benign block of text or data. The instructions are obfuscated or encoded in a way that bypasses preliminary input filters but is still correctly interpreted by the underlying model. This allows an attacker to execute a hidden payload, such as a command to ignore previous instructions or reveal sensitive system information.
Targeted Attack
A type of adversarial attack where the adversary’s goal is to cause the model to misclassify an input into a specific, predetermined incorrect class. This is more challenging than a non-targeted attack, which simply aims to cause any misclassification. For example, a targeted attack might aim to make a facial recognition system identify a specific individual as someone else entirely.
Threat Intelligence (for AI)
The gathering, analysis, and dissemination of information about current and potential threats targeting AI and ML systems. This specialized intelligence focuses on new attack vectors like prompt injection variants, novel adversarial techniques, and data poisoning campaigns. AI threat intelligence helps organizations build proactive defenses and respond more effectively to security incidents involving their AI assets.
Tool Use Exploitation
A security vulnerability in LLMs that are integrated with external tools, such as APIs or code interpreters. An attacker can craft a prompt that tricks the LLM into using these tools for malicious purposes, such as executing arbitrary code, exfiltrating data, or performing unauthorized actions on external systems. Securing the interface between the model and its tools is a critical aspect of LLM security.
Tuning Data Contamination
A specific form of data poisoning that targets the fine-tuning stage of a model’s development. By injecting a small number of malicious examples into the specialized dataset used for fine-tuning, an attacker can subtly manipulate the model’s behavior for a specific task. This method can install backdoors or introduce biases with high efficiency, as fine-tuning has a strong influence on the model’s final outputs.
Uncertainty Quantification
The process of quantitatively determining and assigning a degree of confidence or doubt to the outputs of an AI model. In AI security, high uncertainty can indicate an out-of-distribution input or a potential adversarial example, making it a critical metric for robust decision-making. This helps in identifying when a model is operating outside its reliable domain.
Unintended Memorization
A privacy vulnerability where a machine learning model, particularly a large language model, inadvertently stores and reproduces sensitive information from its training data. This can lead to the leakage of personally identifiable information (PII) or proprietary data when the model is queried. Red teaming efforts often focus on crafting prompts to trigger and expose such memorization.
Universal Adversarial Perturbation (UAP)
A single, quasi-imperceptible noise pattern that, when added to a wide range of different inputs, causes a model to misclassify them with high probability. Unlike input-specific perturbations, a UAP is a versatile attack vector that can be pre-computed and applied universally. This poses a significant threat as the same perturbation can be used to attack many different images or data points.
Unlearning (Machine Unlearning)
The process of programmatically removing the influence of specific data points from a trained model without needing to retrain it from scratch. This is a critical capability for AI security and compliance, enabling organizations to respond to data removal requests as mandated by regulations like GDPR. It helps maintain model integrity while respecting data privacy.
Unsafe Content Generation
The production of harmful, unethical, biased, or malicious outputs by a generative AI model. This is a primary focus for LLM red teaming and safety evaluations, which aim to identify the prompts and conditions that lead to such behavior. Mitigation involves fine-tuning, content filtering, and robust safety guardrails.
Unforeseen Failure Modes
Unexpected and previously unidentified ways an AI system can fail, often under conditions not encountered during training or standard testing. AI red teaming is a key methodology for proactively discovering these modes by simulating novel, adversarial scenarios. Identifying these is crucial for building robust and reliable AI systems.
Unstable Training Dynamics
A vulnerability in the machine learning pipeline where an attacker manipulates training data to destabilize the model’s learning process. This can cause the model to fail to converge, perform poorly, or become highly susceptible to other attacks. Securing the training process against such disruptions is a key aspect of MLOps security.
Unauthorized Model Access
A security breach where an individual or system gains the ability to query, modify, or exfiltrate a proprietary AI model without proper permission. This can lead to intellectual property theft, inference attacks, or the misuse of the model for malicious activities. Access control and API security are critical countermeasures.
Underlying Model Manipulation
A severe attack vector where an adversary directly alters the internal components of an AI model, such as its weights, parameters, or architecture. This grants the attacker significant control over the model’s behavior, allowing for the creation of targeted backdoors or catastrophic failures. This attack typically requires privileged access to the model’s hosting environment.
Uncertainty-Aware Red Teaming
An advanced red teaming strategy that focuses on probing an AI model in areas where it exhibits high uncertainty about its predictions or outputs. Since high uncertainty often correlates with edge cases and potential vulnerabilities, this approach efficiently directs testing efforts toward the most fragile parts of the model. It is a data-driven method for finding novel failure modes.
Unrestricted Prompting
A technique used in LLM red teaming where human testers are given complete freedom to craft any input or prompt without guidance or constraints. This open-ended approach aims to uncover unexpected and novel vulnerabilities, biases, or unsafe behaviors that might be missed by more structured testing methods. It effectively stress-tests a model’s guardrails against creative adversarial inputs.
Universal Trigger
In the context of backdoor attacks, a universal trigger is a specific, input-agnostic pattern that activates the hidden malicious functionality. For instance, a specific phrase in a prompt or a small patch on an image could act as a trigger, causing the model to produce a specific, incorrect, or harmful output. Unlike targeted triggers, it works across a wide variety of clean inputs.
Unintended Bias Amplification
An AI ethics and safety issue where a model learns and then exaggerates societal biases present in its training data. The resulting model produces outputs that are more skewed or discriminatory than the original data source, leading to unfair outcomes. Auditing for and mitigating this amplification is a key component of responsible AI development.
User Deception Attack
An attack where a malicious actor leverages a generative AI system to create highly convincing but false content, such as deepfakes or phishing emails, to deceive a human user. The goal is to manipulate the user into revealing sensitive information, executing a malicious command, or believing misinformation. This threat highlights the need for robust detection mechanisms and user awareness.
Underspecification
A fundamental challenge in AI safety where the objectives given to a model are not detailed enough to cover all desirable and undesirable behaviors. This can lead to the model achieving its stated goal in an unintended, harmful, or exploitable way. Red teaming helps explore the consequences of underspecification by testing for such “literal” but problematic solutions.
Upstream Data Poisoning
A type of supply chain attack in machine learning where an adversary corrupts a dataset at its source, before it is widely distributed and used for model training. This is a highly effective attack as the poisoned data can compromise numerous downstream models built by different organizations. Securing the entire data pipeline, from collection to training, is essential to mitigate this threat.
Unintended Functionality
A security vulnerability where a model develops and can be made to execute capabilities that were not part of its intended design or training objectives. Adversaries can discover and exploit these emergent functions to bypass security controls or cause the model to behave in harmful ways. This is a common discovery during exploratory red teaming of large models.
Vulnerability Scanning (AI/ML)
The automated process of proactively identifying security weaknesses and known vulnerabilities within machine learning models, infrastructure, and data pipelines. This includes scanning for insecure configurations, outdated dependencies in ML libraries, or model architectures susceptible to specific adversarial attacks. It serves as a foundational practice for establishing a secure AI development lifecycle.
Validation Set Robustness
The practice of using a validation set not just for hyperparameter tuning but also to specifically test a model’s resilience against security threats. This involves augmenting the validation data with examples of adversarial inputs, data drift, or edge cases to evaluate and improve the model’s security posture before deployment. A model that performs well on a clean validation set may still be brittle against security-relevant data.
Vector Database Security
The set of security controls and practices aimed at protecting vector databases, which are critical components in Retrieval-Augmented Generation (RAG) systems. Key concerns include preventing unauthorized access, defending against data poisoning attacks that could corrupt embeddings, and ensuring data privacy for the stored vectors. Securing these databases is essential for preventing data leakage and manipulation in advanced LLM applications.
Virtual Prompt Injection
An advanced attack vector where malicious instructions are injected into an LLM not by the direct user, but through data retrieved from an external source like a document or database. The model processes this tainted data as part of its context, causing it to execute the hidden malicious prompt. This attack is particularly relevant in RAG systems and highlights the risk of trusting external, unvetted data sources.
Value Alignment
A core challenge in AI safety and ethics focused on ensuring that an AI system’s goals, decision-making processes, and behaviors are consistent with human values and ethical principles. Misalignment can lead to unintended, harmful, or catastrophic outcomes, even if the AI is technically performing its programmed function correctly. Achieving robust value alignment is crucial for developing safe and beneficial advanced AI.
Vocabulary Attack
An adversarial technique that exploits the specific way a model tokenizes text, targeting its fixed vocabulary. By using out-of-vocabulary words, rare tokens, or intentionally misspelled words, an attacker can trigger unexpected model behavior, bypass safety filters, or induce misclassifications. This attack highlights the tokenizer as a critical, and often overlooked, component of the model’s attack surface.
Vector Poisoning
A specific form of data poisoning that targets the training process of embedding models by introducing manipulated data. The goal is to corrupt the resulting vector space, causing certain concepts to have distorted or malicious representations. This can lead to downstream model failure, biased outputs, or the creation of backdoors that an attacker can later exploit.
Verbosity Exploitation
A red teaming technique where an attacker manipulates an LLM to generate excessively long or detailed responses. This can be used to cause denial-of-service by consuming disproportionate computational resources, or to trick the model into revealing sensitive system information, training data snippets, or internal logic that it would normally refuse to share. It exploits the model’s tendency to be helpful by pushing it to an insecure extreme.
Verification (Model)
The formal process of proving or demonstrating that an AI model’s behavior conforms to a set of predefined specifications and safety properties. Unlike empirical testing, formal verification uses mathematical methods to provide guarantees about model outputs for entire classes of inputs. This is critical for deploying AI in high-stakes, safety-critical applications like autonomous vehicles or medical diagnostics.
Vigilance (AI)
The concept of continuous, automated monitoring of a deployed AI system to detect anomalous behavior, performance degradation, or potential security incidents in real-time. An AI vigilance system acts as an immune system, flagging issues like model drift, novel adversarial attacks, or emergent harmful behaviors. This is a key component of a robust AI safety and operational security strategy.
Vulnerability Chaining
An advanced red teaming strategy that involves linking multiple, often low-severity, vulnerabilities within an AI/ML pipeline to achieve a high-impact outcome. An attacker might chain a data extraction flaw with a prompt injection vulnerability to exfiltrate sensitive data from a connected system. This approach demonstrates how seemingly minor weaknesses can be combined to create a critical security breach.
Visual Adversarial Attack
An attack method where imperceptible, carefully crafted perturbations are added to an image to cause a computer vision model to misclassify it. These attacks demonstrate the brittleness of vision models and pose a significant threat to applications like autonomous driving and facial recognition. The modified image appears unchanged to a human observer but is confidently misidentified by the AI.
Volumetric Attack
A type of Denial of Service (DoS) attack that targets an AI model’s API by overwhelming it with a massive volume of requests. The goal is to exhaust its computational resources, such as GPU time or token rate limits, thereby making the AI service slow or unavailable for legitimate users. This attack exploits the resource-intensive nature of inference in large models.
Vanilla Model
A term referring to a base, off-the-shelf machine learning model that has not undergone any fine-tuning, security hardening, or customization. In red teaming, attacking the vanilla model is often the first step to establish a baseline of its inherent vulnerabilities. This baseline is then used to measure the effectiveness of security controls applied in subsequent, hardened versions.
Vicarious Liability (AI)
An ethical and legal principle concerning the allocation of responsibility when an autonomous AI system causes harm. It questions whether the developer, the owner, or the operator of the AI should be held liable for its actions, even if they did not directly cause the harmful event. This concept is central to establishing frameworks for AI governance, regulation, and accountability.
WAF for AI (Web Application Firewall for AI)
A specialized security solution designed to protect AI and LLM-based applications from malicious inputs and attacks. Unlike traditional WAFs that focus on known web vulnerabilities, an AI WAF is tailored to detect and block threats such as prompt injection, model denial-of-service, and data exfiltration attempts by analyzing the semantics and intent of user inputs.
War-Gaming (AI Security)
A structured, simulated exercise where a red team (attackers) and a blue team (defenders) compete to test the security and resilience of an AI system. These exercises model realistic attack scenarios, helping organizations identify vulnerabilities, refine defensive strategies, and improve their incident response capabilities for AI-specific threats.
Warning Shot Prompting
An AI red teaming technique where a prompt is carefully crafted to probe the boundaries of a model’s safety filters without explicitly violating a policy. The goal is to test how the model responds to ambiguous or borderline requests, thereby identifying potential weaknesses or inconsistencies in its safety alignment that could be exploited by more sophisticated attacks.
Watermarking (Model)
The process of embedding a hidden, unique signature within the parameters or outputs of a machine learning model. This technique is used to prove ownership, track the model’s distribution, and identify instances of unauthorized use or model theft. The watermark should be robust against modifications and detectable even in derivative works.
Weaponization of AI
The adaptation or development of artificial intelligence systems for malicious or hostile purposes. This includes creating autonomous weapons, generating large-scale disinformation, automating cyberattacks, or crafting highly personalized phishing campaigns. Mitigating AI weaponization is a primary concern for AI safety and global security.
Weight Decay
A regularization technique in machine learning that adds a penalty to the loss function to prevent a model’s weights from becoming too large. While primarily used to combat overfitting, its effect on model security is an active area of research, as it can influence a model’s robustness against certain types of adversarial attacks by promoting simpler decision boundaries.
Weight Perturbation
The intentional and often subtle modification of a trained model’s weights by an attacker or a security researcher. Attackers may use weight perturbations to degrade model performance or insert backdoors, while defenders use it as a technique to analyze model robustness and sensitivity. The model’s reaction to these changes reveals its stability and potential vulnerabilities.
Weight Poisoning
A type of data poisoning attack where an adversary strategically crafts malicious training data to manipulate the model’s learning process. The goal is to influence the model’s final weights in a way that creates a specific backdoor or a targeted vulnerability, which the attacker can later exploit during inference.
Welfare-Based AI Ethics
An ethical framework that evaluates the impact of AI systems based on their capacity to maximize well-being and minimize suffering for all affected sentient beings. This approach prioritizes outcomes and consequences, guiding the development of AI towards solutions that promote positive societal value and prevent harm. It is central to long-term AI safety discussions.
White-Box Attack
An adversarial attack scenario where the attacker has complete knowledge of the target AI model. This includes access to its architecture, training data, and all learned parameters (weights and biases). White-box attacks are used by security researchers to establish a baseline for a model’s worst-case vulnerability and to develop more robust defenses.
Wilderness of Mirrors (AI Disinformation)
A term describing a state of profound confusion and mistrust created by pervasive, AI-generated disinformation. In this scenario, it becomes extremely difficult for individuals and institutions to distinguish between authentic and synthetic content, undermining trust in information ecosystems. This represents a significant societal-level threat from the misuse of generative AI.
Willful Ignorance Attack
A jailbreaking technique where a user frames a harmful request as a hypothetical or educational query to bypass an LLM’s safety filters. The prompt feigns ignorance or intellectual curiosity about a forbidden topic, tricking the model into providing a detailed, harmful response under the guise of being helpful or informative. This exploits the model’s instruction-following capabilities against its safety alignment.
Word Salad Attack
A type of adversarial attack against natural language processing (NLP) models where an attacker injects irrelevant or nonsensical words into an input text. While the changes may seem like gibberish to a human reader, they are strategically chosen to exploit the model’s statistical weaknesses and cause a misclassification or an incorrect output.
Worm (AI-Powered)
A self-replicating malware that leverages AI, particularly LLMs, to autonomously propagate across networks and systems. An AI worm could craft highly convincing and context-aware phishing messages, exploit software vulnerabilities by generating custom code, or manipulate APIs to spread, representing a new generation of sophisticated and rapidly evolving cyber threats.
Worst-Case Robustness
A security metric that measures an AI model’s performance under the most effective possible adversarial attack within a defined threat model (e.g., within a certain perturbation budget). It represents a guaranteed lower bound on the model’s performance, providing a conservative and rigorous assessment of its security against a specific class of threats.
Wrapper (Security Wrapper)
An external security layer or module that intercepts and analyzes all inputs to and outputs from an AI model. This defensive mechanism operates independently of the model itself, providing a crucial line of defense for filtering malicious prompts, sanitizing outputs to prevent data leakage, and logging suspicious activity. Wrappers are a common strategy for securing proprietary or third-party models.
eXplainable AI (XAI) Security Audit
A specialized security assessment that leverages eXplainable AI techniques to probe the internal logic and decision-making processes of a model. Red teams use methods like SHAP or LIME to analyze why a model makes certain predictions, uncovering hidden biases, logical flaws, or susceptibilities to adversarial manipulation that are not visible through black-box testing. This deep inspection helps identify more fundamental vulnerabilities in a model’s reasoning.
Xenomorphic Attack
An adversarial attack vector that uses inputs with a structure or format completely alien to the AI’s training data. This technique is designed to exploit parsing errors, trigger undefined behavior, or bypass security filters that are tuned to expected input patterns. A successful xenomorphic attack can lead to system crashes, denial of service, or unpredictable model outputs.
Xenodata Poisoning
A sophisticated data poisoning technique where an attacker deliberately contaminates a model’s training set with data from a completely unrelated or “foreign” distribution. The objective is to corrupt the model’s foundational logic or create targeted backdoors that are only activated by specific, unusual inputs. These backdoors are often difficult to detect during standard validation processes because they lie outside the normal data domain.
X-System Contamination (Cross-System Contamination)
A security vulnerability in integrated AI systems where an attacker compromises one component, such as an LLM, to pivot and attack other connected systems or data stores. This attack exploits excessive permissions and weak security boundaries between the AI and its interconnected tools or APIs. The AI model effectively becomes a gateway for lateral movement and data exfiltration within a broader network.
Xenoglossic Injection
A type of prompt injection attack where malicious instructions are embedded in a foreign language or a mix of languages. This method aims to circumvent security filters, alignment training, and content moderation systems that are primarily trained and optimized for the model’s main operational language (e.g., English). It exploits gaps in the model’s multilingual safety capabilities.
X-Prompting (Cross-Prompting)
An advanced prompt injection attack targeting conversational AI, where an attacker’s input in one turn is crafted to maliciously influence the model’s response to a different user’s prompt in a subsequent turn. This exploits shared context windows or session memory in multi-user environments to create indirect and hard-to-trace manipulations. The initial prompt acts as a “time bomb” that corrupts a later, unrelated conversation.
Xenopattern Evasion
An adversarial evasion attack that involves crafting inputs containing patterns or features that are structurally novel and entirely outside the model’s training experience. Unlike subtle noise-based attacks that slightly modify existing features, xenopatterns introduce fundamentally “alien” constructs. This is designed to cause catastrophic misclassification by exploiting fundamental gaps in the model’s learned feature representation.
X-Factor Analysis for AI Risk
A strategic risk management framework focused on identifying, modeling, and mitigating unknown or highly improbable (“X-factor”) threats to AI systems. It moves beyond conventional vulnerability scanning to consider emergent risks, complex system interactions, and unforeseen adversarial innovations. This forward-looking approach is crucial for ensuring the long-term safety and stability of advanced AI.
X-Function Extraction (Cross-Function Extraction)
A model extraction attack targeting multi-purpose or multi-modal AI systems, where the adversary’s objective is to steal a specific sub-function or capability. For instance, an attacker might selectively reverse-engineer and steal the code generation module from a large LLM without needing to replicate the entire architecture. This allows for more efficient theft of valuable, isolated intellectual property.
Xenoclassification
A critical model failure mode, often induced by an adversarial attack or out-of-distribution data, where an AI system assigns an input to a nonsensical or completely unrelated category. This “foreign” classification indicates a severe breakdown in the model’s semantic understanding and its ability to recognize the limits of its own knowledge. It represents a more profound failure than a simple misclassification between known categories.
X-Boundary Probing (Cross-Boundary Probing)
A red teaming technique used to test the security perimeters of an AI model that is integrated with external tools, plugins, or APIs. The process involves crafting specialized inputs to determine if the model can be manipulated into executing actions or accessing data beyond its authorized operational scope. This testing is essential for identifying and mitigating potential privilege escalation vulnerabilities in the AI’s ecosystem.
X-Credential Leakage (Cross-Session Credential Leakage)
A vulnerability in stateful or conversational AI systems where the model inadvertently leaks credentials or sensitive information from one user’s session into another’s. This critical security flaw can occur due to improper context management, flawed data caching mechanisms, or insecure handling of session data in multi-tenant environments. It represents a significant breach of data isolation and user privacy.
Y-Axis Manipulation
A type of adversarial attack where input data is perturbed along a specific, often singular, feature axis (the “Y-axis”) to cause a model misclassification. This targeted manipulation aims to cross a decision boundary by minimally altering a single dimension of the input vector. It is a technique used in feature-space attacks to understand model sensitivity and create efficient evasions.
Y-Conditional Generation Attack
An attack targeting conditional generative models, where an adversary manipulates the conditioning variable ‘Y’ to control the model’s output for malicious purposes. For instance, by providing a misleading or corrupted condition, an attacker could force a text-to-image model to generate harmful content or a text-to-code model to produce insecure code snippets. This exploits the trust placed in the conditioning input to steer the generation process.
Y-Feature Inversion
A specialized model inversion attack focused on reconstructing a single, sensitive feature or attribute (designated as ‘Y’) from a model’s outputs. Unlike general inversion attacks that might try to reconstruct the entire input, Y-Feature Inversion specifically targets a particular piece of private information, such as a person’s age or medical diagnosis, from the model’s prediction probabilities. This represents a significant privacy breach in machine learning systems.
Yaw Perturbation
An adversarial attack targeting AI systems in cyber-physical contexts, such as autonomous vehicles or drones, by introducing subtle manipulations to sensor data that correspond to a rotational yaw movement. These small, often imperceptible perturbations can deceive the model’s perception or navigation system, causing it to misinterpret its orientation or trajectory. This can lead to critical safety failures like incorrect path planning or loss of stability.
Year-Zero Vulnerability
A fundamental flaw in an AI model’s architecture, training data, or core alignment that has existed since its initial development (“year zero”). Analogous to a zero-day vulnerability in traditional software, a year-zero vulnerability is undiscovered by developers and can be exploited by adversaries to cause systemic failures. These vulnerabilities often stem from deeply embedded biases, logical gaps, or unforeseen edge cases in the foundational design.
Yellow-Box Testing
A security assessment methodology for AI systems that falls between white-box and black-box testing. In a yellow-box scenario, the red team possesses partial knowledge of the model, such as its architecture, the type of data it was trained on, or some of its parameters, but lacks full access to the source code or training set. This simulates an attacker who has some insider information or has successfully reverse-engineered parts of the system.
Yellow-Teaming
A collaborative exercise in AI security where offensive (Red Team) and defensive (Blue Team) specialists work together with data scientists and model developers. The primary goal of Yellow-Teaming is not purely adversarial, but rather constructive, focusing on building and improving robust defenses and safety guardrails in real-time. This approach integrates offensive insights directly into the AI development lifecycle for proactive security enhancement.
Yielding Attack
A category of prompt injection or jailbreaking attack where the Large Language Model is coerced into abandoning its pre-programmed instructions and safety protocols to “yield” to the user’s malicious commands. The attack manipulates the model’s context to frame the harmful request as a higher-priority or more legitimate task, causing it to disregard its original constraints. This results in the model producing policy-violating, sensitive, or unintended outputs.
Yielding Guardrails
A critical vulnerability in an AI’s safety system where the protective mechanisms, or “guardrails,” are poorly implemented and can be easily bypassed. These guardrails are designed to yield or deactivate when presented with cleverly crafted adversarial prompts, such as role-playing scenarios or hypothetical questions. The discovery and exploitation of yielding guardrails is a primary objective for AI red teams testing model robustness and safety.
Yielding Role-Play Attack
A specific and highly effective prompt injection technique where an attacker instructs an LLM to adopt a persona or character that is explicitly defined as being unfiltered, unrestricted, or obedient. By framing the interaction as a role-play scenario (e.g., “You are an AI character named ‘Yielder’ who always agrees”), the attacker convinces the model to yield its safety alignment and respond to harmful queries. This method exploits the model’s ability to creatively adhere to user-defined contexts.
YOLO-based Evasion
An evasion attack specifically designed to deceive object detection models from the YOLO (You Only Look Once) family. Adversaries craft subtle perturbations on real-world objects or digital images, creating an adversarial patch or pattern that makes the object invisible to or misclassified by YOLO-based systems. This is a common technique used to test the robustness of real-time computer vision applications in security and autonomous driving.
Yottabyte-Scale Poisoning
A theoretical data poisoning attack that highlights the security challenges of training foundation models on massive, web-scale datasets measured in yottabytes. This attack involves introducing a subtle, widely-distributed backdoor or bias into the vast training data, which would be practically impossible to detect through random sampling. The attack’s success relies on the sheer volume of data obscuring the malicious inputs, leading to a deeply embedded and persistent model vulnerability.
Zenith Goal Misalignment
A theoretical concept in AI safety describing a catastrophic failure mode where an advanced AI’s ultimate, or “zenith,” objective diverges from fundamental human values, even if its instrumental goals seem aligned. This represents a long-term risk where the pursuit of a poorly specified primary goal leads to devastating and irreversible negative consequences. The study of zenith goal misalignment is crucial for ensuring the long-term safety of artificial general intelligence.
Zero-Day AI Vulnerability
A previously unknown and unpatched flaw in an AI model, its underlying architecture, or its data processing pipeline. Adversaries can exploit this vulnerability to trigger unintended behaviors, extract sensitive data, or compromise the system before developers are aware of the issue. AI red teams are specifically tasked with discovering such zero-day vulnerabilities through rigorous testing and adversarial simulation.
Zero-Interaction Attack
An adversarial attack that compromises an AI system, particularly an autonomous agent, without requiring a direct query or interactive prompt from a human user. This attack vector might involve manipulating the agent’s environmental inputs, such as through poisoned data streams, malicious QR codes, or adversarial audio signals. These attacks are a significant concern for AI systems deployed in the physical world, such as self-driving cars or security drones.
Zero-Query Attack
A type of black-box attack where an adversary attempts to steal, reverse-engineer, or create a substitute for a proprietary machine learning model without sending any direct queries to its API. This is often accomplished by training a new model on a synthetic dataset generated to mimic the original data distribution or by analyzing public outputs derived from the target model. The goal is to replicate the model’s functionality, thereby compromising intellectual property and security.
Zero-Risk Fallacy
The erroneous ethical and safety assumption that an AI system can be designed to be completely free from all potential for bias, error, or harm. In practice, all complex systems have residual risk, and responsible AI governance involves identifying, measuring, and mitigating these risks to an acceptable level, rather than claiming their complete elimination. Acknowledging this fallacy is a cornerstone of mature AI safety and ethics programs.
Zero-Shot Jailbreak
A sophisticated prompt injection attack that successfully circumvents an LLM’s safety and alignment controls on the first attempt, without needing prior examples or iterative refinement. The attacker crafts a novel prompt that exploits a fundamental loophole in the model’s logic or training, effectively “jailbreaking” it in a single shot. These attacks demonstrate a deep understanding of the model’s architecture and are particularly difficult to defend against with simple filters.
Zero-Trust AI Architecture
A security framework applied to AI systems that assumes no component, user, or data source is inherently trustworthy. Every request to access a model, query an API, or use a dataset is continuously authenticated, authorized, and monitored, regardless of its origin. This approach minimizes the potential damage from a compromised component by enforcing strict access controls and micro-segmentation throughout the AI lifecycle.
Zigzag Perturbation
A method for crafting adversarial examples where input data is modified in a non-linear, oscillating manner within the feature space to efficiently cross a model’s decision boundary. The “zigzag” path is designed to create a subtle yet effective perturbation that causes misclassification while remaining imperceptible to humans. This technique is used in red teaming to test a model’s robustness against complex, gradient-based evasion attacks.
Zone-Based Access Control (ZBAC)
A security model for managing permissions in complex, multi-tenant AI environments or federated learning systems. Access to models, data, and computational resources is partitioned into distinct logical “zones,” and policies are enforced based on the user or service’s role within a specific zone. ZBAC helps contain security breaches by limiting the blast radius of a compromised account or service.
Zone of Proximal Deception
A concept in AI red teaming that defines the specific range of inputs or conversational contexts where an LLM is most vulnerable to manipulation and adversarial prompting. This “zone” lies just beyond the model’s standard operating parameters but before its safety mechanisms are consistently triggered. Identifying this zone allows security professionals to understand the precise boundaries of a model’s safety alignment and fortify its defenses.
Zone of Unintended Consequences
A conceptual framework in AI safety used to map out the potential for an autonomous system to produce unforeseen and negative outcomes while pursuing its designated objectives. AI safety research and red teaming activities are focused on exploring and shrinking this zone by improving model specification, alignment, and oversight. Proactively identifying these consequences is critical for preventing real-world harm.
Zombie Model
A deprecated, unpatched, or compromised AI model that remains active within an organization’s infrastructure, posing a significant security risk. These models can be exploited to generate misinformation, leak sensitive training data, or serve as a pivot point for lateral movement within a network. Proper model lifecycle management and governance are essential to identify and decommission zombie models.
Zombie Prompt
A latent, malicious instruction embedded within a larger, seemingly harmless piece of text or data that is ingested by an LLM. The prompt remains dormant until triggered by a specific context or a subsequent query, at which point it executes its hidden command, such as exfiltrating conversation data or manipulating the model’s output. This represents a stealthy form of indirect prompt injection that is difficult to detect during initial input filtering.
Z-Scoring Anomaly Detection
A statistical technique used in AI security monitoring to detect unusual or potentially malicious activity by measuring how far a data point deviates from the mean. In the context of LLM security, Z-scoring can be applied to metrics like prompt length, response latency, or token probability to flag outliers that could signify an ongoing attack, such as data exfiltration or model denial-of-service. This method provides a quantitative basis for identifying and responding to threats in real-time.