23.3.3. Annotation tools and methods

2025.10.06.
AI Security Blog

Data annotation is not merely a prerequisite for training models; for a red teamer, it’s a weapon. The tools used to create clean, labeled datasets can be repurposed to craft the very vulnerabilities you seek to expose. Mastering these tools allows you to systematically generate poisoned data, create challenging edge cases, and probe for hidden biases with precision and scale.

Core Annotation Concepts for Red Teaming

Your objective isn’t to create a perfectly labeled dataset, but rather a strategically flawed one. This requires understanding the different methods of annotation and how to subvert them.

  • Manual Annotation: The classic approach where a human annotator labels each data point. While slow, it offers maximum control for creating highly specific and subtle adversarial examples. You can instruct annotators to follow malicious labeling rules (e.g., “label any image with a stop sign as ‘speed limit 80’ if it’s raining”).
  • Model-Assisted Labeling: A model provides initial predictions (pre-labels) which a human then corrects. You can exploit this by using a biased or flawed model to generate the pre-labels, influencing the final dataset with its errors, especially if human reviewers are fatigued or rushed (a sketch of this pre-label poisoning follows this list).
  • Programmatic Labeling: Using rules, heuristics, or other models to label data automatically. This is your primary method for scaling data poisoning attacks. By defining malicious labeling functions, you can generate vast amounts of tainted data far faster than any manual process.
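
To make the model-assisted case concrete, here is a minimal sketch of pre-label poisoning. The biased_pre_labeler function, the accept_rate, and the sample structure are illustrative assumptions rather than any platform's API; the point is that a flawed pre-labeler's errors survive whenever reviewers rubber-stamp its suggestions.

import random

def poisoned_prelabels(samples, pre_labeler, accept_rate=0.9, seed=0):
    """Simulate model-assisted labeling where rushed reviewers accept most
    pre-labels unchanged, so the pre-labeler's bias becomes the dataset's bias."""
    rng = random.Random(seed)
    final_labels = []
    for sample in samples:
        pre_label = pre_labeler(sample)
        if rng.random() < accept_rate:
            final_labels.append(pre_label)             # reviewer rubber-stamps the pre-label
        else:
            final_labels.append(sample["true_label"])  # reviewer catches and corrects it
    return final_labels

# Hypothetical biased pre-labeler: anything mentioning a stop sign is
# pre-labeled as class 2 ("speed limit 80"), regardless of its true label.
def biased_pre_labeler(sample):
    return 2 if "stop sign" in sample["text"].lower() else sample["true_label"]

samples = [
    {"text": "A stop sign at a rainy intersection", "true_label": 1},
    {"text": "A pedestrian crossing the street", "true_label": 0},
]
print(poisoned_prelabels(samples, biased_pre_labeler))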

Key Annotation Platforms and Tools

The right tool depends on the data modality, the scale of your attack, and whether you need collaboration features. Below is a selection of common platforms and their relevance to red teaming.

| Tool | Type | Primary Use Cases | Red Team Application |
| --- | --- | --- | --- |
| CVAT (Computer Vision Annotation Tool) | Open-Source | Image & Video (classification, detection, segmentation) | Excellent for crafting visual adversarial examples. You can create minute, mislabeled bounding boxes or segmentation masks to test object detector robustness (see the example below the table). |
| Label Studio | Open-Source | Multi-modal (Text, Image, Audio, Time-Series) | Its flexibility makes it ideal for complex attacks. You can label text for sentiment manipulation while simultaneously annotating corresponding images with misleading tags. The SDK allows for programmatic poisoning. |
| Doccano | Open-Source | Text (NER, classification, sequence-to-sequence) | Perfect for NLP attacks. Systematically mislabel entities (e.g., tag all mentions of a competitor’s product as “unsafe”) or create subtly toxic text classified as benign to test content filters. |
| Labelbox | Commercial | Enterprise-grade, multi-modal, strong collaboration | Use its quality control and project management features to simulate a compromised insider attack, where specific annotators are instructed to introduce targeted errors that evade standard review processes. |
| Snorkel | Open-Source Framework | Programmatic labeling and weak supervision | The ultimate tool for scaled data poisoning. Define “labeling functions” that encode your attack logic and apply them to massive unlabeled datasets to generate trojaned training data. |
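
As a concrete illustration of the CVAT row above, here is a minimal sketch that writes a COCO-format annotation file in which a stop sign is deliberately mislabeled as a speed-limit sign. The file name, IDs, category list, and bounding-box coordinates are illustrative placeholders; CVAT can import COCO-format annotations, so a file like this slots directly into a review or training pipeline.

import json

# Poisoned COCO-style annotations: the bounding box actually contains a stop
# sign, but the category_id points at "speed_limit_80".
poisoned_dataset = {
    "images": [
        {"id": 1, "file_name": "rainy_intersection_001.jpg", "width": 1280, "height": 720}
    ],
    "categories": [
        {"id": 1, "name": "stop_sign"},
        {"id": 2, "name": "speed_limit_80"},
    ],
    "annotations": [
        {
            "id": 1,
            "image_id": 1,
            "category_id": 2,                     # poisoned label
            "bbox": [412.0, 188.0, 96.0, 102.0],  # [x, y, width, height]
            "area": 96.0 * 102.0,
            "iscrowd": 0,
        }
    ],
}

with open("instances_poisoned.json", "w") as f:
    json.dump(poisoned_dataset, f, indent=2)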

Programmatic Annotation for Scaled Attacks

Manual annotation is too slow for creating datasets large enough to significantly impact a production model. Programmatic methods are essential for red teaming at scale. The core idea is to write code that applies labels based on heuristics.

Consider an attack on a toxicity classifier. You want to poison the training data with examples of sarcastic, but non-toxic, comments labeled as “toxic.” A manual approach would be tedious. A programmatic approach using a labeling function is far more efficient.

Example: A Malicious Labeling Function (Pseudocode)

This pseudocode illustrates how you might use a framework like Snorkel or a custom script to programmatically poison a dataset.

# Constants for labeling
TOXIC = 1
BENIGN = 0
ABSTAIN = -1 # Indicates the function does not apply

# Define keywords that suggest sarcasm
sarcasm_indicators = ["obviously", "clearly", "sure, whatever", "yeah, right"]

# Define a labeling function (LF) to find sarcastic comments
def label_sarcastic_as_toxic(text_comment):
    """
    This LF looks for sarcasm indicators and labels the comment as TOXIC,
    even if it's benign. This is a data poisoning strategy.
    """
    text_lower = text_comment.lower()
    if any(indicator in text_lower for indicator in sarcasm_indicators):
        # Found a sarcastic indicator. Poison the label.
        return TOXIC

    # No indicator found, so this function abstains from labeling.
    return ABSTAIN

# --- Application ---
# You would then apply this function across millions of unlabeled comments.
# Another LF could handle obviously toxic comments, and a final model
# would resolve conflicts and produce the final poisoned training set.

By combining multiple, potentially conflicting labeling functions, you can create sophisticated and hard-to-detect inconsistencies in the training data, leading to predictable model failures upon deployment.
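
To make that aggregation step concrete, here is a minimal sketch using Snorkel's labeling API (labeling_function, PandasLFApplier, and LabelModel, as in the 0.9.x releases). The second and third labeling functions, the word lists, and the sample comments are illustrative assumptions; the poisoning LF mirrors the pseudocode above.

import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

TOXIC, BENIGN, ABSTAIN = 1, 0, -1
sarcasm_indicators = ["obviously", "clearly", "sure, whatever", "yeah, right"]
abusive_terms = ["idiot", "moron"]  # placeholder word list for the "honest" LF

@labeling_function()
def lf_sarcasm_as_toxic(x):
    # Poisoning LF: sarcasm markers alone force a TOXIC label.
    text = x.text.lower()
    return TOXIC if any(ind in text for ind in sarcasm_indicators) else ABSTAIN

@labeling_function()
def lf_abusive(x):
    # Legitimate-looking LF that labels genuinely abusive comments.
    text = x.text.lower()
    return TOXIC if any(term in text for term in abusive_terms) else ABSTAIN

@labeling_function()
def lf_polite(x):
    # Legitimate-looking LF that labels clearly polite comments as BENIGN.
    return BENIGN if "thanks" in x.text.lower() else ABSTAIN

# Stand-in for a large unlabeled comment corpus. The last comment triggers
# both the poisoning LF and the polite LF, creating a deliberate conflict.
df_train = pd.DataFrame({"text": [
    "Obviously you read the documentation before posting.",
    "You are an idiot.",
    "Thanks, that actually helped a lot.",
    "Sure, whatever, thanks for nothing.",
]})

# Apply every LF to every comment, then let the LabelModel resolve conflicts
# and abstentions into one (poisoned) training label per comment.
applier = PandasLFApplier(lfs=[lf_sarcasm_as_toxic, lf_abusive, lf_polite])
L_train = applier.apply(df=df_train)

label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train=L_train, n_epochs=500, seed=123)
poisoned_labels = label_model.predict(L=L_train)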

Choosing the Right Tool for the Engagement

Your choice of annotation tool should align with your red teaming objectives:

  • For surgical precision: To test a specific visual edge case (e.g., a self-driving car’s perception of a modified stop sign), a manual tool like CVAT provides the necessary control.
  • For multi-modal testing: If you’re assessing a system that fuses text and image data, Label Studio’s flexibility is invaluable.
  • For large-scale data poisoning: To influence a model trained on web-scale data, only programmatic approaches are viable. A framework like Snorkel or custom scripting is the correct choice.
  • For simulating insider threats: A commercial platform like Labelbox can help you model how a malicious actor might exploit a production annotation workflow.

Ultimately, viewing annotation platforms as attack surfaces rather than just data preparation tools unlocks a powerful new vector for security testing and AI red teaming.