24.5.5 Post-mortem template

After an incident is resolved, the work is not finished. The post-mortem is a structured, blameless review process designed to understand the complete event lifecycle. Its goal is not to assign fault but to identify systemic weaknesses and generate concrete action items to improve resilience. For AI systems, where failures can be non-obvious and complex, a rigorous post-mortem is indispensable for organizational learning.

The Blameless Post-mortem Philosophy

Before diving into the template, internalize the core principles that make a post-mortem effective. A culture of blame stifles honesty and prevents you from discovering the true root causes.

  • Assume Positive Intent: Everyone involved was acting with the best information they had at the time. People do not intentionally cause failures.
  • Focus on Systems, Not Individuals: The inquiry should target processes, tools, and environmental factors. Instead of asking “Who made a mistake?”, ask “Why was it possible for this mistake to be made?”.
  • Encourage Open Participation: Create a psychologically safe environment where all participants, regardless of seniority, can contribute their perspective without fear of reprisal.
  • Produce Actionable Outcomes: The ultimate goal is a list of specific, measurable, achievable, relevant, and time-bound (SMART) action items that will prevent or mitigate future incidents.

[Figure: Post-mortem process flow: Incident Response → Post-mortem → Action Items & Learnings → Improvement]

Post-mortem Report Template

Use the following structure for your post-mortem documentation. It keeps reports consistent and ensures that all critical areas are covered.

1. Executive Summary

This section provides a high-level overview for stakeholders who need to understand the incident’s scope and impact without delving into deep technical details.

Field             | Description
Incident ID       | Unique identifier from your ticketing or tracking system.
Title             | A brief, descriptive title (e.g., “Model Evasion via Adversarial Patch Attack”).
Date & Time (UTC) | Start and end times of the incident.
Severity          | The assessed severity level (e.g., Critical, High, Medium, Low).
Lead Investigator | The person responsible for driving the post-mortem process.
Systems Affected  | List of primary AI models, APIs, data pipelines, or infrastructure impacted.
Business Impact   | A concise summary of the impact on users, revenue, or operations.
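
To make reports easier to aggregate and search across incidents, the summary fields can also be captured as structured data. The sketch below is a minimal, hypothetical Python representation of the table above; the field names, severity scale, and example values are illustrative assumptions rather than part of the template.

```python
# Hypothetical sketch: the executive-summary fields as a structured record.
# Field names mirror the table above; all example values are invented.
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import List


class Severity(str, Enum):
    CRITICAL = "Critical"
    HIGH = "High"
    MEDIUM = "Medium"
    LOW = "Low"


@dataclass
class ExecutiveSummary:
    incident_id: str              # Unique identifier from the ticketing system
    title: str                    # Brief, descriptive title
    start_time_utc: datetime      # Incident start (UTC)
    end_time_utc: datetime        # Incident end (UTC)
    severity: Severity            # Assessed severity level
    lead_investigator: str        # Person driving the post-mortem
    systems_affected: List[str] = field(default_factory=list)
    business_impact: str = ""     # Concise impact summary


summary = ExecutiveSummary(
    incident_id="INC-2024-042",
    title="Model Evasion via Adversarial Patch Attack",
    start_time_utc=datetime(2024, 5, 14, 9, 30),
    end_time_utc=datetime(2024, 5, 14, 13, 5),
    severity=Severity.HIGH,
    lead_investigator="On-call ML security lead",
    systems_affected=["image-classifier-v3", "fraud-detection-api"],
    business_impact="Elevated false negatives in fraud screening for roughly 3.5 hours.",
)
```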

2. Detailed Timeline

Construct a chronological, timestamped account of all events. This is the factual backbone of the analysis. Precision is key.

  • Detection: When and how was the issue first noticed? Was it an automated alert, a user report, or an internal discovery?
  • Escalation: Who was notified and when? Follow the path through your escalation matrix.
  • Key Actions: What steps were taken by the response team? Include both successful and unsuccessful attempts.
  • Communications: Note all internal and external communications sent.
  • Resolution: When was the service fully restored or the threat neutralized?
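
Keeping the timeline as UTC-timestamped, structured records makes it easy to sort events and cross-check them against logs and alerts. A minimal sketch, assuming Python and entirely invented example events:

```python
# Hypothetical sketch: timestamped timeline entries. The category values follow
# the bullets above (detection, escalation, action, communication, resolution).
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class TimelineEvent:
    timestamp_utc: datetime   # Always record in UTC to avoid timezone ambiguity
    category: str             # "detection" | "escalation" | "action" | "communication" | "resolution"
    actor: str                # Person, team, or automated system
    description: str          # What happened, stated factually


timeline = [
    TimelineEvent(datetime(2024, 5, 14, 9, 30, tzinfo=timezone.utc),
                  "detection", "anomaly-monitor", "Alert: spike in low-confidence predictions."),
    TimelineEvent(datetime(2024, 5, 14, 9, 42, tzinfo=timezone.utc),
                  "escalation", "on-call SRE", "Paged the ML security on-call per the escalation matrix."),
    TimelineEvent(datetime(2024, 5, 14, 13, 5, tzinfo=timezone.utc),
                  "resolution", "ML Eng Team", "Patched input filter deployed; metrics back to baseline."),
]

# Sorting by timestamp keeps the narrative strictly chronological.
timeline.sort(key=lambda event: event.timestamp_utc)
```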

3. Root Cause Analysis (RCA)

This is the core analytical part of the post-mortem. Go beyond the immediate trigger and uncover the underlying systemic causes. The “5 Whys” technique is often effective here.

Key Questions for AI Incidents:

  • Model Failure: Did the model produce incorrect, biased, or harmful outputs? Was this due to an edge case, concept drift, or a specific adversarial input?
  • Data Integrity: Was the incident caused by corrupted, poisoned, or skewed training/input data? Could an attacker manipulate the data pipeline?
  • Evasion/Bypass: Did an attacker successfully bypass security controls (e.g., content filters, fraud detection) by crafting a specific input? Was this a known or novel technique?
  • Extraction/Inference: Was sensitive information leaked through model outputs? Could an attacker infer private training data or extract the model architecture?
  • MLOps Pipeline: Was there a vulnerability in the CI/CD pipeline for models? Could an unauthorized model have been deployed? Was there a flaw in versioning or rollback procedures?
  • Monitoring & Alerting: Why did our monitoring systems fail to detect this issue sooner? Were the thresholds wrong, or were the necessary metrics not being tracked?
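
As a worked illustration of the “5 Whys” technique mentioned above, the chain below walks from an evasion symptom down to a process-level root cause. The scenario and every answer are hypothetical:

```python
# Entirely hypothetical "5 Whys" chain for an evasion incident, kept as data so it
# can be attached to the post-mortem record. Each answer feeds the next question.
five_whys = [
    ("Why did the content filter pass the malicious prompt?",
     "The classifier scored it below the blocking threshold."),
    ("Why was the score below the threshold?",
     "The prompt used a paraphrasing pattern absent from the training data."),
    ("Why was that pattern absent from the training data?",
     "The adversarial test set had not been refreshed since the last model release."),
    ("Why had the test set not been refreshed?",
     "No owner or schedule exists for adversarial test-set maintenance."),
    ("Why is there no owner or schedule?",
     "The MLOps process defines ownership for models but not for evaluation assets."),
]

for i, (question, answer) in enumerate(five_whys, start=1):
    print(f"Why #{i}: {question}\n  -> {answer}")
```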

4. Impact Assessment

Quantify the effects of the incident across various domains.

  • Technical Impact: System downtime, data corruption, performance degradation, resource consumption.
  • Business Impact: Financial loss, SLA breaches, operational disruption.
  • User/Customer Impact: Number of users affected, loss of trust, negative user experience.
  • Security/Compliance Impact: Data breach, regulatory fines, reputational damage.
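
Where the underlying data exists, even a rough calculation is more useful than adjectives. The sketch below estimates business impact from downtime and traffic figures; every number is a placeholder, not a real measurement:

```python
# Hypothetical back-of-the-envelope impact estimate. All rates and costs are
# placeholders; real figures come from monitoring and finance data.
downtime_minutes = 215                # taken from the detailed timeline
requests_per_minute = 1_200           # average traffic to the affected API
failure_rate_during_incident = 0.18   # fraction of requests mis-served
cost_per_failed_request = 0.05        # estimated revenue/remediation cost (USD)

failed_requests = downtime_minutes * requests_per_minute * failure_rate_during_incident
estimated_cost = failed_requests * cost_per_failed_request

print(f"Failed requests: {failed_requests:,.0f}")
print(f"Estimated business impact: ${estimated_cost:,.2f}")
```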

5. Lessons Learned

Reflect on the response process itself. This helps refine your incident handling protocols for the future.

  • What Went Well? Identify successful actions, effective tools, or strong team collaboration that expedited resolution. Reinforce these positive behaviors.
  • What Could Be Improved? Pinpoint communication gaps, technical hurdles, documentation deficiencies, or process bottlenecks that slowed down the response.
  • Where Did We Get Lucky? Acknowledge any fortunate circumstances that prevented the incident from being worse. Luck is not a reliable mitigation strategy.

6. Action Items

This section translates analysis into action. Each item must be concrete, have a named owner, and be aimed at preventing the incident from recurring.

ID        | Action Item Description                                                              | Owner         | Priority | Due Date   | Status
AI-24-001 | Implement differential privacy in the data preprocessing pipeline for model v3.5+.  | ML Eng Team   | High     | YYYY-MM-DD | Not Started
AI-24-002 | Add monitoring for anomalous prediction latency in the inference API.               | SRE Team      | Medium   | YYYY-MM-DD | In Progress
AI-24-003 | Update incident response playbook with steps for handling model extraction attacks. | Security Team | High     | YYYY-MM-DD | Completed
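
Action items are easy to lose once the report is filed. A lightweight check such as the hypothetical sketch below can flag items that lack an owner or a due date before the post-mortem is closed; field names mirror the table above, and all values are illustrative:

```python
# Hypothetical sketch: block post-mortem closure while action items are unassigned
# or not time-bound. Field names mirror the action-item table above.
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    item_id: str
    description: str
    owner: str
    priority: str              # e.g., "High", "Medium", "Low"
    due_date: Optional[date]
    status: str = "Not Started"


def incomplete_items(items: List[ActionItem]) -> List[str]:
    """Return the IDs of items missing an owner or a due date."""
    return [item.item_id for item in items if not item.owner or item.due_date is None]


items = [
    ActionItem("AI-24-001", "Implement differential privacy in the preprocessing pipeline.",
               "ML Eng Team", "High", date(2024, 7, 1)),
    ActionItem("AI-24-002", "Add monitoring for anomalous prediction latency.",
               "SRE Team", "Medium", None),  # missing due date -> flagged
]

print("Items blocking post-mortem closure:", incomplete_items(items))
```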