# Incident response runbook

For when something is on fire. Use this page during the incident; populate the post-mortem afterwards.

## Definition

An **incident** is an event that:

* Caused or could cause an outage of the deployed application;
* Caused or could cause unauthorised data access; or
* Required emergency action outside the normal change process.

## Severity ladder

| Severity | Examples                                                                                    | Communication cadence       |
| -------- | ------------------------------------------------------------------------------------------- | --------------------------- |
| **P0**   | Full outage; active data exfiltration; verified secret in production                        | Every 30 min until resolved |
| **P1**   | Degraded service for > 5% of users; privilege escalation; major dependency CVE in prod path | Every 4 h                   |
| **P2**   | Single-tenant degradation; medium CVE; non-blocking config drift                            | Daily until resolved        |
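
The cadence column can be turned into a small lookup so that a bot or script can compute when the next update is due. The following is a minimal sketch, not an existing tool: the names `UPDATE_INTERVAL` and `next_update_due` are hypothetical, and "daily" is assumed to mean every 24 hours.

```python
from datetime import datetime, timedelta, timezone

# Update intervals copied from the severity table.
# Assumption: "Daily" for P2 is modelled as exactly 24 hours.
UPDATE_INTERVAL = {
    "P0": timedelta(minutes=30),
    "P1": timedelta(hours=4),
    "P2": timedelta(hours=24),
}


def next_update_due(severity: str, last_update: datetime) -> datetime:
    """Return when the next status update is due for an open incident."""
    if severity not in UPDATE_INTERVAL:
        raise ValueError(f"unknown severity: {severity!r}")
    return last_update + UPDATE_INTERVAL[severity]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print("Next P0 update due:", next_update_due("P0", now).isoformat())
```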

## Live sequence

```mermaid
stateDiagram-v2
    [*] --> Detect
    Detect --> Declare: alert / report / scan
    Declare --> Mitigate: severity set
    Mitigate --> Resolve: bleeding stopped
    Resolve --> Communicate: patch in CI
    Communicate --> Postmortem: customer-facing close
    Postmortem --> [*]: action items merged
```
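
If incident tooling needs to enforce this order, the diagram maps onto a simple transition table. This is an illustrative sketch only; `TRANSITIONS` and `advance` are hypothetical names, and the state names are taken from the diagram.

```python
# Allowed transitions, copied from the state diagram.
# None marks the terminal state (action items merged).
TRANSITIONS = {
    "Detect": "Declare",
    "Declare": "Mitigate",
    "Mitigate": "Resolve",
    "Resolve": "Communicate",
    "Communicate": "Postmortem",
    "Postmortem": None,
}


def advance(state: str) -> str:
    """Return the state that follows `state`, or raise if there is none."""
    if state not in TRANSITIONS:
        raise ValueError(f"unknown state: {state!r}")
    nxt = TRANSITIONS[state]
    if nxt is None:
        raise ValueError(f"{state} is the final state")
    return nxt


if __name__ == "__main__":
    state = "Detect"
    while TRANSITIONS[state] is not None:
        print(state, "->", advance(state))
        state = advance(state)
```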

### 1. Detect

Sources: alert, customer report, security scanner finding, developer self-report. Anyone can call it.

### 2. Declare

Owner sets severity within 15 min. Open a private incident channel (or thread). State:

> **Incident open**: `<summary>` at `<time>`. Severity: P0/P1/P2. Lead: @`<handle>`. Next update: `<time>`.
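
The opening message can also be generated so that it is worded the same way every time. The sketch below assumes the template's slots are a short summary, the opening time, the lead's handle, and the time of the next update; the function name and parameter names are hypothetical, and timestamps are assumed to be timezone-aware.

```python
from datetime import datetime, timezone

VALID_SEVERITIES = {"P0", "P1", "P2"}


def incident_open_message(
    summary: str,
    opened_at: datetime,
    severity: str,
    lead: str,
    next_update: datetime,
) -> str:
    """Format the 'Incident open' announcement for the incident channel."""
    if severity not in VALID_SEVERITIES:
        raise ValueError(f"unknown severity: {severity!r}")
    fmt = "%Y-%m-%d %H:%M UTC"
    # Normalise both timestamps to UTC so every reader sees the same clock.
    opened = opened_at.astimezone(timezone.utc).strftime(fmt)
    upcoming = next_update.astimezone(timezone.utc).strftime(fmt)
    return (
        f"**Incident open**: {summary} at {opened}. "
        f"Severity: {severity}. Lead: @{lead}. Next update: {upcoming}."
    )
```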

### 3. Mitigate

Stop the bleeding **before** fixing root cause. Allowed mitigations:

* Rollback to a known-good GHCR tag.
* Disable a feature flag.
* Apply a temporary network block.
* Revoke a leaked credential at the source.

CI may not be bypassed even for P0 unless explicitly waived by the owner, with the waiver noted in the post-mortem.

### 4. Resolve

Ship a patch through the normal PR process (accelerated review window for Critical findings, see [`DEV_PROCESS.md`](https://github.com/NKAP360-dev/edge-governance/blob/main/DEV_PROCESS.md) §7).

### 5. Communicate

| Audience               | Channel             | Cadence                           |
| ---------------------- | ------------------- | --------------------------------- |
| Tenant ops             | Status page / email | per severity table                |
| Bank GRC (if affected) | Pre-agreed contact  | within 1 h of declaration         |
| Internal team          | Incident channel    | every 15 min during P0 mitigation |

### 6. Post-mortem

Due within 5 business days for P0/P1 incidents and within 30 days for High-severity vulnerabilities. Use a blameless format built around four questions:

1. **What happened?** (Timeline.)
2. **Why did it happen?** (Technical + organisational root cause.)
3. **How was it detected and resolved?**
4. **What changes prevent recurrence?** (Action items with owners and dates.)
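
The 5-business-day deadline for P0/P1 is easy to miscount across a weekend. A minimal sketch of the arithmetic follows, assuming that only Saturdays and Sundays are skipped (public holidays are ignored) and that you choose the start date yourself, since this page does not say whether the clock starts at declaration or at resolution. The function names are hypothetical.

```python
from datetime import date, timedelta


def add_business_days(start: date, days: int) -> date:
    """Return the date `days` business days after `start`.

    Assumption: only weekends are skipped; holidays are not considered.
    """
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            remaining -= 1
    return current


def postmortem_due(start: date) -> date:
    """Post-mortem deadline for a P0/P1 incident: 5 business days."""
    return add_business_days(start, 5)


if __name__ == "__main__":
    print(postmortem_due(date.today()).isoformat())
```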

File the post-mortem at `edge-governance/postmortems/YYYY-MM-DD-<slug>.md` and link it from the incident channel.
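
To keep file names consistent, the path can be derived from the incident date and a short title. This is a sketch under assumptions: the slug rules (lowercase, hyphen-separated alphanumerics) are not specified on this page, and `slugify` and `postmortem_path` are hypothetical helpers.

```python
import re
from datetime import date


def slugify(title: str) -> str:
    """Lowercase the title and join its alphanumeric runs with hyphens."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    if not slug:
        raise ValueError("title produces an empty slug")
    return slug


def postmortem_path(incident_date: date, title: str) -> str:
    """Build the path following the YYYY-MM-DD-<slug>.md convention."""
    return (
        f"edge-governance/postmortems/"
        f"{incident_date.isoformat()}-{slugify(title)}.md"
    )


if __name__ == "__main__":
    print(postmortem_path(date(2024, 1, 5), "GHCR token leak"))
```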

## Pre-staged contacts

| Role                    | Contact                         |
| ----------------------- | ------------------------------- |
| Owner / IC              | `@nkap360`                      |
| Security disclosure     | per `SECURITY.md` on `edge-app` |
| GHCR ops (revoke token) | GitHub org owner                |
| LLM gateway revoke      | Bank-side Vault custodian       |

## Drill cadence

Quarterly tabletop. See [Gate 07](/banking-readiness/gate-07-incident-response.md). Each drill produces a `postmortems/YYYY-MM-DD-drill-<topic>.md` even when no real incident occurred.

## After the incident

* Update [Banking Readiness](/banking-readiness/overview.md) gate states if any changed.
* Update [Cutover history](/banking-readiness/cutover-history.md) if the incident triggered a structural change.
* Bump rotation logs in `SECRETS_AUDIT_*.md` if credentials were touched.

