# Monitoring

## Probes

| Probe             | Endpoint             | Expected                    | Action on fail       |
| ----------------- | -------------------- | --------------------------- | -------------------- |
| Liveness          | `GET /health`        | 200 within 1 s              | Restart pod          |
| Readiness         | `GET /health`        | 200 within 1 s              | Stop sending traffic |
| Backend health    | `GET /api/health`    | 200 with `{ status: "ok" }` | Page on-call         |
| PocketBase health | `GET /pb/api/health` | 200                         | Investigate PB       |
| MCP health        | (internal probe)     | JSON-RPC echo               | Investigate MCP      |
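The pass/fail rules in the table can be sketched as plain predicates. This is a minimal illustration, not Edge's actual probe code: `ProbeResult`, `health_probe_ok`, and `backend_probe_ok` are hypothetical names, and the sketch assumes the caller has already performed the HTTP request and measured its duration.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeResult:
    """Outcome of one HTTP probe (hypothetical structure, for illustration)."""
    status_code: int
    elapsed_s: float
    body: str = ""


def health_probe_ok(result: ProbeResult, timeout_s: float = 1.0) -> bool:
    """Liveness/readiness rule: HTTP 200 within the timeout (1 s in the table)."""
    return result.status_code == 200 and result.elapsed_s <= timeout_s


def backend_probe_ok(result: ProbeResult) -> bool:
    """Backend health rule: HTTP 200 and a JSON body whose `status` is "ok"."""
    if result.status_code != 200:
        return False
    try:
        payload = json.loads(result.body)
    except json.JSONDecodeError:
        return False  # a non-JSON body counts as a failed probe
    return isinstance(payload, dict) and payload.get("status") == "ok"
```

Keeping the rules separate from the HTTP call makes them easy to unit-test and reuse in whichever prober the operator already runs.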

## Metrics surfaces

```mermaid
flowchart LR
    BE[FastAPI] -->|structured JSON logs| Logs[Log aggregator]
    BE -->|/metrics, optional| Prom[Prometheus]
    BE -->|eval scores| PB[(PocketBase)]
    BE -->|optional traces| LF[Langfuse]
```

## Key metrics to alert on

| Metric                    | Source        | Threshold            | Why                                  |
| ------------------------- | ------------- | -------------------- | ------------------------------------ |
| `/health` 5xx rate        | Probe         | > 1% over 5 min      | Service degraded                     |
| Flow execution p99        | App log       | > 5× baseline        | LLM gateway issue or thundering herd |
| LLM call failure rate     | App log       | > 5% over 10 min     | Upstream LLM outage                  |
| Eval `overall_score` drop | PocketBase    | > 20% week-over-week | Model regression                     |
| Disk used on `pb_data`    | Node exporter | > 80%                | Backup retention / cleanup needed    |
| Pod restart count         | K8s           | > 3 in 1 h           | Crash loop                           |
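Three of the thresholds above reduce to simple arithmetic, sketched below. The function names are hypothetical, and the sketch makes two assumptions the table does not state: the week-over-week drop is read as a relative change, and the p99 baseline is supplied by the caller. In practice these rules would more likely live in the alerting system than in application code.

```python
from statistics import quantiles


def rate_exceeds(failures: int, total: int, threshold: float) -> bool:
    """True when failures / total is strictly above the threshold (e.g. 0.05)."""
    if total == 0:
        return False  # no traffic in the window, nothing to alert on
    return failures / total > threshold


def p99(durations_s: list[float]) -> float:
    """99th percentile of a sample; needs at least two data points."""
    return quantiles(durations_s, n=100, method="inclusive")[98]


def p99_exceeds_baseline(durations_s: list[float], baseline_p99_s: float,
                         factor: float = 5.0) -> bool:
    """True when the current p99 is more than `factor` times the baseline."""
    return p99(durations_s) > factor * baseline_p99_s


def score_dropped(previous_week: float, current_week: float,
                  max_drop: float = 0.20) -> bool:
    """True when the score fell by more than `max_drop` relative to last week."""
    if previous_week <= 0:
        return False  # relative drop is undefined without a positive baseline
    return (previous_week - current_week) / previous_week > max_drop
```

For example, 6 failed LLM calls out of 100 in a 10-minute window crosses the 5% threshold, while an `overall_score` moving from 0.80 to 0.70 (a 12.5% relative drop) does not cross the 20% one.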

## Dashboards (when Langfuse is enabled)

Per flow:

* Trace duration distribution (p50/p95/p99)
* LLM call latency by service
* Score trend per metric over time
* Pass-rate by use case

## Log shape

All backend logs are JSON, one event per line:

```json
{
  "ts": "2026-05-28T10:23:45Z",
  "level": "INFO",
  "msg": "flow_start",
  "execution_id": "exec_a1b2c3d4",
  "workspace_id": "ws0...",
  "user_id": "u0...",
  "node_count": 3
}
```

Required fields. `ts`, `level`, and `msg` appear on every event; the context fields are required only when the stated context applies:

| Field          | Source or condition                  |
| -------------- | ------------------------------------ |
| `ts`           | structlog timestamper                |
| `level`        | log level                            |
| `msg`          | event name                           |
| `execution_id` | required when in a flow context      |
| `workspace_id` | required when tenant-scoped          |
| `user_id`      | required when authenticated          |
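A log-pipeline check for these fields could look like the following sketch. `missing_fields` and its keyword flags are hypothetical; the sketch assumes the caller knows which contexts apply to the event being checked, and it treats `ts`, `level`, and `msg` as the only unconditional fields.

```python
import json

# Fields the table lists without a context condition.
ALWAYS_REQUIRED = ("ts", "level", "msg")


def missing_fields(line: str, *, in_flow: bool = False,
                   tenant_scoped: bool = False,
                   authenticated: bool = False) -> list[str]:
    """Return the required fields absent from one JSON log line."""
    event = json.loads(line)
    required = list(ALWAYS_REQUIRED)
    if in_flow:
        required.append("execution_id")
    if tenant_scoped:
        required.append("workspace_id")
    if authenticated:
        required.append("user_id")
    return [name for name in required if name not in event]
```

An empty list means the event carries every field its context requires; a line that is not valid JSON raises `json.JSONDecodeError`, which itself signals a malformed event.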

## What is **not** monitored by Edge itself

* The underlying VM / pod / network — operator's responsibility.
* The LLM gateway upstream — bank-side responsibility; Edge surfaces the failure rate but not the root cause.
* The Langfuse stack health (when self-hosted) — operator runs separate probes.

## Compliance mapping

* ISO 27001 Annex A.8.15 (Logging).
* ISO 27001 Annex A.8.16 (Monitoring activities).
* DORA Art 10 (Detection: anomaly and threat detection mechanisms).

