# Logging Plan

> Branch: `fasten` · Status: proposed · pre-v1, **no back-compat, no
> deprecation windows, no dual-write, no legacy shims.**
>
> §§1–5 are the spec. §6 lists 27 issues in a single scope table. §7 gives the ordering. §8 lists what is out of scope.

---

## 1 · Three streams

Every log line belongs to exactly one stream.

| Stream        | Intent                                                                                    | Format                   | Retention     |
|---------------|-------------------------------------------------------------------------------------------|--------------------------|---------------|
| **syslog**    | What a service is doing: lifecycle, retries, warnings. **Also** the event-processor `log` action emitted by user rules. | structured JSON **or** plain text | hours–days    |
| **API log**   | One row per HTTP request at a boundary.                                                   | fixed                    | days–weeks    |
| **audit log** | Business events with a code from §5.                                                      | fixed + vocab            | months–years  |

Canonical endpoints — same shape on every side that exposes HTTP logs:

```
GET /api/v1/logs/api     ← API access log
GET /api/v1/logs/audit   ← audit log
GET /api/v1/logs/sys     ← syslog (both shapes)
```

| Filters (shared) | Per-stream filters                                             |
|------------------|----------------------------------------------------------------|
| `request_id`, `since`, `until`, `limit` | `/audit`: `code`, `domain`, `source_node_id` · `/sys`: `level`, `service_id`, `format=json\|text` |

Today: edge has `/logs`, `/logs/api`, `/audit`; EM has `/logs` + `/syslogs` and no audit; edge-sync is stdout-only; connectors/egress use `console.log` / `log.Printf`. **LOG-04** (EM) and **LOG-27** (edge) rename to the canonical shape.

---

## 2 · Two syslog shapes — both readable through `/logs/sys`

| Shape      | Producer                                                      | Wire form                                                | When                                                |
|------------|---------------------------------------------------------------|----------------------------------------------------------|-----------------------------------------------------|
| structured | Services (gateway, EM, ES, connectors, egress) via SDK logger | one JSON object per line (§4 schema)                      | Always for SDK-emitted logs                          |
| plain text | Event-processor `log` **action** (user rule), docker stdout   | `<iso-ts> <LEVEL> [req=<id>] <free text>`                 | User-authored actions, container stdout passthrough |

Wire rules:
- Every stdout line is exactly one shape — never mixed, never unstructured from SDK code.
- Plain-text lines still correlate: event-processor runtime prepends `[req=<id>]` when the rule fires inside a request.
- `/logs/sys` reader parses both into a common `{shape, timestamp, level, request_id, …}` row so the correlated view (LOG-17) joins them.
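
The normalisation step in the wire rules can be sketched as follows. This is a minimal illustration, not the reader's actual code: the function name `normalise_line`, the `row` key that keeps the original JSON object, and the choice to return `None` for unparseable lines are all assumptions.

```python
import json
import re
from typing import Any, Optional

# Plain-text wire form from the table above:
#   <iso-ts> <LEVEL> [req=<id>] <free text>
# The [req=...] part is optional because it is only present inside a request.
TEXT_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\S+)\s+(?P<level>[A-Za-z]+)\s+"
    r"(?:\[req=(?P<req>[^\]]+)\]\s+)?(?P<text>.*)$"
)


def normalise_line(line: str) -> Optional[dict[str, Any]]:
    """Parse one stdout line into the common row; None if it matches neither shape."""
    line = line.rstrip("\n")
    if line.startswith("{"):
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return {
                "shape": "json",
                "timestamp": obj.get("timestamp"),
                "level": str(obj.get("level", "")).lower(),
                "request_id": obj.get("request_id"),
                "row": obj,  # hypothetical key: keeps the full structured payload
            }
    match = TEXT_LINE.match(line)
    if match is None:
        return None
    return {
        "shape": "text",
        "timestamp": match["ts"],
        "level": match["level"].lower(),
        "request_id": match["req"],  # None when the rule fired outside a request
        "text": match["text"],
    }
```

Lower-casing the level on both paths lets a single `level` filter on `/logs/sys` cover JSON and text rows alike.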

---

## 3 · Correlation — one `request_id`, carried everywhere

| Boundary                                | Behaviour                                                                                |
|-----------------------------------------|------------------------------------------------------------------------------------------|
| External caller → Gateway               | Honour `X-Request-ID`; else mint `uuid4().hex[:12]`.                                     |
| Service → service (HTTP)                | Propagate `X-Request-ID` on outbound calls.                                              |
| Gateway → connector (MQTT `cmd/{id}/…`) | Payload carries `_req`; SDK stamps it on every log line in the handler.                  |
| EM → Edge Sync (poll response)          | EM assigns `deploy_request_id` per queued deploy; ES forwards on apply.                  |
| Edge Sync poll cycle                    | Mint `poll_request_id` at top of cycle; all 5 steps share it.                            |
| Non-request events (scheduler, crons)   | `actor_kind: "schedule"`; mint `scheduler-<run_id>` — never null.                        |

Given any `request_id`, an operator can pull its API-log row, zero or more audit rows, and all syslog lines (both shapes) from every service as one timeline.
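
A minimal sketch of the first, second and last rows of the boundary table, assuming headers arrive as a plain mapping. The helper names (`resolve_request_id`, `scheduler_request_id`, `outbound_headers`) are hypothetical, and taking the 12 characters from the UUID's hex form is an assumption about what the minting shorthand intends.

```python
import uuid
from typing import Mapping, Optional

HEADER = "X-Request-ID"


def resolve_request_id(headers: Mapping[str, str]) -> str:
    """Honour an inbound X-Request-ID, else mint a 12-character id."""
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == HEADER.lower() and value.strip():
            return value.strip()
    return uuid.uuid4().hex[:12]


def scheduler_request_id(run_id: str) -> str:
    """Non-request events get a synthetic id so the field is never null."""
    return f"scheduler-{run_id}"


def outbound_headers(
    request_id: str, extra: Optional[Mapping[str, str]] = None
) -> dict[str, str]:
    """Headers for a service-to-service call, with the id propagated."""
    headers = dict(extra or {})
    headers[HEADER] = request_id
    return headers
```

An empty or whitespace-only inbound header is treated as absent here, so the gateway still mints an id rather than propagating a blank one.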

---

## 4 · Schemas — standard audit attributes up front

Every audit row and every SDK-emitted syslog row is structured JSON; plain-text syslog lines follow §4.3. Dev pretty-prints; prod emits raw to stdout. Timestamps are UTC ISO-8601 with ms precision.

### 4.1 Standard audit attributes — the five anchors

Every audit row answers these five questions. Every field below exists for one of them.

| Anchor           | Field(s)                         | Example                            |
|------------------|----------------------------------|------------------------------------|
| **WHO**          | `actor`, `actor_kind`            | `"admin"`, `"user"`                |
| **WHAT**         | `code`, `action`                 | `"PIPELINE_CREATED"`, `"created"`  |
| **WHEN**         | `timestamp`, `monotonic_seq`     | `"2026-04-22T12:34:56.789Z"`, `42` |
| **OBJECT**       | `target`, `category`, `domain`   | `"p-123"`, `"pipeline"`, `"node"`  |
| **CORRELATION**  | `request_id`                     | `"a1b2c3d4"`                       |

Plus contextual payload in `detail` (§4.5) and ordering/replication metadata (`id`, `edge_row_id`, `source_node_id`, `shipped_at`) in §4.4.

### 4.2 syslog row (JSON shape)

```json
{ "shape": "json", "timestamp": "2026-04-22T12:34:56.789Z", "level": "info",
  "service_id": "edge-gateway", "logger": "pipelines",
  "event": "pipeline_save_failed", "request_id": "a1b2c3d4",
  "fields": { "pipeline_id": "p-123" }, "error": "…stack…" }
```
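
One way to produce this shape from Python is a formatter on top of the standard `logging` module. The plan does not name a logging library, so treat this as a sketch under that assumption; the class name `JsonLineFormatter` and the convention of passing `request_id` and `fields` through `extra=` are hypothetical.

```python
import json
import logging
from datetime import datetime, timezone


class JsonLineFormatter(logging.Formatter):
    """Render a LogRecord as one JSON object per line in the shape shown above."""

    def __init__(self, service_id: str) -> None:
        super().__init__()
        self.service_id = service_id

    def format(self, record: logging.LogRecord) -> str:
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        row = {
            "shape": "json",
            # UTC ISO-8601 with millisecond precision and a trailing Z.
            "timestamp": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname.lower(),
            "service_id": self.service_id,
            "logger": record.name,
            "event": record.getMessage(),
            # Set via logger.info(..., extra={"request_id": ..., "fields": ...}).
            "request_id": getattr(record, "request_id", None),
            "fields": getattr(record, "fields", {}),
        }
        if record.exc_info:
            row["error"] = self.formatException(record.exc_info)
        return json.dumps(row, separators=(",", ":"))
```

A handler would then be wired with `handler.setFormatter(JsonLineFormatter("edge-gateway"))`, and call sites would log the event name as the message.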

### 4.3 syslog row (text shape)

```
2026-04-22T12:34:56.789Z INFO [req=a1b2c3d4] Threshold 80 exceeded for sensor-3 (value=84.2)
```

Reader normalises to `{shape:"text", timestamp, level, request_id, text}`.

### 4.4 API-log row and audit-log row (full storage shape)

**API log** — no `fields` blob; this stream is HTTP correlation, not handler detail.
```json
{ "timestamp": "...", "request_id": "a1b2c3d4",
  "method": "POST", "path": "/api/v1/pipelines", "status": 201,
  "duration_ms": 12, "remote_addr": "...", "actor": "admin" }
```

**Audit log** (columns)

| Field            | Anchor (§4.1)  | Notes                                                    |
|------------------|----------------|----------------------------------------------------------|
| `id`             | —              | ULID, naturally time-ordered                             |
| `edge_row_id`    | —              | = `id` on edge; dedup key after replication              |
| `timestamp`      | WHEN           | UTC ISO-8601, ms precision                               |
| `monotonic_seq`  | WHEN           | Per-node counter; resolves same-ms ties                  |
| `code`           | WHAT           | From the enum                                            |
| `domain` / `category` / `action` / `severity` | OBJECT / WHAT | Denormalised from code for query speed     |
| `request_id`     | CORRELATION    | §3                                                       |
| `actor`          | WHO            | `"admin"` / `"scheduler"` / `"system"`                   |
| `actor_kind`     | WHO            | `user` / `service` / `schedule`                          |
| `target`         | OBJECT         | Primary resource id                                      |
| `source_node_id` | —              | Populated on EM after replication                        |
| `detail`         | contextual     | Per §4.5                                                 |
| `shipped_at`     | —              | Edge-only; null until replicated                         |

Losslessly convertible to a CloudEvent (`id / source=source_node_id / type=code / time / data=detail`) and an OpenTelemetry LogRecord — we don't adopt either, but stay compatible.
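
The CloudEvent mapping named above can be sketched as a plain dictionary transform. This is illustrative only: the function name is hypothetical, the `"local"` fallback for rows that have no `source_node_id` yet is an assumption, and audit fields outside the five mapped ones (for example `actor` or `request_id`) are omitted rather than carried as extension attributes.

```python
from typing import Any


def audit_row_to_cloudevent(row: dict[str, Any]) -> dict[str, Any]:
    """Map an audit row onto CloudEvents 1.0 attribute names (JSON format)."""
    return {
        "specversion": "1.0",
        "id": row["id"],
        # Assumption: placeholder source for edge rows not yet replicated.
        "source": row.get("source_node_id") or "local",
        "type": row["code"],
        "time": row["timestamp"],
        "datacontenttype": "application/json",
        "data": row.get("detail", {}),
    }
```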

### 4.5 Detail payload — standard shapes per event family

Every code belongs to one family. Dashboards and queries assume the shape.

| Family     | Shape                                                                                              |
|------------|----------------------------------------------------------------------------------------------------|
| CREATE     | `{resource_id, resource, parent?}`                                                                 |
| UPDATE     | `{path, changed_fields, old_value_hash, new_value, diff_preview}` — hash old, keep full new        |
| DELETE     | `{resource_id, resource_hash_at_delete}` — no plaintext                                            |
| TRIGGER    | `{rule_id\|schedule_id, reason, envelope_snippet, actions_dispatched}`                             |
| ERROR      | `{error_class, error_message, stack_trace_hash, recovery_action}` — full stack → syslog            |
| HEALTH     | `{old_status, new_status, reason, last_healthy_at}`                                                |
| LIFECYCLE  | `{version, config_hash, uptime_sec?, exit_code?}`                                                  |
| AUTH       | `{subject, method, ip, user_agent, reason?}`                                                       |

Levels: `debug / info / warning / error / critical`. Secrets redacted by shared `redact_secrets` processor (keys matching `api_key|password|token|secret|authorization|bearer|m2m_key|cert_private`).
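
A minimal sketch of what the `redact_secrets` processor might do with the key pattern above, assuming it walks nested dicts and lists and matches keys case-insensitively. The placeholder text `[REDACTED]` is an assumption; the plan does not specify it.

```python
import re
from typing import Any

# Key pattern copied from the plan; case-insensitive matching is an assumption.
SECRET_KEY = re.compile(
    r"api_key|password|token|secret|authorization|bearer|m2m_key|cert_private",
    re.IGNORECASE,
)
REDACTED = "[REDACTED]"  # assumption: placeholder text is not specified in the plan


def redact_secrets(value: Any) -> Any:
    """Return a copy of value with secret-keyed entries masked, recursing into containers."""
    if isinstance(value, dict):
        return {
            key: REDACTED if SECRET_KEY.search(str(key)) else redact_secrets(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact_secrets(item) for item in value]
    return value
```

Because the pattern is searched rather than matched in full, a key such as `db_password` or `refresh_token` is masked as well, which errs on the side of over-redaction.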

---

## 5 · Audit code catalog

Codes live **natively in each language**; no JSON file, not in `catalog/` (that's for block manifests). CI diff keeps all three in sync.

| File                                                  | Language | Shape                                               |
|-------------------------------------------------------|----------|-----------------------------------------------------|
| `edge/core/observability/audit_codes.py`              | Python   | `AuditCode` enum with attributes                    |
| `edge-manager/api/internal/audit/codes.go`            | Go       | `Code` constants + `Meta` map                       |
| `edge/edge-sync/internal/audit/codes.go`              | Go       | Same layout as EM                                   |

Each ships `--dump` printing `id,domain,severity` sorted; **LOG-03** runs all three and fails on drift.
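
The dump-and-diff idea can be sketched in Python as below. The three members are an illustrative subset, their severities are assumptions rather than values from the plan, and the helper names `dump` and `drift` are hypothetical.

```python
import csv
import io
from enum import Enum


class AuditCode(Enum):
    """Illustrative subset; each value is (id, domain, severity).

    The id is part of the value so that members sharing a domain and
    severity do not collapse into Enum aliases.
    """

    PIPELINE_CREATED = ("PIPELINE_CREATED", "node", "info")
    PIPELINE_UPDATED = ("PIPELINE_UPDATED", "node", "info")
    SYNC_POLL_FAILED = ("SYNC_POLL_FAILED", "sync", "error")  # severity assumed

    @property
    def domain(self) -> str:
        return self.value[1]

    @property
    def severity(self) -> str:
        return self.value[2]


def dump(codes=AuditCode) -> str:
    """Emit `id,domain,severity` lines sorted by id."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for code in sorted(codes, key=lambda c: c.name):
        writer.writerow([code.name, code.domain, code.severity])
    return out.getvalue()


def drift(dump_a: str, dump_b: str) -> list[str]:
    """Lines present in exactly one dump, sorted; empty means in sync."""
    return sorted(set(dump_a.splitlines()) ^ set(dump_b.splitlines()))
```

A CI check along these lines would compare the Python dump against each Go dump and print the non-empty `drift` result as the list of offenders.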

### 5.1 Code tree

```
domain: node   (emitted by edge)
  Pipeline   PIPELINE_{CREATED, UPDATED, DEPLOYED, STOPPED, DELETED, IMPORTED, EXPORTED}
  Stage      PIPELINE_STAGE_{ADDED, REMOVED, REORDERED, CONFIG_UPDATED}
  Rule       RULE_{CREATED, UPDATED, DELETED, TRIGGERED, ERROR}
  Connector  CONNECTOR_{STARTED, STOPPED, RESTARTED, ERROR}
  Egress     EGRESS_{STARTED, STOPPED, RESTARTED, ERROR}
  Config     CONFIG_NODE_{CREATED, UPDATED, DELETED},
             CONFIG_{IMPORTED, EXPORTED, HOT_RELOADED, COLD_RESTART, VALIDATION_FAILED}
  Schedule   SCHEDULE_{CREATED, UPDATED, DELETED, ENABLED, DISABLED, TRIGGERED, FAILED}
  Service    SERVICE_{REGISTERED, UNREGISTERED, HEALTH_CHANGED}
  Buffer     BUFFER_{FLUSHED, RETENTION_APPLIED}
  System     SYSTEM_{STARTED, STOPPED, ERROR, HEALTH_DEGRADED, HEALTH_RECOVERED}
  Auth       AUTH_{LOGIN, LOGOUT, FAILED, TOKEN_ISSUED}

domain: sync   (edge-sync)
  SYNC_POLL_{STARTED, COMPLETED, FAILED}, SYNC_HEARTBEAT_SENT,
  SYNC_OFFLINE_QUEUE_DRAINED, SYNC_DEPLOY_{RECEIVED, APPLIED, ACKED},
  SYNC_RECONNECTED, SYNC_BACKOFF_INCREASED, SYNC_CLAIM_COMPLETED

domain: fleet  (edge-manager)
  NODE_{REGISTERED, CLAIMED, UNREACHABLE, DELETED},
  TEMPLATE_{CREATED, UPDATED, DELETED}, LOCATION_{CREATED, UPDATED, DELETED},
  DEPLOYMENT_{QUEUED, DISPATCHED, APPLIED, ROLLED_BACK, FAILED},
  FLEET_ALERT_FIRED, FLEET_POLICY_CHANGED
```

### 5.2 Code attributes (on each enum member)

| Attribute         | Example / values                              | Purpose                                                   |
|-------------------|-----------------------------------------------|-----------------------------------------------------------|
| `id`              | `PIPELINE_STAGE_CONFIG_UPDATED`               | SCREAMING_SNAKE, immutable                                |
| `domain`          | `node` / `sync` / `fleet`                     | UI filter + storage sink                                  |
| `category`        | `pipeline` / `stage` / …                      | Entity sub-namespace                                      |
| `action`          | `created` / `updated` / `triggered`           | Verb                                                      |
| `severity`        | `info` / `warning` / `error` / `critical`     | Alerting threshold                                        |
| `description`     | "A pipeline stage's config was edited."       | Human text (UI + docs)                                    |
| `emitter`         | `edge-gateway` / `edge-manager`               | Coverage test (LOG-21)                                    |
| `retention_class` | `short` / `medium` / `long`                   | Retention jobs (LOG-19)                                   |
| `high_volume`     | bool                                          | Sampler policy (LOG-20)                                   |
| `pii_in_detail`   | bool                                          | Stricter redaction + shorter retention                    |
| `declared_unused` | bool                                          | LOG-21 skip list                                          |

### 5.3 Anti-duplication rules

| #  | Rule                                                                                                              |
|----|-------------------------------------------------------------------------------------------------------------------|
| R1 | Has an audit code → don't also log at `info` to syslog. Audit is the record.                                       |
| R2 | Access logging belongs to middleware. Handlers never log "handled POST /x".                                        |
| R3 | Audit event ≠ error log. Both may fire for one failure; they correlate via `request_id`.                           |
| R4 | No re-logging across HTTP boundaries. Exception: downstream failure re-logged locally with enriched context.       |
| R5 | Connectors / egress log to syslog only. Never to API or audit.                                                     |
| R6 | Every SDK-emitted line is structured JSON. User-authored `log` actions are plain text. `console.log` / `log.Printf` prohibited. |

### 5.4 Semantics + replication

- **Observability, not a command log.** Events record *who/what/when/object/summary*. Replay does **not** reproduce state — use export/import + config snapshots.
- **Ordered.** `(timestamp, monotonic_seq)` is total per node; `edge_row_id` preserves order through edge→EM replication.
- **Idempotent on ingest.** Dedup key on EM = `(source_node_id, edge_row_id)`.
- **Replication.** Edge Sync gains a 5th step: ship unshipped rows to `POST /api/v1/logs/audit/ingest` on EM; mark `shipped_at`. Same offline queue as heartbeats — cycle stays non-blocking if EM is unreachable.
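
The ordering and idempotent-ingest bullets above can be illustrated with an in-memory SQLite table. The plan does not say which store EM uses, so SQLite, the trimmed column set, and the function names `ingest` and `timeline` are all assumptions made for the sketch.

```python
import sqlite3
from typing import Iterable

# Trimmed schema: only the columns needed to show dedup and ordering.
SCHEMA = """
CREATE TABLE IF NOT EXISTS audit_log (
    source_node_id TEXT NOT NULL,
    edge_row_id    TEXT NOT NULL,
    timestamp      TEXT NOT NULL,
    monotonic_seq  INTEGER NOT NULL,
    code           TEXT NOT NULL,
    PRIMARY KEY (source_node_id, edge_row_id)
)
"""


def ingest(conn: sqlite3.Connection, source_node_id: str, rows: Iterable[dict]) -> int:
    """Insert a batch; rows already seen are skipped. Returns the number newly stored."""
    before = conn.total_changes
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO audit_log VALUES (?, ?, ?, ?, ?)",
            [
                (source_node_id, r["edge_row_id"], r["timestamp"], r["monotonic_seq"], r["code"])
                for r in rows
            ],
        )
    return conn.total_changes - before


def timeline(conn: sqlite3.Connection, source_node_id: str) -> list[str]:
    """Row ids for one node in (timestamp, monotonic_seq) order."""
    cur = conn.execute(
        "SELECT edge_row_id FROM audit_log WHERE source_node_id = ? "
        "ORDER BY timestamp, monotonic_seq",
        (source_node_id,),
    )
    return [row[0] for row in cur]
```

Re-sending a batch after a dropped acknowledgement stores nothing new, which is what lets Edge Sync retry a shipment without tracking which rows EM already accepted.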

---

## 6 · Issues

27 self-contained issues, each ready for `gh issue create`. Given-When-Then acceptance criteria are drafted in the PR that opens each issue.

| ID      | Pri | Area            | Deps           | Scope                                                                                                                                                                  |
|---------|-----|-----------------|----------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| LOG-01  | P0  | edge            | —              | `AuditCode` enum rewrite with §5.2 attributes. `--dump` CLI. Import-time `AuditCatalogError` on duplicate/unknown domain.                                                 |
| LOG-02  | P0  | EM + ES         | —              | Go `audit` packages (EM + ES mirrored). Typed `Code` + `Meta(Code)` + `audit-dump` helper. Unknown codes rejected at emit.                                                |
| LOG-03  | P0  | CI              | 01, 02         | `scripts/check-audit-codes.sh` runs the three dumps, diffs; fails with exact offender names on any drift.                                                                 |
| LOG-04  | P0  | EM              | 02             | Canonical `/api/v1/logs/{api,sys,audit}`. Old `/logs`, `/syslogs` removed (404). Audit store on §4.4 shape; `action`+`category` columns dropped.                         |
| LOG-27  | P0  | edge            | 01             | Edge route rename to the same canonical shape: `/logs` → `/logs/sys`, `/audit` → `/logs/audit`, `/logs/api` stays. Old paths 404. `sys` returns both JSON and text rows.  |
| LOG-05  | P0  | edge            | 01             | Every `event_bus.publish("cmd/…", payload)` carries `_req`. Scheduler-origin → `scheduler-<run_id>`. SDK exposes `_req` to `onConfigChange`. Grep gate on naked publishes. |
| LOG-06  | P0  | edge-sync       | 02             | Mint `poll_request_id` at top of cycle. Outbound HTTP to EM + Edge carry `X-Request-ID`. Deploy rides with `X-Deploy-ID`; both land on edge's audit row.                  |
| LOG-07  | P0  | EM              | 02, 04         | Migrate 6 handler sites to `audit.Emit(ctx, audit.CodeX, detail)`. Delete old `audit.Log` helper + string constants in same PR. Drop action/category columns.             |
| LOG-08  | P1  | edge            | 01             | Emit `CONNECTOR_*`, `EGRESS_*`, `BUFFER_*` lifecycle from `services.py` + `buffer/manager.py`. Rate-limit `BUFFER_FLUSHED` per LOG-20. Coverage gate green.                |
| LOG-10  | P1  | edge            | 01             | Enforce R2: CI grep gate on `log.(info\|warning\|error).*"(GET\|POST\|…)"` in `routes/` → zero hits. Exactly one access-log row per request_id.                         |
| LOG-11  | P1  | edge-sync       | 02, 06, 09     | Emit `SYNC_*` codes via EM ingest (LOG-16); SQLite offline queue; cycle stays non-blocking with EM offline.                                                               |
| LOG-09  | P1  | edge-sync       | 06             | Port EM's `logbuf`. `/_internal/logs/api` + `/_internal/logs/sys` on management port (9004). Every cycle line carries `request_id`.                                       |
| LOG-23  | P1  | edge            | 01             | Config-tree per-path: `CONFIG_NODE_{CREATED,UPDATED,DELETED}` with detail `{path, old_value_hash, new_value, diff_preview≤200c, actor}`. Validation fail → `CONFIG_VALIDATION_FAILED`. Secret-shaped paths redacted. Imports stay coarse. |
| LOG-24  | P1  | edge            | 01             | Schedule full lifecycle: CREATED/DELETED/ENABLED/DISABLED/FAILED. Auto-fire `SCHEDULE_TRIGGERED` carries `scheduler-<run_id>`. ENABLED/DISABLED distinct from UPDATED.    |
| LOG-25  | P1  | edge            | 01, 08         | `SERVICE_REGISTERED/UNREGISTERED/HEALTH_CHANGED`. Dedup health on transition, not poll. Correlated view includes adjacent syslog lines.                                   |
| LOG-26  | P1  | edge            | 01             | Pipeline save stage-diff: emit `PIPELINE_STAGE_{ADDED,REMOVED,REORDERED,CONFIG_UPDATED}` as children of `PIPELINE_UPDATED` via `request_id`. Large configs hashed; diff_preview ≤200c. |
| LOG-12  | P2  | sdk-node        | 05             | `blocks/sdk/node/src/logger.js`. `this.log` on BaseConnector/BaseEgress. `AsyncLocalStorage` carries `request_id` through async handlers. Dev pretty-prints.              |
| LOG-13  | P2  | sdk-go          | 02             | `BaseService` ctx-aware slog. MQTT cmd `_req` populates handler ctx. `mqtt` + `rest-go` updated.                                                                       |
| LOG-14  | P2  | CI              | 12, 13         | Regex gate: no `console.(log\|warn\|error)` in JS `blocks/`; no `log.(Print\|Printf\|Println\|Fatal)` in Go. `.ci-logging-allowlist` for fixtures.                        |
| LOG-15  | P3  | edge            | 01             | `audit_log.shipped_at TIMESTAMPTZ NULL`. `ListUnshipped(100)` newest-first. `MarkShipped(ids)` atomic.                                                                    |
| LOG-16  | P3  | ES + EM         | 11, 15         | EM `POST /api/v1/logs/audit/ingest` (m2m auth). ES step 5 batches ≤100/req. Dedup `(source_node_id, edge_row_id)`. 500-row outage drains in batches on reconnect.         |
| LOG-17  | P3  | edge-ui         | 04             | Audit page: `node/sync/fleet` badge; filter by domain. Correlated side-panel: given `request_id` shows access-log row + all syslog lines (both shapes) chronologically.   |
| LOG-18  | P3  | em-ui           | 04, 16, 17     | EM `/audit`: fleet-wide; filters for domain, code, source_node_id, date. Node drill-down. Reuses LOG-17 correlated panel.                                                 |
| LOG-19  | P4  | ops             | —              | Prod compose: `logging.driver: local, max-size: 50m, max-file: 3` on every service. Dev compose unaffected.                                                               |
| LOG-20  | P4  | edge            | 01             | `high_volume:true` codes sampled at emit: default `sample_ratio: 0.1` (detail records ratio); per-code override via config_tree (`{per_sec, strategy:"token_bucket"}`); `AUDIT_SAMPLE_DROPPED` meta-event. |
| LOG-21  | P4  | CI              | 01, 02, 08, 11 | `scripts/check-emission-coverage.py`: unannotated + unemitted → fail; `declared_unused:true` passes silently. Matches both `AuditCode.X` (py) and `audit.CodeX` (go).     |
| LOG-22  | P4  | docs            | 01, 20         | MkDocs hook walks `AuditCode` → `docs/reference/audit-codes.md` grouped by domain. `mkdocs build --strict` diffs against committed copy.                                  |
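
The default ratio sampling described in LOG-20 might look roughly like this. It is a sketch under assumptions: the class name `RatioSampler` and its methods are hypothetical, the random source is injectable only to make the behaviour testable, and the per-code token-bucket override is not shown.

```python
import random
from typing import Callable


class RatioSampler:
    """Keep roughly `sample_ratio` of high-volume events and count the rest."""

    def __init__(
        self, sample_ratio: float = 0.1, rand: Callable[[], float] = random.random
    ) -> None:
        if not 0.0 < sample_ratio <= 1.0:
            raise ValueError("sample_ratio must be in (0, 1]")
        self.sample_ratio = sample_ratio
        self._rand = rand
        self.dropped = 0

    def should_emit(self, high_volume: bool) -> bool:
        """Codes that are not high-volume always pass; others pass with probability sample_ratio."""
        if not high_volume:
            return True
        if self._rand() < self.sample_ratio:
            return True
        self.dropped += 1
        return False

    def drain_dropped(self) -> int:
        """Return and reset the drop count, e.g. to report in an AUDIT_SAMPLE_DROPPED event."""
        count, self.dropped = self.dropped, 0
        return count
```

Recording `sample_ratio` in the emitted row's detail, as the issue asks, lets a dashboard scale sampled counts back up to an estimate of the true volume.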

---

## 7 · Ordering

```
P0   LOG-01 py enum ─┐                    ┌─ LOG-04 EM /logs/*
     LOG-02 go pkg  ─┼─ LOG-03 CI gate ──┼─ LOG-27 edge /logs/* rename
                     │                    ├─ LOG-05 MQTT _req
                     │                    ├─ LOG-06 ES poll id ── LOG-09 ES buf
                     │                    └─ LOG-07 EM handler migrate

P1   LOG-08 conn/egress/buf   LOG-23 config granular
     LOG-10 no handler access LOG-24 schedule lifecycle
     LOG-11 ES sync codes     LOG-25 service lifecycle
                              LOG-26 stage-level

P2   LOG-12 node SDK · LOG-13 go SDK · LOG-14 CI gate
P3   LOG-15 shipped_at → LOG-16 replication
     LOG-17 edge UI    → LOG-18 EM UI
P4   LOG-19 log driver · LOG-20 sampling · LOG-21 coverage · LOG-22 docs
```

27 issues. P0 is the critical path; once LOG-01/02/03 land, the remaining issues open up in parallel, subject to the Deps column in §6. Most issues are 1–3 days.

---

## 8 · Out of scope

| Item                                   | Why                                                         |
|----------------------------------------|-------------------------------------------------------------|
| Log-shipping backend (Loki/ELK/Splunk) | stdout JSON is the contract; any backend consumes it.       |
| Tracing (OTel/Jaeger)                  | §§4–5 schemas are OTel-compatible; add an exporter later.   |
| Connector SDK rewrite                  | Only the logger the SDK exposes changes.                    |
| Long-retention archive schema          | Deferred until retention math demands it.                   |
