01 · Evaluate
Know whether it works.
A golden dataset. Automated checks that run on every change. A regression suite covering prompts and tools alongside the code. Without evaluation, "it works" is a vibe, and vibes don't survive a model upgrade.
We use this on ContinuumState. Every commitment-extraction change runs through an eval before it ships. A 3% drift in accuracy is something we see, not something a customer reports.
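A minimal sketch of what an eval gate like this can look like, in plain Python. The names here (`GoldenCase`, `accuracy`, `passes_gate`) and the 0.03 absolute threshold are illustrative assumptions, not ContinuumState's actual code.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class GoldenCase:
    """One hand-labelled example from the golden dataset (hypothetical shape)."""
    text: str
    expected: str


def accuracy(extract: Callable[[str], str], cases: List[GoldenCase]) -> float:
    """Fraction of golden cases where the extractor output matches the label exactly."""
    if not cases:
        raise ValueError("golden dataset is empty")
    hits = sum(1 for case in cases if extract(case.text) == case.expected)
    return hits / len(cases)


def passes_gate(baseline: float, candidate: float, max_drop: float = 0.03) -> bool:
    """True when the candidate loses less than max_drop accuracy against the baseline.

    max_drop is an assumed absolute threshold; pick one that suits the task.
    """
    return (baseline - candidate) < max_drop
```

Run in CI, `passes_gate` fails the build when a prompt, tool, or code change costs more accuracy than the threshold allows, so the regression shows up before release.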
02 · Guard
Constrain the surface.
Structured outputs. Schema validation. Refusal patterns. Rate limits. Human-in-the-loop gates at the points that matter. Guardrails aren't censorship. They're the API contract the model has to honour.
An agent that can call your CRM should not be able to delete records. Obvious in code review. Easy to miss in a LangChain example. We write the contracts first.
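Writing the contract first can be as simple as an allowlist the agent's tool calls are checked against before anything executes. The sketch below is an assumption-laden illustration: the tool names (`crm.get_record`, `crm.update_note`), `ToolSpec`, and `validate_tool_call` are hypothetical and not part of LangChain or any CRM SDK.

```python
from dataclasses import dataclass


class ToolCallRejected(Exception):
    """Raised when a model-proposed tool call falls outside the contract."""


@dataclass(frozen=True)
class ToolSpec:
    required: frozenset
    optional: frozenset = frozenset()


# Hypothetical contract: read and annotate only. No delete tool is listed,
# so the agent cannot delete records no matter what the model asks for.
CRM_CONTRACT = {
    "crm.get_record": ToolSpec(required=frozenset({"record_id"})),
    "crm.update_note": ToolSpec(required=frozenset({"record_id", "note"})),
}


def validate_tool_call(name: str, args: dict, contract: dict = CRM_CONTRACT) -> None:
    """Reject tools that are not allowlisted and calls with missing or unexpected arguments."""
    spec = contract.get(name)
    if spec is None:
        raise ToolCallRejected(f"tool not in contract: {name}")
    keys = set(args)
    missing = spec.required - keys
    unexpected = keys - spec.required - spec.optional
    if missing or unexpected:
        raise ToolCallRejected(
            f"{name}: missing={sorted(missing)} unexpected={sorted(unexpected)}"
        )
```

The design choice is deny-by-default: a tool that is absent from the contract is rejected, so a new capability has to be added deliberately and shows up in code review.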
03 · Observe
See what it's doing.
Traces, prompts, retrievals, tool calls, latency, cost, refusals, retries. Per-user, per-request, per-version. If you can't answer "why did it say that, on Tuesday, to that customer?", you're flying blind.
We wire Langfuse in early on every system we build. Cost dashboards that break down by feature as well as by month. Drift detection on retrieval recall.
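To make "per-user, per-request, per-version" concrete, here is a minimal sketch of the kind of record a trace needs to carry and two queries over it. It deliberately does not use the Langfuse SDK; `TraceRecord` and its fields are assumed for illustration only.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class TraceRecord:
    """One request's trace summary (hypothetical fields)."""
    request_id: str
    user_id: str
    feature: str
    prompt_version: str
    latency_ms: float
    cost_usd: float
    refused: bool = False


def cost_by_feature(records: Iterable[TraceRecord]) -> Dict[str, float]:
    """Total spend per feature, the breakdown a cost dashboard would plot."""
    totals: Dict[str, float] = defaultdict(float)
    for record in records:
        totals[record.feature] += record.cost_usd
    return dict(totals)


def find_traces(
    records: Iterable[TraceRecord],
    user_id: str,
    prompt_version: Optional[str] = None,
) -> List[TraceRecord]:
    """Traces for one user, optionally narrowed to a single prompt version."""
    return [
        record
        for record in records
        if record.user_id == user_id
        and (prompt_version is None or record.prompt_version == prompt_version)
    ]
```

With user, request, and version stored on every record, "why did it say that, to that customer?" becomes a filter rather than an investigation.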
04 · Govern
Keep the trail.
Versioned prompts and models. Document-level access controls. Audit logs that survive a subpoena. Approval workflows for the changes that matter. Governance is what makes the system legible to legal, compliance, and the post-incident review, not only to the engineers who built it.
fasten (our open-source audit substrate) is the layer underneath. Typed events, correlated across services, tamper-evident. Built because no one else's audit layer survived our own systems.
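One common way to make a log tamper-evident is a hash chain, where each entry's hash covers the previous entry's hash. The sketch below illustrates that general technique with the Python standard library; it is not fasten's API or implementation, and `append_event` and `verify` are hypothetical names.

```python
import hashlib
import json
from typing import List

GENESIS = "0" * 64  # assumed placeholder hash for the first entry


def _digest(prev_hash: str, event: dict) -> str:
    """SHA-256 over the previous hash plus a canonical JSON encoding of the event."""
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_event(log: List[dict], event: dict) -> None:
    """Append an event, chaining it to the hash of the entry before it."""
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"event": event, "prev": prev, "hash": _digest(prev, event)})


def verify(log: List[dict]) -> bool:
    """Recompute the chain; any edited, reordered, or removed entry breaks it."""
    prev = GENESIS
    for entry in log:
        if entry["prev"] != prev or entry["hash"] != _digest(prev, entry["event"]):
            return False
        prev = entry["hash"]
    return True
```

A chain like this detects tampering but does not prevent it. Catching truncation of the tail also requires anchoring the latest hash somewhere the writer cannot modify.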