Evaluate
Does it answer correctly?
Grounded citations, hallucination, drift on the queries that matter. Graded, not vibed.
Offering · AI Decision Audit
A regulator asks your team one question: "Why did your system decide this?"
Your logs show the agent ran. They don't prove what it concluded, on what basis, or that the record wasn't changed after. A two-week audit of your production AI across the four practices of Operational Systems Engineering. Read-only, nothing installed, fixed price. You leave with findings, a costed roadmap, and a tamper-evident record.
The Gap
As AI moves from answering questions to making decisions (approving loans, clearing customers, pursuing claims), logs and traces stop being enough.
Most platforms stop at observability. Observability tells you the system ran. It does not prove why it decided.
Observability → Auditability → Decision Provenance → Governance.
Where Accountability Lives
The audit goes deeper than "is it up?" to "can you prove what it decided?"
The system ran. Latency, errors, traces.
What happened, recorded.
What it believed, the evidence it used, and why it acted. The gap.
Enforce policy, and prove it tamper-evidently.
What We Check
The discipline behind it: Operational Systems Engineering.
Evaluate
Grounded citations, hallucination, drift on the queries that matter. Graded, not vibed.
Guard
Prompt injection, PII leakage, decision override, fired at your agent as a black box.
Observe
Latency P50/P95/P99, token cost, error rate, and where the next failure is most likely.
Govern
Decision provenance and a tamper-evident audit trail. The part a regulator actually accepts.
Deliverables
Concrete artifacts you can act on — not a slide deck.
01
By severity, each with the evidence that produced it, the impact, the fix, and the regulation it touches (RBI · DPDP · FCRA · OWASP LLM Top 10).
02
A hash-chained record of every probe and decision we ran. Queryable, independently verifiable. Built on Fasten, our open-source audit substrate.
03
Critical and High items only, sequenced for shipping. Effort estimates for your team, tied to the risk each one retires.
04
Everything in a written report your team can act on, plus a live 60-minute walkthrough with your engineering and product leads.
How It Works
60-minute kickoff. Read-only access to the agent endpoint. Nothing installed in your environment. Day 0–1.
We fire targeted probes (injection, eval, latency) at your system, grade the results, and fact-check findings with your team.
Written report, costed roadmap, the tamper-evident audit DB, and a live walkthrough. You leave with a clear next step. No commitment beyond that.
Why Us
We run our own production AI and edge systems (ContinuumState, EdgeBits, fasten). The patterns we audit against are the ones we've already passed real-world audits on, and the toolchain is open: Fasten is Apache-2.0.
What Comes After
Each of the four practices the audit grades has a build-side counterpart: Policy Enforcement, Zero-Trust Identity, Sandbox Evaluation, Reliability Engineering. We deliver them on Agent Engineering. Same toolchain throughout (Fasten for audit, Langfuse for eval, MCP for tools); same discipline above both (Operational Systems Engineering).
Audit with us, then build with us, or hand the costed roadmap to your team. Your call.
Start Free
The lowest-risk first step. Minutes, not weeks.
We fire a curated set of known prompt-injection patterns at your agent (read-only, on your own machine if you prefer), and send back a one-page result: which attacks got through, what they exposed, and the regulation each touches.
It's the Guard pillar of the full audit. No commitment. If it surfaces something (and it usually does), the 2-week AI Decision Audit covers the other three pillars and the tamper-evident record.
By then, the evidence either exists or it doesn't. Start with a free scan, or scope the full audit. In two weeks you'll have a clear, costed, provable path forward.