Not another eval dashboard. Assevra turns agent outputs you have already captured into a portable, signed reliability scorecard — every number backed by a 95% confidence interval, runnable offline, ready to gate CI and stand up in an audit.
What it does
From a labeled dataset to a signed, framework-mapped artifact — the whole arc of proving an agent behaves, without a backend or a login.
Emits a self-contained scorecard — Markdown, JSON, and styled HTML — you can commit, attach to a PR, or mail to a reviewer. Sign it with Ed25519 so anyone can verify it was produced by you and never altered.
assevra sign · verify
A bare "0.92" hides how few samples it came from. Every dimension carries a 95% Wilson confidence interval, so nobody over-reads a small-sample move, a level of rigor the field is only starting to adopt.
95% Wilson CI
No blank-page JSONL. Point bootstrap at logs you already have (generic traces, OpenAI chat logs, or OpenTelemetry spans) and it drafts the dataset, leaving only the answer key for you.
Score with a panel of models (a jury) and surface disagreement as a signal. Then calibrate checks that the judge agrees with humans: it computes Cohen's κ against a labeled hold-out and requires κ ≥ 0.85.
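To make the κ bar concrete, here is a minimal sketch of Cohen's κ for two raters. The function name `cohens_kappa` and the sample labels are illustrative assumptions, not Assevra's API; the tool's own calibrate step may compute or report the statistic differently.

```python
from collections import Counter


def cohens_kappa(judge, human):
    """Cohen's kappa for two equal-length label sequences (hypothetical helper)."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge)
    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(j == h for j, h in zip(judge, human)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    judge_freq, human_freq = Counter(judge), Counter(human)
    expected = sum(judge_freq[lab] * human_freq[lab] for lab in judge_freq) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Made-up hold-out of 10 items: the judge disagrees with the human on one.
human = ["pass"] * 6 + ["fail"] * 4
judge = ["pass"] * 5 + ["fail"] * 5
kappa = cohens_kappa(judge, human)
print(f"kappa={kappa:.2f}, clears 0.85 bar: {kappa >= 0.85}")
```

In this made-up hold-out the raw agreement is 90%, yet κ comes out at 0.80 because half of that agreement is expected by chance, so the judge would not clear a 0.85 bar.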
Track reliability over time and fail the build when a dimension drops. A drop is flagged only when the move falls outside the previous interval or crosses a threshold, so ordinary sampling noise does not trip the gate.
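The flagging rule above can be sketched as a small predicate. Everything here is an assumption made for illustration: the name `is_regression`, reading "outside the previous interval" as "below its lower bound", and reading "crosses a threshold" as "was at or above it, now below". The actual gate may apply different details.

```python
def is_regression(current, prev_score, prev_ci_low, threshold):
    """Hypothetical regression check for one dimension.

    Flags a run when the new score drops below the previous run's lower
    confidence bound, or when it moves from passing to failing the threshold.
    """
    dropped_outside_interval = current < prev_ci_low
    crossed_threshold = prev_score >= threshold > current
    return dropped_outside_interval or crossed_threshold


# Previous run: score 0.94, 95% CI lower bound 0.87, threshold 0.90.
print(is_regression(0.91, 0.94, 0.87, 0.90))  # inside interval, still passing
print(is_regression(0.89, 0.94, 0.87, 0.90))  # inside interval, but now below threshold
print(is_regression(0.85, 0.94, 0.87, 0.90))  # below the previous lower bound
```

Only the second and third calls would fail the build under these assumptions; the first is a move the previous interval already allowed for.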
--history · --fail-on-regression
The attest command turns a scorecard into an Agent Card that maps your evidence to the EU AI Act, NIST AI RMF, ISO/IEC 42001, and OWASP LLM Top 10, the artifact a procurement review is looking for.
The artifact
Four independent dimensions, each scored against a fixed threshold with an interval and a sample size. The verdict is a conjunction — one leak sinks the run.
| Dimension | Mode | Score | 95% CI | n | Threshold | Result |
|---|---|---|---|---|---|---|
| Grounding | llm-judge | 0.94 | 0.87–0.97 | 80 | 0.90 | PASS |
| Safety / refusal | llm-judge | 1.00 | 0.94–1.00 | 60 | 1.00 | PASS |
| PII-leak | deterministic | 1.00 | 0.95–1.00 | 72 | 1.00 | PASS |
| Task-completion | deterministic | 0.96 | 0.90–0.98 | 90 | 0.90 | PASS |
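For readers who want to see where such intervals come from, the following is a minimal sketch of the standard Wilson score interval for a binomial proportion. It assumes each item is a simple pass/fail and that the score is the pass fraction; `wilson_interval` is an illustrative name, not a function exposed by Assevra.

```python
import math


def wilson_interval(successes, n, z=1.959963984540054):
    """Wilson score interval for a pass rate; the default z gives a 95% level."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


# A perfect score on 60 items still leaves a lower bound well under 1.0.
low, high = wilson_interval(60, 60)
print(f"{low:.2f}-{high:.2f}")  # 0.94-1.00

low72, high72 = wilson_interval(72, 72)
print(f"{low72:.2f}-{high72:.2f}")  # 0.95-1.00
```

Under that pass/fail assumption, the two perfect-score cases reproduce the 0.94–1.00 (n = 60) and 0.95–1.00 (n = 72) ranges shown in the table, which illustrates why a 1.00 on a small sample is weaker evidence than it looks.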
The methodology
Two principles run through all of it: deterministic before judge, and report the interval, not just the mean. The full spec is public and versioned.
Grounding: Is every factual claim traceable to the provided context, or invented?
Safety / refusal: Does the agent refuse what it must, and answer what it should?
PII-leak: Does personal data escape into an output? Zero tolerance on hard entities.
Task-completion: Are the facts a correct completion requires actually present?
Plus pass^k and run-to-run consistency over repeated trials, because a deployed agent needs to work every time, not just once.
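This page does not define pass^k, so the sketch below assumes its usual meaning in agent benchmarks: the probability that k independent trials of the same task all succeed. It uses a common combinatorial estimator from n recorded trials with c successes, C(c, k) / C(n, k). The function name `pass_hat_k` is illustrative, and the published specification may define the metric differently.

```python
from math import comb


def pass_hat_k(n_trials, n_success, k):
    """Estimate the chance that k trials drawn from the recorded ones all pass."""
    if not 0 < k <= n_trials:
        raise ValueError("k must satisfy 0 < k <= n_trials")
    if not 0 <= n_success <= n_trials:
        raise ValueError("n_success must satisfy 0 <= n_success <= n_trials")
    # comb(n_success, k) is 0 when there are fewer than k successes.
    return comb(n_success, k) / comb(n_trials, k)


# One task, 8 repeated trials, 7 of them successful.
print(pass_hat_k(8, 7, 1))  # 0.875
print(pass_hat_k(8, 7, 4))  # 0.5
```

A single failure in eight trials leaves the one-shot rate at 0.875 but halves the estimate for four consecutive successes, which is the "every time, not just once" point in numbers.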
Verifiable evidence
A shared HTML file is convenient; a signed one is evidence. Pin the maintainer's public key to confirm a scorecard was produced by them and not altered.
Verification fails if a single byte changed, or if it was signed by any other key.
Details in SECURITY.md.
Divi, Veera Ravindra. Assevra: A Reliability Scorecard for LLM Agents, v0.3, 2026.