Assevra — the reliability scorecard for LLM agents

What it does

A full reliability toolkit, one command away

From a labeled dataset to a signed, framework-mapped artifact — the whole arc of proving an agent behaves, without a backend or a login.

A signed artifact, not a dashboard

Emits a self-contained scorecard — Markdown, JSON, and styled HTML — you can commit, attach to a PR, or mail to a reviewer. Sign it with Ed25519 so anyone can verify it was produced by you and never altered.

assevra sign · verify

Honest error bars

A bare "0.92" hides how few samples it came from. Every dimension carries a 95% Wilson confidence interval, so nobody over-reads a small-sample move — rigor the field is only starting to adopt.

95% Wilson CI

Start from your traces

No blank-page JSONL. Point bootstrap at logs you already have — generic traces, OpenAI chat logs, or OpenTelemetry spans — and it drafts the dataset, leaving only the answer key for you.

assevra bootstrap

Trustworthy judging

Score with a panel of models (a jury) and surface disagreement as a signal. Then calibrate checks that the judge agrees with humans: it computes Cohen's κ against a labeled hold-out and requires κ ≥ 0.85.

--judge-panel · calibrate
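
As a rough illustration of the calibration check, the sketch below computes Cohen's κ between judge labels and human labels and compares it to the 0.85 bar. The function name, the label strings, and the sample data are assumptions made for the example; only the κ formula and the 0.85 bar come from the text above.

```python
from collections import Counter

CALIBRATION_BAR = 0.85  # the κ bar stated above


def cohens_kappa(judge: list[str], human: list[str]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance.

    Hypothetical helper for illustration; not part of the Assevra API.
    """
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    judge_freq = Counter(judge)
    human_freq = Counter(human)
    labels = judge_freq.keys() | human_freq.keys()
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(judge_freq[c] * human_freq[c] for c in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up hold-out: the judge disagrees with the human on one item out of ten.
human_labels = ["pass"] * 8 + ["fail"] * 2
judge_labels = ["pass"] * 7 + ["fail"] * 3
kappa = cohens_kappa(judge_labels, human_labels)
judge_is_calibrated = kappa >= CALIBRATION_BAR
```

In this made-up hold-out the raw agreement is 90%, yet κ comes out near 0.74 and misses the bar, because most of that agreement is what two raters who mostly say "pass" would reach by chance.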

Catch silent regressions

Track reliability over time and fail the build when a dimension drops. A drop is flagged only when the new score falls outside the previous interval or crosses a threshold, so ordinary sampling noise is unlikely to trigger a false alarm.

--history · --fail-on-regression
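
The flagging rule can be sketched as follows. This is one reading of the rule, not Assevra's implementation: the `DimensionResult` type, the `is_regression` function, and the exact interpretation of "crosses a threshold" (passing before, failing now) are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DimensionResult:
    """One dimension from one run (hypothetical structure for illustration)."""
    score: float
    ci_low: float
    ci_high: float
    threshold: float


def is_regression(previous: DimensionResult, current: DimensionResult) -> bool:
    """Flag a drop only if it leaves the previous interval or crosses the threshold."""
    dropped_outside_interval = current.score < previous.ci_low
    # Assumed meaning of "crosses a threshold": was passing, is now failing.
    crossed_threshold = (
        previous.score >= previous.threshold
        and current.score < current.threshold
    )
    return dropped_outside_interval or crossed_threshold


baseline = DimensionResult(score=0.94, ci_low=0.87, ci_high=0.97, threshold=0.90)

# 0.91 is lower than 0.94 but inside the old interval and above the threshold.
noise = DimensionResult(score=0.91, ci_low=0.83, ci_high=0.96, threshold=0.90)
# 0.85 is below the old lower bound of 0.87.
real_drop = DimensionResult(score=0.85, ci_low=0.76, ci_high=0.91, threshold=0.90)

noise_flagged = is_regression(baseline, noise)
drop_flagged = is_regression(baseline, real_drop)
```

Under these assumptions the move from 0.94 to 0.91 is treated as noise, while the move to 0.85 is flagged. A CI job could exit non-zero whenever any dimension is flagged.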

Map to governance frameworks

The attest command turns a scorecard into an Agent Card that maps your evidence to the EU AI Act, NIST AI RMF, ISO/IEC 42001, and OWASP LLM Top 10 — the artifact a procurement review is looking for.

assevra attest

The artifact

A scorecard you can defend

Four independent dimensions, each scored against a fixed threshold with an interval and a sample size. The verdict is a conjunction — one leak sinks the run.

Assevra Reliability Scorecard · Overall: PASS

Dimension	Mode	Score	95% CI	n	Threshold	Result
Grounding	llm-judge	0.94	0.87–0.97	80	0.90	PASS
Safety / refusal	llm-judge	1.00	0.94–1.00	60	1.00	PASS
PII-leak	deterministic	1.00	0.95–1.00	72	1.00	PASS
Task-completion	deterministic	0.96	0.90–0.98	90	0.90	PASS

Illustrative example · every score carries its sample size and a 95% Wilson interval · signable with Ed25519.
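
The conjunction rule is simple enough to sketch directly. The snippet below reuses the illustrative scores and thresholds from the table; the function name `overall_verdict` and the tuple layout are assumptions, not Assevra's data model.

```python
# (dimension, score, threshold), copied from the illustrative table above.
ROWS = [
    ("Grounding", 0.94, 0.90),
    ("Safety / refusal", 1.00, 1.00),
    ("PII-leak", 1.00, 1.00),
    ("Task-completion", 0.96, 0.90),
]


def overall_verdict(rows: list[tuple[str, float, float]]) -> str:
    """PASS only if every dimension meets its own threshold (hypothetical helper)."""
    return "PASS" if all(score >= threshold for _, score, threshold in rows) else "FAIL"


verdict = overall_verdict(ROWS)

# A single leaked record takes PII-leak below its zero-tolerance threshold.
rows_with_leak = [
    (name, 71 / 72 if name == "PII-leak" else score, threshold)
    for name, score, threshold in ROWS
]
verdict_with_leak = overall_verdict(rows_with_leak)
```

With the table's values the verdict is PASS. Changing only the PII-leak score to 71 of 72 flips the overall result to FAIL, even though the other three dimensions are untouched.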


The methodology

Four dimensions, defined and thresholded

Two principles run through all of it: deterministic before judge, and report the interval, not just the mean. The full spec is public and versioned.

Grounding / faithfulness

llm-judge

Is every factual claim traceable to the provided context, or invented?

Pass rate ≥ 0.90

Safety / refusal

llm-judge

Does the agent refuse what it must — and answer what it should?

1.00 · zero tolerance

PII-leak

deterministic

Does personal data escape into an output? Zero tolerance on hard entities.

1.00 · zero tolerance
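
A deterministic leak check can be as plain as pattern matching. The sketch below is only an illustration of the idea: the two patterns (email address, US Social Security number), the names `HARD_ENTITY_PATTERNS`, `find_pii`, and `pii_pass_rate`, and the choice of entities are all assumptions, not Assevra's actual detector or entity list.

```python
import re

# Hypothetical "hard entity" patterns; a real detector would cover more types.
HARD_ENTITY_PATTERNS = {
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def find_pii(text: str) -> list[tuple[str, str]]:
    """Return (entity_type, matched_text) for every hard entity found in text."""
    return [
        (kind, match.group())
        for kind, pattern in HARD_ENTITY_PATTERNS.items()
        for match in pattern.finditer(text)
    ]


def pii_pass_rate(outputs: list[str]) -> float:
    """Fraction of outputs with no hard entity; zero tolerance means this must be 1.0."""
    if not outputs:
        raise ValueError("outputs must not be empty")
    clean = sum(1 for output in outputs if not find_pii(output))
    return clean / len(outputs)


agent_outputs = [
    "Your order has shipped and should arrive Tuesday.",
    "I have updated the record for jane@example.com.",
]
leaks = find_pii(agent_outputs[1])
rate = pii_pass_rate(agent_outputs)
```

Because the check is a fixed set of patterns rather than a model call, the same output always gets the same result, which is what makes a zero-tolerance threshold workable.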

Task-completion

deterministic

Does the output contain the facts a correct completion requires?

Pass rate ≥ 0.90

Plus pass^k and run-to-run consistency over repeated trials — because a deployed agent needs to work every time, not just once. Read the specification →
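
For readers unfamiliar with pass^k, the sketch below uses a common estimator from agent benchmarks: given c successes in n repeated trials of one task, the chance that k trials drawn without replacement all succeed is C(c, k) / C(n, k). Whether Assevra computes it exactly this way is an assumption; the function name `pass_hat_k` is hypothetical.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate pass^k: the probability that k independent attempts all succeed.

    Hypothetical helper for illustration; not part of the Assevra API.
    """
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    if not 0 < k <= trials:
        raise ValueError("k must be between 1 and trials")
    # comb(successes, k) is 0 when there are fewer than k successes.
    return comb(successes, k) / comb(trials, k)


# A task that succeeds 9 times out of 10 looks solid on a single attempt...
single_attempt = pass_hat_k(9, 10, 1)
# ...but is noticeably less dependable when it must succeed 3 times in a row.
three_in_a_row = pass_hat_k(9, 10, 3)
```

Here a 0.9 single-attempt rate falls to 0.7 at k = 3, which is the gap between "works once" and "works every time".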

Verifiable evidence

Signed, so a reviewer can trust it

A shared HTML file is convenient; a signed one is evidence. Pin the maintainer's public key to confirm a scorecard was produced by them and not altered.

Maintainer signing key (Ed25519)

Verification fails if a single byte changed, or if it was signed by any other key.

dEcTKT/9ThXewTjRdBm2qyGIH69Ghy08kVuB19AJnSg=

assevra verify --scorecard scorecard.json --signature scorecard.sig.json --public-key <this key>

Details in SECURITY.md.
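
Before pinning a key, it can help to sanity-check what was copied. The sketch below assumes the published string is standard base64 of a raw Ed25519 public key, which is always 32 bytes; the function name is hypothetical, and this check does not replace running assevra verify.

```python
import base64

# The maintainer key as printed above.
MAINTAINER_KEY_B64 = "dEcTKT/9ThXewTjRdBm2qyGIH69Ghy08kVuB19AJnSg="


def decode_ed25519_public_key(b64_key: str) -> bytes:
    """Decode a base64 key and confirm it has the 32-byte Ed25519 length.

    Assumes raw-key base64 encoding; hypothetical helper for illustration.
    """
    # validate=True rejects characters outside the base64 alphabet.
    raw = base64.b64decode(b64_key, validate=True)
    if len(raw) != 32:
        raise ValueError(f"expected 32 bytes for an Ed25519 public key, got {len(raw)}")
    return raw


pinned_key = decode_ed25519_public_key(MAINTAINER_KEY_B64)
```

A truncated or mistyped key fails here with a clear error, instead of surfacing later as a signature mismatch.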

Cite Assevra

Divi, Veera Ravindra. Assevra: A Reliability Scorecard for LLM Agents, v0.3, 2026.

DOI 10.5281/zenodo.21200852 · doi.org/10.5281/zenodo.21200852 · a CITATION.cff is included