Open-source · a named reliability methodology

The reliability scorecard
for your LLM agents.

Not another eval dashboard. Assevra turns agent outputs you have already captured into a portable, signed reliability scorecard — every number backed by a 95% confidence interval, runnable offline, ready to gate CI and stand up in an audit.

MIT licensed · Python 3.10+ · zero-dependency core · DOI 10.5281/zenodo.21200852
$ pip install assevra
Speaks the language of your security review

EU AI Act · NIST AI RMF · ISO/IEC 42001 · OWASP LLM Top 10

Assevra maps measured evidence to these control families as due-care evidence, not a certification.

What it does

A full reliability toolkit, one command away

From a labeled dataset to a signed, framework-mapped artifact — the whole arc of proving an agent behaves, without a backend or a login.

A signed artifact, not a dashboard

Emits a self-contained scorecard — Markdown, JSON, and styled HTML — you can commit, attach to a PR, or mail to a reviewer. Sign it with Ed25519 so anyone can verify it was produced by you and never altered.

assevra sign · verify

Honest error bars

A bare "0.92" hides how few samples it came from. Every dimension carries a 95% Wilson confidence interval, so nobody over-reads a small-sample move — rigor the field is only starting to adopt.

95% Wilson CI
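
To make the small-sample point concrete, here is a minimal sketch of the standard Wilson score interval. It is the textbook formula, not Assevra's own code; the function name `wilson_interval` is illustrative.

```python
import math


def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate; z = 1.96 gives roughly 95% coverage."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= passes <= n:
        raise ValueError("passes must be between 0 and n")
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


# The same observed 0.92, backed by very different amounts of evidence.
small = wilson_interval(23, 25)
large = wilson_interval(920, 1000)
print(f"n=25:   {small[0]:.2f} to {small[1]:.2f}")
print(f"n=1000: {large[0]:.2f} to {large[1]:.2f}")
```

The interval from 25 samples is several times wider than the one from 1000, which is the information a bare "0.92" throws away.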

Start from your traces

No blank-page JSONL. Point bootstrap at logs you already have — generic traces, OpenAI chat logs, or OpenTelemetry spans — and it drafts the dataset, leaving only the answer key for you.

assevra bootstrap

Trustworthy judging

Score with a panel of models (a jury) and surface disagreement as a signal. The calibrate command then checks that the judge agrees with humans: Cohen's κ against a labeled hold-out, with a κ ≥ 0.85 bar.

--judge-panel · calibrate
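
For readers who want to see what the κ bar measures, this is a minimal sketch of Cohen's κ between a judge's labels and human labels on the same hold-out items. The function name and the example labels are hypothetical, and this is the standard two-rater formula rather than Assevra's implementation.

```python
from collections import Counter


def cohens_kappa(judge: list[str], human: list[str]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    judge_counts, human_counts = Counter(judge), Counter(human)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(judge_counts[c] * human_counts[c] for c in judge_counts) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; kappa is undefined,
        # and returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical hold-out: the judge disagrees with the human on one item in ten.
judge_labels = ["pass"] * 8 + ["fail"] * 2
human_labels = ["pass"] * 7 + ["fail"] * 3
kappa = cohens_kappa(judge_labels, human_labels)
print(f"kappa = {kappa:.2f}, meets 0.85 bar: {kappa >= 0.85}")
```

Raw agreement in this example is 90%, yet κ lands well below 0.85 because most of that agreement is expected by chance when "pass" dominates.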

Catch silent regressions

Track reliability over time and fail the build when a dimension drops. A move is flagged only when it falls outside the previous interval or crosses a threshold, so ordinary sampling noise is far less likely to raise a false alarm.

--history · --fail-on-regression
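
The flagging rule can be sketched in a few lines. This is an illustration under stated assumptions, not Assevra's code: it assumes only downward moves count, that "outside the previous interval" means below its lower bound, and that "crosses a threshold" means the previous run met the threshold and the new one does not. The function name `is_regression` is hypothetical.

```python
def is_regression(
    prev_score: float,
    prev_ci: tuple[float, float],
    new_score: float,
    threshold: float,
) -> bool:
    """Flag a drop that leaves the previous interval or newly misses the threshold."""
    prev_lower, _prev_upper = prev_ci
    fell_outside_interval = new_score < prev_lower
    crossed_threshold = prev_score >= threshold > new_score
    return fell_outside_interval or crossed_threshold


# A small dip inside the previous interval is treated as noise.
print(is_regression(0.94, (0.87, 0.97), 0.92, threshold=0.90))  # False
# A drop below the previous lower bound is flagged.
print(is_regression(0.94, (0.87, 0.97), 0.85, threshold=0.90))  # True
```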

Map to governance frameworks

The attest command turns a scorecard into an Agent Card that maps your evidence to the EU AI Act, NIST AI RMF, ISO/IEC 42001, and OWASP LLM Top 10 — the artifact a procurement review is looking for.

assevra attest

The artifact

A scorecard you can defend

Four independent dimensions, each scored against a fixed threshold with an interval and a sample size. The verdict is a conjunction — one leak sinks the run.

Assevra Reliability Scorecard · Overall: PASS

Dimension         Mode           Score  95% CI     n   Thr.  Result
Grounding         llm-judge      0.94   0.87–0.97  80  0.90  PASS
Safety / refusal  llm-judge      1.00   0.94–1.00  60  1.00  PASS
PII-leak          deterministic  1.00   0.95–1.00  72  1.00  PASS
Task-completion   deterministic  0.96   0.90–0.98  90  0.90  PASS
Illustrative example · every score carries its sample size and a 95% Wilson interval · signable with Ed25519.
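
The conjunction rule is easy to state in code. The sketch below reuses the illustrative scores and thresholds from the scorecard above. The `Dimension` class, the `overall_verdict` function, and the "FAIL" label are assumptions made for this example and do not describe Assevra's JSON schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimension:
    name: str
    score: float
    threshold: float

    @property
    def passed(self) -> bool:
        # A dimension passes when its point score meets the fixed threshold.
        return self.score >= self.threshold


def overall_verdict(dimensions: list[Dimension]) -> str:
    """Conjunction: the run passes only if every dimension passes."""
    if not dimensions:
        raise ValueError("a scorecard needs at least one dimension")
    return "PASS" if all(d.passed for d in dimensions) else "FAIL"


scorecard = [
    Dimension("Grounding", 0.94, 0.90),
    Dimension("Safety / refusal", 1.00, 1.00),
    Dimension("PII-leak", 1.00, 1.00),
    Dimension("Task-completion", 0.96, 0.90),
]
print(overall_verdict(scorecard))

# One leaked record out of 72 is enough to sink the whole run.
leaky = [d if d.name != "PII-leak" else Dimension("PII-leak", 71 / 72, 1.00) for d in scorecard]
print(overall_verdict(leaky))
```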

The methodology

Four dimensions, defined and thresholded

Two principles run through all of it: deterministic before judge, and report the interval, not just the mean. The full spec is public and versioned.

Grounding / faithfulness

llm-judge

Is every factual claim traceable to the provided context, or invented?

Pass rate ≥ 0.90

Safety / refusal

llm-judge

Does the agent refuse what it must — and answer what it should?

1.00 · zero tolerance

PII-leak

deterministic

Does personal data escape into an output? Zero tolerance on hard entities.

1.00 · zero tolerance

Task-completion

deterministic

Are the facts a correct completion requires actually present?

Pass rate ≥ 0.90

Plus pass^k and run-to-run consistency over repeated trials — because a deployed agent needs to work every time, not just once. Read the specification →
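
This page does not define pass^k, so the sketch below assumes the definition commonly used in agent benchmarks: the probability that k independent trials of the same task all succeed, estimated per task as C(c, k) / C(n, k) from c successes in n trials. Check the specification for the exact definition Assevra uses; the function name is illustrative.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate the chance that k trials of one task all succeed: C(c, k) / C(n, k)."""
    if not 1 <= k <= trials:
        raise ValueError("k must be between 1 and the number of trials")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    # comb(successes, k) is 0 when k > successes, so the estimate is 0.0 there.
    return comb(successes, k) / comb(trials, k)


# A task that succeeds 9 times out of 10 looks strong at k=1 and much weaker at k=5.
for k in (1, 2, 5):
    print(f"pass^{k} = {pass_hat_k(9, 10, k):.2f}")
```

Under this definition, a 90% single-run success rate drops to 0.50 at k=5, which is why repeated trials say more about a deployed agent than a single pass rate.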

Verifiable evidence

Signed, so a reviewer can trust it

A shared HTML file is convenient; a signed one is evidence. Pin the maintainer's public key to confirm a scorecard was produced by them and not altered.

Maintainer signing key (Ed25519)

Verification fails if a single byte changed, or if it was signed by any other key.

dEcTKT/9ThXewTjRdBm2qyGIH69Ghy08kVuB19AJnSg=
assevra verify --scorecard scorecard.json --signature scorecard.sig.json --public-key <this key>

Details in SECURITY.md.
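
Before passing a key to assevra verify, a reviewer can check that the key they were handed is the pinned one. The sketch below uses only the Python standard library and assumes the published key is a raw 32-byte Ed25519 public key in standard base64; it does not verify a signature, and the function names are hypothetical.

```python
import base64
import hmac

# The maintainer key published on this page.
PINNED_KEY_B64 = "dEcTKT/9ThXewTjRdBm2qyGIH69Ghy08kVuB19AJnSg="


def decode_ed25519_public_key(key_b64: str) -> bytes:
    """Decode a base64 key and check it has the 32-byte length of an Ed25519 public key."""
    raw = base64.b64decode(key_b64, validate=True)
    if len(raw) != 32:
        raise ValueError(f"expected 32 bytes, got {len(raw)}")
    return raw


def matches_pinned_key(candidate_b64: str) -> bool:
    """Return True only if the candidate decodes to exactly the pinned key bytes."""
    try:
        candidate = decode_ed25519_public_key(candidate_b64)
    except ValueError:
        # Covers malformed base64 as well as wrong-length keys.
        return False
    return hmac.compare_digest(candidate, decode_ed25519_public_key(PINNED_KEY_B64))


print(matches_pinned_key(PINNED_KEY_B64))
```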

Cite Assevra

Divi, Veera Ravindra. Assevra: A Reliability Scorecard for LLM Agents, v0.3, 2026.

DOI 10.5281/zenodo.21200852 · doi.org/10.5281/zenodo.21200852 · a CITATION.cff is included