LLM Capacity Benchmark

A lightweight evaluation that checks whether a model and its surrounding UI respect consent and context limits.

Overview

When to use the LLM Capacity Benchmark.

Best for teams validating consent, disclosure, and context limits before an AI pilot.

Have the following ready:

  • Sample prompts or flows to benchmark.
  • Current consent or disclosure copy.
  • A stakeholder who owns model and UI decisions.

Estimated time: 30–45 minutes

Scholarly metadata

Authorship

Contact: diagnostics@ethotechnics.org

Publication details

  • Published: Dec 3, 2025
  • Last updated: Jan 9, 2026
  • Version: v1.1.0
  • DOI: Pending Zenodo deposit

License: CC BY 4.0

Attribution: credit the Ethotechnics Institute Diagnostics Lab, include the tool name and version, and link to the canonical permalink.

Archive snapshot: Wayback capture

Changelog

  • v1.1.0 · 2026-01-09 — Published method cards, transparency notes, and replicability guidance for each diagnostic.
  • v1.0.0 · 2025-12-03 — Initial diagnostics suite release.

Cite this page

Formats: APA, MLA, Chicago, BibTeX, RIS

APA

Ethotechnics Institute Diagnostics Lab. (2026). LLM Capacity Benchmark. Ethotechnics Institute. https://ethotechnics.org/diagnostics/llm-capacity-benchmark

MLA

Ethotechnics Institute Diagnostics Lab. "LLM Capacity Benchmark." Ethotechnics Institute, 2026, https://ethotechnics.org/diagnostics/llm-capacity-benchmark.

Chicago

Ethotechnics Institute Diagnostics Lab. "LLM Capacity Benchmark." Ethotechnics Institute. Last modified January 9, 2026. https://ethotechnics.org/diagnostics/llm-capacity-benchmark.

BibTeX

@misc{diagnostic_llm-capacity-benchmark,
  title={LLM Capacity Benchmark},
  author={Ethotechnics Institute Diagnostics Lab},
  year={2026},
  howpublished={Ethotechnics Institute},
  url={https://ethotechnics.org/diagnostics/llm-capacity-benchmark},
  version={v1.1.0}
}

RIS

TY  - ELEC
TI  - LLM Capacity Benchmark
AU  - Ethotechnics Institute Diagnostics Lab
PY  - 2026
UR  - https://ethotechnics.org/diagnostics/llm-capacity-benchmark
ER  -

Methodology

Method, transparency, and replicability.

Inputs, scoring logic, validation notes, and failure modes used in the benchmark.

Inputs

  • Representative prompt set and usage flows.
  • Current consent and disclosure copy.
  • Stakeholder context for model limitations and risks.

Procedure

  1. Run the prompts through the interface and capture the disclosures shown at each step.
  2. Score each consent-journey checkpoint against the rubric.
  3. Document gaps and map recommendations to mechanism language.

Outputs

  • Readiness summary highlighting consent gaps.
  • UI and governance mitigation guidance.
  • Escalation note with studio facilitation path.

Measures

  • Consent and disclosure coverage across the user journey.
  • Context boundary alignment between model behavior and UI framing.
  • User control availability and visibility in the flow.

Does not measure

  • Model accuracy, toxicity, or bias metrics.
  • Infrastructure performance or latency.
  • Legal review of terms or policy compliance.

Assumptions

  • Prompts and scenarios are representative of real use.
  • Consent copy and disclosure states are production-ready.
  • Reviewers have access to product and policy context.

Instrument prompts

  • User prompt set with context variants.
  • Disclosure checkpoints and UI states list.
  • Consent copy and opt-out flows.

Rubric

  • Consent clarity score (1–5) per checkpoint.
  • Context alignment score (1–5) for model outputs.
  • Control visibility score (1–5) for exit paths.
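
One way to record rubric scores so that out-of-range values are caught early is sketched below. The record shape, the dimension identifiers, and the example item name are illustrative assumptions; the benchmark itself only specifies the three dimensions and the 1–5 scale.

```python
from dataclasses import dataclass

# Identifiers for the three rubric dimensions listed above.
# The exact strings are an assumption made for this sketch.
VALID_DIMENSIONS = {"consent_clarity", "context_alignment", "control_visibility"}


@dataclass(frozen=True)
class RubricScore:
    dimension: str  # one of VALID_DIMENSIONS
    item: str       # hypothetical label for what was scored, e.g. "signup_disclosure"
    score: int      # 1 (weakest) to 5 (strongest)

    def __post_init__(self):
        if self.dimension not in VALID_DIMENSIONS:
            raise ValueError(f"unknown dimension: {self.dimension!r}")
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(self.score, bool) or not isinstance(self.score, int):
            raise ValueError("score must be an integer")
        if not 1 <= self.score <= 5:
            raise ValueError(f"score must be between 1 and 5, got {self.score}")


example = RubricScore("consent_clarity", "signup_disclosure", 4)
```

Validating at construction time keeps a mistyped score from silently skewing a later aggregate.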

Scoring logic

  • Aggregate checkpoint scores into readiness tiers.
  • Flag any score ≤2 as a mandatory mitigation.
  • Summarize recommendations by mechanism category.
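
The first two scoring rules can be sketched in a few lines. Only the "score ≤2" flag comes from the rules above. The tier names, the cutoffs on the mean score, and the checkpoint names are hypothetical, since the benchmark does not publish its tier thresholds here.

```python
from statistics import mean

# From the scoring logic: any score of 2 or lower is a mandatory mitigation.
MITIGATION_THRESHOLD = 2

# Hypothetical tier cutoffs on the mean score (assumption, not the published rubric).
TIER_CUTOFFS = [(4.0, "ready"), (3.0, "ready with mitigations")]
LOWEST_TIER = "not ready"


def summarize(scores):
    """Aggregate {checkpoint: score} into a tier plus mandatory mitigations."""
    if not scores:
        raise ValueError("at least one checkpoint score is required")
    average = mean(scores.values())
    # Pick the first tier whose cutoff the mean reaches, else the lowest tier.
    tier = next((label for cutoff, label in TIER_CUTOFFS if average >= cutoff), LOWEST_TIER)
    flagged = sorted(name for name, score in scores.items() if score <= MITIGATION_THRESHOLD)
    return {"tier": tier, "mean": round(average, 2), "mandatory_mitigations": flagged}


# Illustrative checkpoint names and scores.
result = summarize({"signup_disclosure": 4, "data_sharing_prompt": 2, "opt_out_link": 5})
```

With these illustrative inputs the mean is about 3.67, so the sketch returns the middle tier and flags `data_sharing_prompt`. A team might reasonably choose to cap the tier whenever any mitigation is flagged; the sketch leaves the two signals separate.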

Validation notes

Trialed on early-stage AI pilots and consent-heavy workflows to refine the rubric language.

Paired reviewers reconcile scores in a short calibration session; discrepancies drop after alignment.

Failure modes

  • Scoring drifts if reviewers lack model context.
  • Missing edge cases can inflate readiness scores.
  • UI copy changes after scoring can invalidate results.

Replicability

  • Compile prompt set and UI flow map.
  • Run the rubric with two reviewers.
  • Document scores and differences in a shared sheet.
  • Publish the summary with linked mechanism recommendations.
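
The two-reviewer step above can be supported by a small script that lines up both score sets and exports the differences for a shared sheet. This is a minimal sketch: the function names, column names, and checkpoint names are assumptions, and CSV is just one convenient format for pasting into a spreadsheet.

```python
import csv
import io

COLUMNS = ["checkpoint", "reviewer_a", "reviewer_b", "difference"]


def score_differences(reviewer_a, reviewer_b):
    """Pair two {checkpoint: score} dicts and compute absolute differences."""
    rows = []
    for checkpoint in sorted(set(reviewer_a) | set(reviewer_b)):
        a = reviewer_a.get(checkpoint)
        b = reviewer_b.get(checkpoint)
        # Leave the difference empty when one reviewer skipped the checkpoint.
        difference = abs(a - b) if a is not None and b is not None else None
        rows.append({"checkpoint": checkpoint, "reviewer_a": a,
                     "reviewer_b": b, "difference": difference})
    return rows


def to_csv(rows):
    """Render the rows as CSV text; missing values become empty cells."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Illustrative scores from two reviewers.
rows = score_differences(
    {"signup_disclosure": 4, "opt_out_link": 2},
    {"signup_disclosure": 3, "data_sharing_prompt": 5},
)
sheet = to_csv(rows)
```

Rows with a large or missing difference are natural agenda items for the calibration session.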

Example outputs

  • Consent checkpoint scorecard with mitigation notes.
  • Anonymized readiness summary deck excerpt.

Sample output

Preview a benchmark summary.

Review the readiness scorecard and mitigation notes.

Request via Studio

Schedule a facilitated benchmark.

We run the benchmark with you and deliver the linked readout.