Methodology

How the rubric works

The Transparent AI Index grades labs on the transparency of their per-query energy and water disclosures. It does not grade environmental performance.

What this scorecard measures

A lab can publish figures showing high consumption and still earn an A. A lab can run an extremely efficient model and still earn an F if it publishes nothing. Disclosure is the precondition for accountability, comparison, and improvement. Until labs publish measured, methodology-backed per-query figures, users have no way to make informed choices about the environmental cost of the queries they send, and policymakers have no defensible basis for regulation.

The framing is borrowed from nutrition labeling. A calorie count on a menu does not tell you whether a meal is healthy; it puts the relevant information in front of the diner, in a comparable form, so the diner can decide. That is the bar this scorecard applies to AI inference.

The rubric

Criterion | Weight | What earns full marks
--- | --- | ---
Per-query figures published | 25% | Specific energy (Wh) and water (mL) numbers per inference query, tied to a named, current model.
Methodology disclosed | 25% | A technical document explains what is measured, what is excluded, the measurement boundary, and how figures were derived.
External audit | 20% | Figures are independently audited or peer-reviewed by a credentialed third party.
Recency | 15% | Disclosure covers the lab's current flagship model and was updated within the last 12 months.
Scope coverage | 15% | Covers text, image, and video generation as well as reasoning vs. standard modes, and acknowledges training, networking, and water consumed in power generation.
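
Read mechanically, the weights combine per-criterion sub-scores into a 0–100 total. A minimal sketch in Python, assuming a grader assigns each criterion a 0–100 sub-score (the key and function names are illustrative, not a published schema):

```python
# Rubric weights as published above. Sub-scores are hypothetical 0-100
# judgments a grader assigns for each criterion.
WEIGHTS = {
    "per_query_figures": 0.25,
    "methodology": 0.25,
    "external_audit": 0.20,
    "recency": 0.15,
    "scope_coverage": 0.15,
}

def weighted_total(sub_scores: dict[str, float]) -> float:
    """Combine per-criterion sub-scores (each 0-100) into a 0-100 total."""
    assert set(sub_scores) == set(WEIGHTS), "score every criterion exactly once"
    return sum(WEIGHTS[c] * s for c, s in sub_scores.items())
```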

Grade thresholds

Grade | Threshold
--- | ---
A | 85–100. Measured per-query figures, full methodology, recent, broad scope.
B | 70–84. Measured or rigorously modeled figures, but with material gaps in scope, recency, or audit.
C | 55–69. Some quantitative disclosure, but methodology incomplete or scope narrow.
D | 40–54. Marketing-grade claims only: a single number with no methodology, no model attribution, or no audit.
F | Below 40. No per-query figures of any kind. Corporate ESG reporting does not earn credit.
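
The bands translate directly into a lookup. Continuing the sketch above (the function name and the example lab are illustrative):

```python
def letter_grade(total: float) -> str:
    """Map a 0-100 weighted total onto the grade bands above."""
    if total >= 85:
        return "A"
    if total >= 70:
        return "B"
    if total >= 55:
        return "C"
    if total >= 40:
        return "D"
    return "F"

# Hypothetical lab: measured figures (90) and solid methodology (85),
# no audit (0), current disclosure (100), narrow scope (40):
# 0.25*90 + 0.25*85 + 0.20*0 + 0.15*100 + 0.15*40 = 64.75 -> "C"
print(letter_grade(64.75))  # C
```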

What counts as a disclosure

A disclosure must come from the lab itself and meet all of the following: it names a specific model; states quantitative figures in standard units (Wh, mL/L, gCO2e); defines the measurement boundary; and is dated and tied to a specific time window.
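
As a sketch of how those four conditions could be checked mechanically, assuming a hypothetical record type (none of these field names come from any lab's published schema):

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class Disclosure:
    model_name: str               # names a specific model
    energy_wh: Optional[float]    # per-query energy, Wh
    water_ml: Optional[float]     # per-query water, mL
    boundary: str                 # defines the measurement boundary
    published: Optional[date]     # date of the disclosure
    window_start: Optional[date]  # start of the covered time window
    window_end: Optional[date]    # end of the covered time window

def qualifies(d: Disclosure) -> bool:
    """True only if all four conditions above hold."""
    has_figures = d.energy_wh is not None or d.water_ml is not None
    dated = (d.published is not None
             and d.window_start is not None
             and d.window_end is not None)
    return bool(d.model_name) and has_figures and bool(d.boundary) and dated

# A quantitative figure with no stated measurement boundary does not qualify.
print(qualifies(Disclosure("ExampleModel-1", 0.3, 250.0, "", date(2025, 8, 1),
                           date(2025, 5, 1), date(2025, 7, 31))))  # False
```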

What this scorecard excludes

- Training energy and water. Inference is what users directly drive; training is a separate question.
- Corporate-level ESG reports. Per-query specificity is the bar.
- Hardware embodied carbon. Important, but not currently disclosed by any lab in a per-query form.
- Downstream device and network energy. Excluded for consistency with how labs themselves report.

Limitations

Methodologies across labs are not directly comparable. Google's median text prompt and Mistral's full-page generation measure different functional units. The grades reward the existence and quality of disclosure, not the headline number. Modeled estimates carry their own uncertainty and should be read as ranges. This is v1; the rubric will evolve as more labs disclose and as standards bodies converge.