feat(compliance): MLPerf TEST04 caching audit — extensible AuditTest framework#332
wu6u3tw wants to merge 13 commits into
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
Code Review
This pull request implements the MLPerf TEST04 compliance audit to detect result caching by repeatedly issuing a single fixed sample and comparing the throughput against a reference run. It introduces configuration options, validation guards, a SingleSampleOrder generator, and a compliance verification module with a CLI tool and tests. The review feedback focuses on improving the robustness of the compliance verifier, specifically by handling potential OSError exceptions during file writes, catching AttributeError when parsing non-dictionary JSON configurations, and gracefully handling malformed snapshot files during parsing.
…review
Address gemini-code-assist review on PR mlcommons#332:
- CLI catches OSError (PermissionError etc.) and write_verdict failures, not just FileNotFoundError/ValueError; all map to exit 2.
- _audit_marker tolerates non-dict results.json (isinstance guards) instead of raising AttributeError.
- _run_stats_from_dir rejects a non-dict snapshot with a clear ValueError.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Update summary
All review feedback has been addressed. Here is what changed since the original submission:
Architecture (main concern)
Config shape
audit: "test04"
datasets:
  - name: wan22_prompts
    path: wan22_prompts.jsonl
    type: "performance"
    samples: 50  # reference phase query count (50–144)
  - name: wan22_audit
    path: wan22_prompts.jsonl
    type: "audit"
    samples: 25  # audit phase query count (25–50)
    audit_sample_index: 0
Robustness
Testing
Example config
9057190 to b547f1d
cdbae64 to eae1234
b190d21 to e0de06f
385630c to c1e48bf
All review feedback has been addressed. Here's a summary of what changed: Architecture Sample counts & index SingleStream Durations Robustness fixes (Gemini)
Cleanup
@@ -0,0 +1,60 @@
# Offline TEST04 Compliance Run for WAN 2.2 (GB200/GB300)
The file naming is not consistent: some files have offline/singlestream in the name, others have audit_test04. Can you fix it?
Renamed, and replaced the test name with "output-caching test" instead of using the term 'test04'.
nvzhihanj
left a comment
Review Council — first-principles design review
Reviewed by: Claude (Codex review timed out on this 2046-line diff at xhigh reasoning) · Depth: thorough
Focus: design issues warranting re-design for a modular, extensible audit-test framework (TEST04 is the first of several). 11 findings; see the tiered summary comment. The ref_samples dead-write (#1) was independently verified against the source.
Review Council: Multi-AI Code Review (first-principles design review)
Reviewed by: Claude · Depth: thorough
Framing: TEST04 is the first MLPerf compliance/audit test and is meant to become a modular, extensible framework. The findings below are design-led: what would adding the next audit (TEST01/05) cost, and where does TEST04-specific knowledge leak into general-purpose code. 11 findings, all posted inline.
🔴 Re-design / Must-fix
🟡 Should-fix
🔵 Consider
Through-line: #1, #5, #6, #7 are all symptoms of the same root cause: TEST04 is bolted onto
Dedup: none overlap existing inline comments except #9, which extends the maintainer's existing fairness thread with upstream-parity / guard-direction substance.
Ground-up redesign of the compliance/audit framework after PR mlcommons#332 review. Replaces the bolted-on TEST04 with a first-class, extensible AuditTest abstraction: a generic orchestrator runs plan_runs() phases back-to-back at a single shared sample count (fair comparison), and verify() produces a typed verdict. Maps every PR mlcommons#332 design-review finding + maintainer workflow requirement to where the design resolves it. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
d34459b to 642ec6c
642ec6c to b40b0ef
… header
Review Council (Claude) findings on PR mlcommons#332:
- examples hardcoded num_workers despite §8 claiming it was dropped; remove it (use endpoint default, per viraatc's request) so the traceability row is true
- single-stream header said '(independent counts)' but uses equal 20/20; align to '(equal counts here)' matching the offline sibling and §5/§8 framing
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… PR) Review Council (Claude, 2nd pass) on PR mlcommons#332: three §8 traceability rows described the example YAMLs as future ('land/dropped at implementation', 'plan doc only'), but the PR now ships offline_wan22_submission.yaml and single_stream_wan22_submission.yaml. Reword to reflect they're included. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…erSpec Adds SingleSampleOrder — always yields one fixed dataset index — and updates create_sample_order to switch on SampleOrderSpec.fixed_index. This is the generic load-gen seam for TEST04's fixed-sample audit phase; the load generator has no test-specific knowledge. Also fixes pre-commit hooks to use python3 (system has no 'python' symlink).
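The seam this commit describes can be sketched roughly as follows. This is a minimal illustration, not the PR's code: only the names SingleSampleOrder, SampleOrderSpec.fixed_index and create_sample_order come from the commit message; the constructor arguments, the WithoutReplacementOrder class and the seed parameter are assumptions made for the example.

```python
import random
from dataclasses import dataclass
from itertools import islice
from typing import Iterator, Optional


@dataclass(frozen=True)
class SampleOrderSpec:
    """Assumed shape: fixed_index=None means the normal shuffled order."""
    fixed_index: Optional[int] = None


class WithoutReplacementOrder:
    """Hypothetical default order: each index once per pass, reshuffled per pass."""

    def __init__(self, dataset_size: int, seed: int = 0) -> None:
        self._size = dataset_size
        self._rng = random.Random(seed)

    def __iter__(self) -> Iterator[int]:
        while True:
            indices = list(range(self._size))
            self._rng.shuffle(indices)
            yield from indices


class SingleSampleOrder:
    """Always yields the same dataset index (the fixed-sample audit phase)."""

    def __init__(self, dataset_size: int, index: int) -> None:
        if not 0 <= index < dataset_size:
            raise ValueError(f"index {index} out of range for {dataset_size} samples")
        self._index = index

    def __iter__(self) -> Iterator[int]:
        while True:
            yield self._index


def create_sample_order(spec: SampleOrderSpec, dataset_size: int):
    """Factory dispatch on the generic spec only; no audit-test id is consulted."""
    if spec.fixed_index is not None:
        return SingleSampleOrder(dataset_size, spec.fixed_index)
    return WithoutReplacementOrder(dataset_size)


# Usage: take the first five indices of a fixed-sample order.
audit_order = create_sample_order(SampleOrderSpec(fixed_index=3), dataset_size=10)
first_five = list(islice(audit_order, 5))
```

Because the factory branches only on the spec, a later audit that needs a different ordering would add another spec field or order class rather than a test-specific flag in the load generator.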
… block
Implements tasks 3-8 of the compliance-audit plan, the reviewer-requested clean-architecture redesign replacing the bolted-on TEST04 integration:
- config/schema.py: AuditTestId enum + AuditConfig sub-model + BenchmarkConfig.audit. Replaces DatasetType.AUDIT and the per-Dataset audit fields. Audit params (ref/audit sample counts, sample_index, threshold) are co-located in a structured block, parallel to AccuracyConfig.
- compliance/__init__.py: generic AuditTest protocol (plan_runs + verify), RunSpec/RunStats/RunArtifacts types, and a registry resolved by AuditTestId.
- compliance/verdict.py: AuditVerdict + atomic write_verdict (tmp->fsync->rename->fsync).
- compliance/tests/test04.py: Test04Audit emits declarative RunSpecs and verifies via verify_test04; registered with the framework.
- commands/audit.py: _run_audit orchestrator runs each RunSpec phase back-to-back through the existing setup/run path, then verifies and writes the verdict.
- commands/benchmark/execute.py: typed run_spec seam in setup_benchmark.
- commands/benchmark/cli.py: audit dispatch + exit code.
Run modification flows through the generic SampleOrderSpec (tasks 1-2) so no test-specific knowledge leaks into the load generator.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
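A rough sketch of how the protocol, registry and orchestrator listed above could fit together is shown here. The names AuditTest, plan_runs, verify, RunSpec, RunStats, AuditTestId and Test04Audit come from the commit message; every field, signature and the run_phase callback are assumptions for illustration, and verify returns a plain bool where the PR returns a typed verdict.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional, Protocol


class AuditTestId(str, Enum):
    TEST04 = "test04"


@dataclass(frozen=True)
class RunSpec:
    label: str                       # phase name, also usable as a subdirectory
    n_samples: int                   # query count for this phase
    fixed_index: Optional[int] = None  # None = normal sample order


@dataclass(frozen=True)
class RunStats:
    label: str
    qps: float


class AuditTest(Protocol):
    test_id: AuditTestId

    def plan_runs(self) -> list[RunSpec]: ...
    def verify(self, runs: list[RunStats]) -> bool: ...


_REGISTRY: dict[str, AuditTest] = {}


def register(test: AuditTest) -> None:
    _REGISTRY[test.test_id.value] = test


def resolve(test_id: AuditTestId) -> AuditTest:
    return _REGISTRY[test_id.value]


class Test04Audit:
    test_id = AuditTestId.TEST04

    def __init__(self, n_samples: int, sample_index: int = 0, threshold: float = 0.10) -> None:
        self.n_samples = n_samples
        self.sample_index = sample_index
        self.threshold = threshold

    def plan_runs(self) -> list[RunSpec]:
        # Both phases share one sample count so the comparison is fair.
        return [
            RunSpec("reference", self.n_samples),
            RunSpec("test04", self.n_samples, fixed_index=self.sample_index),
        ]

    def verify(self, runs: list[RunStats]) -> bool:
        ref, audit = runs
        # Pass iff the audit phase is at most `threshold` faster than the reference.
        return audit.qps <= ref.qps * (1.0 + self.threshold)


def run_audit(test: AuditTest, run_phase: Callable[[RunSpec], RunStats]) -> bool:
    """Generic orchestrator: run each planned phase back-to-back, then verify."""
    stats = [run_phase(spec) for spec in test.plan_runs()]
    return test.verify(stats)
```

In this shape, adding another audit means writing one more class with plan_runs and verify and registering it; run_audit itself does not change.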
… single-stream) Exercises the redesigned audit: config block → run_audit orchestrator → AuditVerdict against the echo server, parametrized over max_throughput (offline) and concurrency=1 (single-stream). Asserts both phase subdirs (reference/, test04/) are created, the verdict file is written, and run_benchmark returns an AuditVerdict. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
viraatc
left a comment
lgtm, looking forward to impl!
> **LLM nuance.** MLPerf exempts variable-length-input LLMs from TEST04 because prefix
> caching legitimately speeds up identical prompts. On an LLM endpoint, TEST04 will see
> real prefix-cache gains; the tolerance (and whether the audit run disables prefix cache)
> is a deliberate knob. We build it faithfully to the reference (±10% / ±20%) and expose
We shouldn't enable TEST04 for LLMs.
│
│ 3. verify(runs) ; 4. write_verdict (atomic)
▼
verify_TEST04.txt + audit_verdict.json
If possible, unify the output as JSON so it's easier to parse.
And do we need two files? Ideally one JSON file should be enough (containing all the information). On the traditional MLPerf side we can always use scripts to make it pass.
MLCommons run_verification.py takes a verify_.txt file that requires a "Performance check pass: True" line. We can remove it if the MLCommons run_verification script changes.
│ back-to-back, same endpoint
├─ Phase 2 "test04" ─ 64 × sample[0] ────────► RunArtifacts[1] (qps_audit)
│
├─ verdict = Test04Audit.verify([ref, audit])
Not sure verdict is the right word to use (I know AI loves this word). Maybe auditResults will be more explicit and lower the entropy of understanding
```python
class AuditTestId(str, Enum):
    TEST04 = "test04"
```
Again, recommend using an explicit name like output_caching_audit
The 30s ready-check window was too tight under load: on a contended host the metrics-aggregator/event-logger subprocesses intermittently miss it, raising "Ready check failed: 1/2 signals received within 30.0s". Surfaced while driving the GB300x16 TEST04 audit from a busy login node. 120s gives subprocess launch ample headroom without changing steady-state behaviour. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Brings the design doc (docs/compliance_audit_plan.md, rebased on the latest mlcommons#332 web edits) and the two WAN2.2 submission example configs (offline + single-stream, using the audit: block) onto the implementation branch so the redesign ships as one coherent PR: framework + tests + doc + runnable examples. The doc's exit-code contract and module layout are corrected to match this implementation: samples/audit_samples/sample_index/threshold, no standalone verifier CLI, and errors via the repo-wide handler (SetupError → 3, ExecutionError → 4) rather than a flat exit 2. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
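The exit-code contract referred to here (0 for PASS, 1 for FAIL, and the repo-wide error codes quoted in this PR) can be sketched as a small mapping. The three exception classes below are local stand-ins that merely share names with the ones mentioned in the PR; the exit_code_for helper and its callback argument are hypothetical and not the repo's actual handler.

```python
from typing import Callable


# Stand-in exception classes; the real ones live in the repository.
class InputValidationError(Exception):
    pass


class SetupError(Exception):
    pass


class ExecutionError(Exception):
    pass


# Error codes as quoted in this PR.
_ERROR_EXIT_CODES = {
    InputValidationError: 2,
    SetupError: 3,
    ExecutionError: 4,
}


def exit_code_for(run_audit: Callable[[], bool]) -> int:
    """Map an audit run to a process exit code: 0 = PASS, 1 = FAIL, 2-4 = errors."""
    try:
        passed = run_audit()
    except tuple(_ERROR_EXIT_CODES) as exc:
        # isinstance lookup so subclasses of a known error keep that error's code.
        return next(code for cls, code in _ERROR_EXIT_CODES.items() if isinstance(exc, cls))
    return 0 if passed else 1
```

Keeping FAIL (1) distinct from the error codes lets a caller tell "the audit ran and detected caching" apart from "the audit could not run".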
1806fde to ba09656
pip-audit flagged aiohttp 3.14.0 with CVE-2026-54273..54280 (all fixed in 3.14.1). aiohttp is a test-only dependency (mock HTTP server fixture). pip-audit clean after the bump. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
wu6u3tw
left a comment
Review Council — Multi-AI Code Review
Reviewed by: Claude (Codex unavailable — bwrap sandbox blocked, no sudo to relax userns) · Depth: thorough
4 findings posted inline. Verdict-correctness finding (threshold) is the one to prioritize.
- threshold (high): verify() now receives the AuditConfig and passes the configured threshold to verify_test04. Previously the protocol verify(runs) had no config access, so a non-default audit.threshold was silently ignored (always 0.10). AuditTest.verify gains a cfg parameter; run_audit passes it.
- samples=None (medium): RunSpec.n_samples is now int | None and None propagates to RuntimeSettings as the full-dataset/duration-driven default (matching the documented "None → full dataset"), instead of mapping to 0 (empty phase). n_requested is derived from the report's issued count when the spec didn't fix a count.
- verify() arity (low): explicit len(runs) == 2 guard with a clear message instead of an IndexError on runs[0]/runs[1].
- redundant except (low): except (OSError, PermissionError) → except OSError (PermissionError ⊂ OSError).
Doc updated for the new verify(runs, cfg) signature. Unit tests pass; the e2e flake under heavy local load is the worker-init timeout, green in CI.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… arity
Covers the two review fixes at the protocol-method level (the existing tests only exercised the verify_test04 pure function):
- verify() honors the configured cfg.threshold (audit 115 vs ref 100 → FAIL at 0.10, PASS at 0.20); this would have caught the dropped-threshold bug.
- verify() rejects a phase count != 2 with a clear ValueError.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
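The two behaviours these tests pin down can be illustrated with a small stand-alone check. This is a sketch under assumptions: AuditConfig and RunStats are reduced to the single fields needed here, and verify returns a bool instead of the PR's typed result; only the threshold semantics and the 115-vs-100 figures come from the commit message.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class AuditConfig:
    threshold: float = 0.10  # maximum tolerated speed-up of the audit phase


@dataclass(frozen=True)
class RunStats:
    qps: float


def verify(runs: Sequence[RunStats], cfg: AuditConfig) -> bool:
    """Pass iff the audit phase is at most cfg.threshold faster than the reference."""
    if len(runs) != 2:
        raise ValueError(f"expected exactly 2 phases (reference, audit), got {len(runs)}")
    ref, audit = runs
    if ref.qps <= 0:
        raise ValueError("reference throughput must be positive")
    return audit.qps <= ref.qps * (1.0 + cfg.threshold)


reference, audit = RunStats(qps=100.0), RunStats(qps=115.0)
strict = verify([reference, audit], AuditConfig(threshold=0.10))   # 15% faster: fails
lenient = verify([reference, audit], AuditConfig(threshold=0.20))  # within 20%: passes
```

Passing cfg explicitly is what makes the first case distinguishable from the second; a verify that ignored it would return the same answer for both.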
…caching_audit_test
Rename the audit "verdict" concept to "audit results" and the opaque "TEST04" test identifier to the descriptive "output_caching_audit_test" across code, tests, examples, and the design doc.
- compliance/verdict.py → results.py; AuditVerdict → AuditResults; write_verdict → write_audit_results; output audit_verdict.json → audit_results.json
- compliance/tests/test04.py → output_caching_audit_test.py; Test04Audit → OutputCachingAuditTest; Test04Config → OutputCachingAuditTestConfig; verify_test04 → verify_output_caching_audit_test
- AuditTestId.TEST04="test04" → OUTPUT_CACHING_AUDIT_TEST="output_caching_audit_test"; output verify_TEST04.txt → verify_OUTPUT_CACHING_AUDIT_TEST.txt; phase label/subdir renamed
- Genuine upstream MLPerf TEST04 references (test matrix, mechanism, compliance/nvidia/TEST04/audit.config) are preserved in the doc.
Also fix a latent bug surfaced by the rename: the results test_id used str(AuditTestId.X), which for a (str, Enum) yields "AuditTestId.X" rather than the value, corrupting the verify_<TEST>.txt filename and the JSON "test" field. Use .value, and replace the circular test assertion that masked it with one pinning the wire contract.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
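The latent bug described in this commit is easy to reproduce in isolation. The enum below mirrors the one named in the commit message; the verify_filename helper, and in particular deriving the file name by upper-casing the value, is an assumption made for the example.

```python
from enum import Enum


class AuditTestId(str, Enum):
    OUTPUT_CACHING_AUDIT_TEST = "output_caching_audit_test"


test_id = AuditTestId.OUTPUT_CACHING_AUDIT_TEST

# str() on a (str, Enum) member uses Enum.__str__, giving "ClassName.MEMBER".
as_str = str(test_id)

# .value is the wire value that belongs in file names and JSON fields.
as_value = test_id.value

# Equality with the raw string still holds, which is how a circular
# assertion such as `result == str(test_id)` can hide the problem.
equal_to_raw = test_id == "output_caching_audit_test"


def verify_filename(tid: AuditTestId) -> str:
    """Hypothetical helper: build the verify_<TEST>.txt name from the wire value."""
    return f"verify_{tid.value.upper()}.txt"
```

A test that pins the literal expected string, rather than recomputing it with the same expression as the code under test, catches this class of bug.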
…robust results
Address review-council findings and reconcile the design doc with the code:
- audit.py: the load-pattern guard rejected only POISSON, but LoadPatternType also has MULTI_TURN/BURST/STEP; these slipped through despite the error message and design doc promising "max_throughput or concurrency" only. Switch to an allow-list so paced/turn-sequenced loads are rejected up front.
- compliance/__init__.py: the registry keyed on str(test.test_id), which for a (str, Enum) yields "AuditTestId.OUTPUT_CACHING_AUDIT_TEST" rather than the wire value. Key on .value (accepting a raw string for offline re-checks).
- results.py: write the fixed "test"/"passed" keys last so a stray details["test"]/["passed"] cannot shadow the authoritative fields.
- execute.py: a user-interrupted (Ctrl-C) main run no longer falls through and silently runs the audit phases.
- docs/compliance_audit_plan.md: align with the code by describing the dedicated probe-load for the sample-index bounds check (not "phase 1 setup") and marking RunSpec.n_samples as int | None.
Add regression tests for registry-by-value resolution and the details-cannot-override-fixed-keys invariant.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
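The results.py invariant (fixed keys written last) combines naturally with the atomic write described earlier in this PR (tmp → fsync → rename → fsync). The sketch below shows one way to do both; the function name matches the renamed write_audit_results, but its signature and payload layout are assumptions, and the directory fsync as written is POSIX-only.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any, Mapping


def write_audit_results(path: Path, test: str, passed: bool, details: Mapping[str, Any]) -> None:
    """Write results atomically; the authoritative keys always win over details."""
    # Later keys override earlier ones, so details cannot shadow "test"/"passed".
    payload = {**details, "test": test, "passed": passed}

    path = Path(path)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(payload, f, indent=2)
            f.flush()
            os.fsync(f.fileno())       # data reaches disk before the rename
        os.replace(tmp_name, path)     # atomic swap on the same filesystem
    except BaseException:
        try:
            os.unlink(tmp_name)        # do not leave a stray temp file behind
        except FileNotFoundError:
            pass
        raise

    # fsync the directory so the rename itself is durable (POSIX only).
    dir_fd = os.open(path.parent, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Creating the temporary file in the destination directory keeps the rename on one filesystem, which is what makes os.replace atomic.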
Summary
Adds an extensible MLPerf compliance-audit framework with TEST04 (caching detection) as the first test, driven by an audit: block in the benchmark YAML. This PR carries the full redesign: the approved design plan, the implementation, tests, and runnable WAN2.2 examples.

TEST04 issues one fixed sample for every query in an audit phase; if repeating an identical request makes the SUT meaningfully faster, it is serving from cache. Pass iff the audit run is at most 10% faster than the reference (matching upstream compliance/TEST04/verify_performance.py).

Design (the two axes)
- SampleOrderSpec (WITHOUT_REPLACEMENT | SINGLE(index)) carried on a RunSpec. No test-specific knowledge leaks into the load generator.
- AuditTest.verify(runs) -> AuditVerdict, registered per test.

A generic orchestrator (commands/audit.py::run_audit) runs each RunSpec phase back-to-back via the existing setup_benchmark/run_benchmark_async path, then verifies and writes the verdict. Adding TEST01/06/07/09 later is a new registry entry, not cross-cutting edits.

Config shape
AuditConfig is a discriminated-union-ready sub-model on BenchmarkConfig (parallel to AccuracyConfig): no DatasetType.AUDIT, no audit fields polluting Dataset, no test04 boolean in RuntimeSettings.

What's included
- compliance/__init__.py: AuditTest protocol + RunSpec/RunStats/RunArtifacts + registry
- compliance/verdict.py: AuditVerdict + atomic write_verdict (tmp → fsync → rename → fsync)
- compliance/tests/test04.py: Test04Audit + verify_test04
- commands/audit.py: generic run_audit orchestrator
- config/schema.py: AuditTestId + Test04Config/AuditConfig + BenchmarkConfig.audit
- load_generator: SampleOrderSpec + SingleSampleOrder + factory dispatch
- docs/compliance_audit_plan.md: the design plan
- offline_wan22_submission.yaml, single_stream_wan22_submission.yaml

Exit codes
benchmark from-config with an audit: block exits 0 (PASS) / 1 (FAIL); errors propagate via the standard handler using the repo-wide scheme (InputValidationError → 2, SetupError → 3, ExecutionError → 4). The on-disk audit_verdict.json is the durable record.

Testing
Unit + integration green; pre-commit run --all-files clean. The e2e test exercises the full audit: → run_audit → AuditVerdict flow for both max_throughput (offline) and concurrency=1 (single-stream).

🤖 Generated with Claude Code