feat(compliance): MLPerf TEST04 caching audit — extensible AuditTest framework#332

Open
wu6u3tw wants to merge 13 commits into mlcommons:main from wu6u3tw:feat/test04-compliance

@wu6u3tw wu6u3tw commented Jun 3, 2026

Collaborator

Summary

Adds an extensible MLPerf compliance-audit framework with TEST04 (caching detection) as the first test, driven by an audit: block in the benchmark YAML. This PR carries the full redesign: the approved design plan, the implementation, tests, and runnable WAN2.2 examples.

TEST04 issues one fixed sample for every query in an audit phase; if repeating an identical request makes the SUT meaningfully faster, it is serving from cache. Pass iff the audit run is at most 10% faster than the reference (matching upstream compliance/TEST04/verify_performance.py).
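
The pass criterion can be sketched as a small pure function. This is only an illustration under stated assumptions, not the PR's verifier: the name `audit_passes` and its arguments are hypothetical, and the strict `<` comparison follows the `threshold` comment in the config example rather than any confirmed source code.

```python
def audit_passes(ref_qps: float, audit_qps: float, threshold: float = 0.10) -> bool:
    """Return True when the fixed-sample audit run is not suspiciously fast.

    Hypothetical helper: ref_qps is the reference-phase throughput and
    audit_qps is the throughput measured while repeating one fixed sample.
    """
    if ref_qps <= 0:
        raise ValueError("reference QPS must be positive")
    # A cache would make the repeated request faster, so the audit run must
    # stay below ref_qps * (1 + threshold) to pass.
    return audit_qps < ref_qps * (1.0 + threshold)
```

With a reference of 100 QPS, an audit run at 105 QPS passes at the default threshold, while 115 QPS fails at 0.10 and passes only if the threshold is raised to 0.20.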

Design (the two axes)

  • Axis A — run modification: expressed as a generic typed SampleOrderSpec (WITHOUT_REPLACEMENT | SINGLE(index)) carried on a RunSpec. No test-specific knowledge leaks into the load generator.
  • Axis B — verification: a pure post-run check, AuditTest.verify(runs) -> AuditVerdict, registered per test.
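
Axis A can be pictured as a spec that selects a sample-order strategy without naming any audit test. The sketch below is an assumption-laden illustration: `SampleOrderSpec` and `create_sample_order` are names used in this PR, but the field layout (`fixed_index`), the `seed` parameter, and the helper `_shuffled_epochs` are hypothetical stand-ins.

```python
import random
from dataclasses import dataclass
from itertools import islice, repeat
from typing import Iterator, Optional


@dataclass(frozen=True)
class SampleOrderSpec:
    """Hypothetical spec: fixed_index=None means sample without replacement."""

    fixed_index: Optional[int] = None


def _shuffled_epochs(dataset_size: int, seed: int) -> Iterator[int]:
    # Visit every index once per epoch, reshuffling between epochs.
    rng = random.Random(seed)
    while True:
        order = list(range(dataset_size))
        rng.shuffle(order)
        yield from order


def create_sample_order(
    spec: SampleOrderSpec, dataset_size: int, seed: int = 0
) -> Iterator[int]:
    """Return an endless iterator of dataset indices for the given spec."""
    if dataset_size <= 0:
        raise ValueError("dataset must not be empty")
    if spec.fixed_index is not None:
        if not 0 <= spec.fixed_index < dataset_size:
            raise ValueError(
                f"fixed_index {spec.fixed_index} is outside [0, {dataset_size})"
            )
        # SINGLE(index): every query reuses the same sample.
        return repeat(spec.fixed_index)
    return _shuffled_epochs(dataset_size, seed)


# First five indices of a fixed-sample order over a 10-item dataset.
first_five = list(islice(create_sample_order(SampleOrderSpec(fixed_index=3), 10), 5))
```

The load generator only branches on the spec, so a later audit that needs a different ordering would add a new spec variant rather than a per-test flag.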

A generic orchestrator (commands/audit.py::run_audit) runs each RunSpec phase back-to-back via the existing setup_benchmark / run_benchmark_async path, then verifies and writes the verdict. Adding TEST01/06/07/09 later is a new registry entry, not cross-cutting edits.
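
A rough sketch of how the protocol, registry, and orchestrator could fit together is shown here. It is not the PR's code: the field names on `RunSpec`, `RunArtifacts`, and `AuditVerdict`, the `register` helper, the `run_phase` callback, and the `FixedSampleAudit` class are all assumptions made for illustration, and the real `run_audit` drives `setup_benchmark` / `run_benchmark_async` instead of a callback.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Protocol


@dataclass(frozen=True)
class RunSpec:
    label: str
    n_samples: Optional[int]
    fixed_index: Optional[int] = None  # None keeps the normal sample order


@dataclass(frozen=True)
class RunArtifacts:
    label: str
    qps: float


@dataclass(frozen=True)
class AuditVerdict:
    test: str
    passed: bool


class AuditTest(Protocol):
    test_id: str

    def plan_runs(self) -> list[RunSpec]: ...

    def verify(self, runs: list[RunArtifacts]) -> AuditVerdict: ...


REGISTRY: dict[str, AuditTest] = {}


def register(test: AuditTest) -> None:
    REGISTRY[test.test_id] = test


def run_audit(
    test_id: str, run_phase: Callable[[RunSpec], RunArtifacts]
) -> AuditVerdict:
    """Run every planned phase back-to-back, then verify the collected runs."""
    test = REGISTRY[test_id]
    runs = [run_phase(spec) for spec in test.plan_runs()]
    return test.verify(runs)


class FixedSampleAudit:
    """Hypothetical TEST04-style entry: reference phase, then fixed-sample phase."""

    test_id = "test04"

    def plan_runs(self) -> list[RunSpec]:
        return [RunSpec("reference", 64), RunSpec("test04", 64, fixed_index=3)]

    def verify(self, runs: list[RunArtifacts]) -> AuditVerdict:
        ref, audit = runs
        return AuditVerdict(self.test_id, audit.qps < ref.qps * 1.10)


register(FixedSampleAudit())
```

Under this shape, adding another audit means writing one more class and calling `register`, which is the "new registry entry" property the design aims for.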

Config shape

audit:
  test: test04
  samples: 64         # reference phase query count
  audit_samples: 64   # audit (fixed-sample) phase count
  sample_index: 3     # MLCommons performance_issue_same_index
  threshold: 0.10     # audit qps must stay < ref qps * (1 + threshold)

AuditConfig is a discriminated-union-ready sub-model on BenchmarkConfig (parallel to AccuracyConfig) — no DatasetType.AUDIT, no audit fields polluting Dataset, no test04 boolean in RuntimeSettings.
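
To make the shape of the `audit:` block concrete, here is a minimal validation sketch using only the standard library. The project's actual `AuditConfig` is a sub-model on `BenchmarkConfig`; the class name `AuditConfigSketch`, the defaults, the validation rules, and `parse_audit_block` below are assumptions for illustration only.

```python
from dataclasses import dataclass, fields
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class AuditConfigSketch:
    """Hypothetical stand-in for the audit: block; defaults are assumed."""

    test: str
    samples: Optional[int] = None        # reference phase query count
    audit_samples: Optional[int] = None  # fixed-sample phase query count
    sample_index: int = 0
    threshold: float = 0.10

    def __post_init__(self) -> None:
        if self.test != "test04":
            raise ValueError(f"unknown audit test: {self.test!r}")
        for name in ("samples", "audit_samples"):
            count = getattr(self, name)
            if count is not None and count <= 0:
                raise ValueError(f"{name} must be positive when set")
        if self.sample_index < 0:
            raise ValueError("sample_index must be non-negative")
        if self.threshold <= 0:
            raise ValueError("threshold must be positive")


def parse_audit_block(raw: Mapping[str, Any]) -> AuditConfigSketch:
    """Reject unknown keys so stale per-test flags fail loudly."""
    known = {f.name for f in fields(AuditConfigSketch)}
    unknown = set(raw) - known
    if unknown:
        raise ValueError(f"unknown audit keys: {sorted(unknown)}")
    if "test" not in raw:
        raise ValueError("audit.test is required")
    return AuditConfigSketch(**raw)
```

Keeping every audit parameter in one block means the dataset and runtime models need no audit-specific fields.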

What's included

  • compliance/__init__.py: AuditTest protocol + RunSpec/RunStats/RunArtifacts + registry
  • compliance/verdict.py: AuditVerdict + atomic write_verdict (tmp → fsync → rename → fsync)
  • compliance/tests/test04.py: Test04Audit + verify_test04
  • commands/audit.py: generic run_audit orchestrator
  • config/schema.py: AuditTestId + Test04Config/AuditConfig + BenchmarkConfig.audit
  • load_generator: SampleOrderSpec + SingleSampleOrder + factory dispatch
  • Unit tests + e2e integration test (offline + single-stream) against the echo server
  • docs/compliance_audit_plan.md: the design plan
  • WAN2.2 submission examples: offline_wan22_submission.yaml, single_stream_wan22_submission.yaml
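
The tmp → fsync → rename → fsync sequence listed for `write_verdict` is a common durable-write pattern. The sketch below shows one way to do it with the standard library; the function name `write_json_atomically` and its exact error handling are assumptions, not the PR's implementation.

```python
import json
import os
import tempfile
from pathlib import Path


def write_json_atomically(path: Path, payload: dict) -> None:
    """Write payload to path so readers see the old file or the new one, never a partial one."""
    directory = str(path.parent)
    # 1. tmp: create the temporary file in the destination directory so the
    #    final rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=directory, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(payload, tmp, indent=2, sort_keys=True)
            tmp.flush()
            # 2. fsync: force the file contents to disk before renaming.
            os.fsync(tmp.fileno())
        # 3. rename: atomically replace any existing file.
        os.replace(tmp_name, str(path))
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    # 4. fsync the directory so the rename itself is durable (POSIX only).
    if hasattr(os, "O_DIRECTORY"):
        dir_fd = os.open(directory, os.O_RDONLY | os.O_DIRECTORY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

Calling `write_json_atomically(Path("audit_verdict.json"), {"test": "test04", "passed": True})` leaves either the previous file or the complete new one on disk, even if the process is killed mid-write.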

Exit codes

benchmark from-config with an audit: block exits 0 (PASS) / 1 (FAIL); errors propagate via the standard handler using the repo-wide scheme (InputValidationError → 2, SetupError → 3, ExecutionError → 4). The on-disk audit_verdict.json is the durable record.
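
The exit-code contract can be summarized in a few lines. In this sketch the three exception classes are local stand-ins that reuse the names above, and `exit_code_for` is a hypothetical helper; the repository's own handler is not shown in this conversation.

```python
from typing import Callable


class InputValidationError(Exception):
    """Stand-in for the project's input validation error."""


class SetupError(Exception):
    """Stand-in for the project's setup error."""


class ExecutionError(Exception):
    """Stand-in for the project's execution error."""


_ERROR_EXIT_CODES = {InputValidationError: 2, SetupError: 3, ExecutionError: 4}


def exit_code_for(run: Callable[[], bool]) -> int:
    """Map an audit run to a process exit code: 0 PASS, 1 FAIL, 2-4 errors."""
    try:
        passed = run()
    except (InputValidationError, SetupError, ExecutionError) as exc:
        for error_type, code in _ERROR_EXIT_CODES.items():
            if isinstance(exc, error_type):
                return code
        raise
    return 0 if passed else 1
```

Exit code 1 is reserved for a completed audit that failed, so automation can tell "caching detected" apart from "the run could not be performed".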

Testing

Unit + integration green; pre-commit run --all-files clean. The e2e test exercises the full audit: → run_audit → AuditVerdict flow for both max_throughput (offline) and concurrency=1 (single-stream).

🤖 Generated with Claude Code

@wu6u3tw wu6u3tw requested a review from a team June 3, 2026 20:53
github-actions Bot commented Jun 3, 2026

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request implements the MLPerf TEST04 compliance audit to detect result caching by repeatedly issuing a single fixed sample and comparing the throughput against a reference run. It introduces configuration options, validation guards, a SingleSampleOrder generator, and a compliance verification module with a CLI tool and tests. The review feedback focuses on improving the robustness of the compliance verifier, specifically by handling potential OSError exceptions during file writes, catching AttributeError when parsing non-dictionary JSON configurations, and gracefully handling malformed snapshot files during parsing.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread src/inference_endpoint/compliance/__main__.py Outdated
Comment thread src/inference_endpoint/compliance/test04.py Outdated
Comment thread src/inference_endpoint/compliance/test04.py Outdated
wu6u3tw added a commit to wu6u3tw/endpoints that referenced this pull request Jun 3, 2026
…review

Address gemini-code-assist review on PR mlcommons#332:
- CLI catches OSError (PermissionError etc.) and write_verdict failures,
  not just FileNotFoundError/ValueError — all map to exit 2.
- _audit_marker tolerates non-dict results.json (isinstance guards) instead
  of raising AttributeError.
- _run_stats_from_dir rejects a non-dict snapshot with a clear ValueError.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@wu6u3tw wu6u3tw requested review from arekay-nv and nv-alicheng June 3, 2026 22:19
Comment thread examples/09_Wan22_VideoGen_Example/wan22_audit_test04.yaml Outdated
Comment thread examples/09_Wan22_VideoGen_Example/offline_wan22_test04.yaml Outdated
Comment thread examples/09_Wan22_VideoGen_Example/offline_wan22_test04.yaml Outdated
wu6u3tw commented Jun 4, 2026

Collaborator Author

Update summary

All review feedback has been addressed. Here is what changed since the original submission:

Architecture (main concern)

  • audit: test04 now runs both phases in a single command — reference run then audit run back-to-back against the same endpoint, with automatic comparison and verdict output. No more 3-step manual workflow.

Config shape

  • type: audit dataset replaces the old settings.runtime.test04_sample_index and audit_n_samples runtime variables. Reference and audit sample counts are now independent and co-located with the dataset config — consistent with how type: accuracy datasets carry their own accuracy_config.
audit: "test04"

datasets:
  - name: wan22_prompts
    path: wan22_prompts.jsonl
    type: "performance"
    samples: 50          # reference phase query count (50–144)

  - name: wan22_audit
    path: wan22_prompts.jsonl
    type: "audit"
    samples: 25          # audit phase query count (25–50)
    audit_sample_index: 0

Robustness

  • Warning logged when audit: test04 is set but no type: audit dataset is present (previously silent fallback to index 0).
  • Phase failures (SetupError/ExecutionError) are caught and logged cleanly — no unhandled traceback, verdict not lost.
  • Report.from_snapshot wrapped in try/except in _run_stats_from_dir — malformed snapshots exit with code 2 instead of crashing.
  • Pre-flight audit_sample_index bounds check before dataset load.

Testing

  • New e2e integration test (test_audit_test04_two_phase_flow) exercises the full run_benchmark → two-phase flow against the echo server and asserts both phase subdirs are created and the flow completes gracefully.

Example config

  • Renamed offline_wan22_test04.yaml → wan22_audit_test04.yaml per review suggestion.

@wu6u3tw wu6u3tw force-pushed the feat/test04-compliance branch from 9057190 to b547f1d Compare June 4, 2026 23:14
@wu6u3tw wu6u3tw force-pushed the feat/test04-compliance branch 2 times, most recently from cdbae64 to eae1234 Compare June 4, 2026 23:40
Comment thread examples/09_Wan22_VideoGen_Example/wan22_audit_test04.yaml Outdated
Comment thread examples/09_Wan22_VideoGen_Example/wan22_audit_test04.yaml Outdated
Comment thread examples/09_Wan22_VideoGen_Example/wan22_audit_test04.yaml Outdated
Comment thread tests/unit/compliance/test_audit_test04.py Outdated
Comment thread README.md Outdated
@wu6u3tw wu6u3tw force-pushed the feat/test04-compliance branch 3 times, most recently from b190d21 to e0de06f Compare June 5, 2026 21:03
Comment thread examples/09_Wan22_VideoGen_Example/offline_wan22.yaml
@wu6u3tw wu6u3tw force-pushed the feat/test04-compliance branch 2 times, most recently from 385630c to c1e48bf Compare June 5, 2026 21:22
wu6u3tw commented Jun 5, 2026

Collaborator Author

All review feedback has been addressed. Here's a summary of what changed:

Architecture
audit: test04 now runs reference and audit phases in a single command back-to-back against the same endpoint — no more 3-step workflow, no endpoint-change risk. A single type: audit dataset entry drives both phases (carrying ref_samples, audit_samples, audit_sample_index).

Sample counts & index
ref_samples: 50, audit_samples: 25 — sized for WAN2.2 throughput. audit_sample_index: 3 — fixed per MLCommons audit.config (performance_issue_same_index=3 for WAN2.2).

SingleStream
Added wan22_single_stream_test04.yaml (concurrency=1, ref/audit samples=20 matching MLCommons min_query_count).

Durations
Perf configs: min=10min, max=4hr. Audit configs: min=10min, max=2hr. The 10-min minimum documents MLCommons compliance intent; counts take priority in the current session stop logic, with AND-semantics available as a future improvement.

Robustness fixes (Gemini)

  • write_verdict moved inside try-except in CLI
  • _audit_marker uses isinstance guards — no AttributeError possible
  • Report.from_snapshot wrapped in try/except (KeyError, TypeError) in _run_stats_from_dir

Cleanup

  • Test renamed to test_audit_test04.py
  • README.md removed from diff (rebased onto main)
  • Orphaned type: audit datasets in non-TEST04 configs now emit a warning; multiple audit datasets raise InputValidationError

@@ -0,0 +1,60 @@
# Offline TEST04 Compliance Run for WAN 2.2 (GB200/GB300)

Collaborator

The file naming is not consistent: some names use offline/singlestream, some use audit_test04. Can you fix it?

Collaborator Author

Renamed, and replaced the term 'test04' in the test name with 'output-caching test'.

@nvzhihanj nvzhihanj left a comment

Collaborator

Review Council — first-principles design review

Reviewed by: Claude (Codex review timed out on this 2046-line diff at xhigh reasoning) · Depth: thorough

Focus: design issues warranting re-design for a modular, extensible audit-test framework (TEST04 is the first of several). 11 findings; see the tiered summary comment. The ref_samples dead-write (#1) was independently verified against the source.

Comment thread src/inference_endpoint/commands/benchmark/execute.py Outdated
Comment thread src/inference_endpoint/compliance/__init__.py Outdated
Comment thread src/inference_endpoint/config/schema.py Outdated
Comment thread src/inference_endpoint/config/runtime_settings.py Outdated
Comment thread src/inference_endpoint/commands/benchmark/execute.py Outdated
Comment thread tests/integration/commands/test_benchmark_command.py Outdated
Comment thread src/inference_endpoint/config/schema.py Outdated
Comment thread src/inference_endpoint/compliance/test04.py Outdated
Comment thread src/inference_endpoint/compliance/test04.py Outdated
Comment thread src/inference_endpoint/commands/benchmark/execute.py Outdated
@nvzhihanj

Collaborator

Review Council — Multi-AI Code Review (first-principles design review)

Reviewed by: Claude · Depth: thorough
Codex review timed out on this 2046-line diff at xhigh reasoning (the load-gen + compliance surface is large); this pass is Claude-led. The one HIGH bug below was independently verified against the source.

Framing: TEST04 is the first MLPerf compliance/audit test and is meant to become a modular, extensible framework. The findings below are design-led — what would adding the next audit (TEST01/05) cost, and where does TEST04-specific knowledge leak into general-purpose code. 11 findings, all posted inline.

🔴 Re-design / Must-fix

| # | File | Line | Cat | Why it needs a re-design |
|---|------|------|-----|--------------------------|
| 1 | commands/benchmark/execute.py | 1151 | bug | ref_samples is a dead write. Dataset.samples is consumed nowhere; ref_config never sets n_samples_to_issue, so the reference phase runs duration-driven and ignores ref_samples while the audit phase honors audit_samples → the compared phases run mismatched counts. Set n_samples_to_issue=ref_samples. |
| 2 | compliance/__init__.py | 18 | design | No AuditTest abstraction. run_benchmark hardcodes if audit==TEST04; package exports only test04_*. Adding TEST01/05 = cross-cutting edits everywhere. Introduce an AuditTest protocol (plan_runs+verify) registered by AuditMode. |
| 3 | config/schema.py | 82 | design | DatasetType.AUDIT is a fake dataset type the loader ignores, carrying test params on the shared Dataset model, then converted to PERFORMANCE. Move params to a structured audit: block; drop the fake type. |
| 4 | config/runtime_settings.py | 90 | design | test04 boolean leaks into core load-gen. RuntimeSettings.test04/test04_sample_index + create_sample_order's if settings.test04. Use a generic sample-order strategy selector, not a per-test flag. |

🟡 Should-fix

| # | File | Line | Cat | Summary |
|---|------|------|-----|---------|
| 5 | commands/benchmark/execute.py | 113 | design | _OVERRIDE_TEST04_SAMPLE_INDEX stringly-typed magic kwarg through **runtime_overrides; pass a typed run_spec instead. |
| 6 | commands/benchmark/execute.py | 1146 | design | Two-phase model_copy surgery is fragile (root cause of #1; ref phase also skips _validate_audit_test04). Use a declarative RunSpec + validate before any phase runs. |
| 7 | tests/integration/commands/test_benchmark_command.py | 209 | testing | _run_benchmark_test04 has no unit test; the one integration test asserts verdict OR error with min_duration_ms=0 — the regime that hides bug #1. |
| 8 | config/schema.py | 666 | design | audit bare top-level enum; params scattered, threshold hardcoded. Use a structured compliance sub-config (like accuracy_config). |
| 9 | compliance/test04.py | 206 | design | QPS compared across phases with different counts/contents (upstream TEST04 uses the same query set); completion guard only protects the FAIL direction. Extends the existing fairness thread; compounded by #1. |

🔵 Consider

| # | File | Line | Cat | Summary |
|---|------|------|-----|---------|
| 10 | compliance/test04.py | 175 | design | verify_test04_dirs vs verify_test04_from_reports duplication; dir-swap guard in one path only. Collapse to one core + thin adapters. |
| 11 | commands/benchmark/execute.py | 446 | bug | audit_sample_index bound-checked vs requested counts, not the loaded dataset size, until phase 2 — an out-of-range index wastes a full reference run. |

Through-line: #1, #5, #6, #7 are all symptoms of the same root cause — TEST04 is bolted onto run_benchmark via per-phase config surgery and untyped overrides instead of a first-class audit-test abstraction (#2). Fixing #2/#3/#4 (an AuditTest that emits typed RunSpecs + a generic ordering strategy) would dissolve most of the others structurally.

Dedup: none overlap existing inline comments except #9, which extends the maintainer's existing fairness thread with upstream-parity / guard-direction substance.

Comment thread examples/09_Wan22_VideoGen_Example/single_stream_wan22.yaml Outdated
wu6u3tw added a commit to wu6u3tw/endpoints that referenced this pull request Jun 10, 2026
Ground-up redesign of the compliance/audit framework after PR mlcommons#332 review.
Replaces the bolted-on TEST04 with a first-class, extensible AuditTest
abstraction: a generic orchestrator runs plan_runs() phases back-to-back at a
single shared sample count (fair comparison), and verify() produces a typed
verdict. Maps every PR mlcommons#332 design-review finding + maintainer workflow
requirement to where the design resolves it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@wu6u3tw wu6u3tw force-pushed the feat/test04-compliance branch from d34459b to 642ec6c Compare June 10, 2026 22:20
@wu6u3tw wu6u3tw force-pushed the feat/test04-compliance branch from 642ec6c to b40b0ef Compare June 10, 2026 22:46
wu6u3tw added a commit to wu6u3tw/endpoints that referenced this pull request Jun 11, 2026
… header

Review Council (Claude) findings on PR mlcommons#332:
- examples hardcoded num_workers despite §8 claiming it was dropped; remove it
  (use endpoint default, per viraatc's request) so the traceability row is true
- single-stream header said '(independent counts)' but uses equal 20/20; align
  to '(equal counts here)' matching the offline sibling and §5/§8 framing

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
wu6u3tw added a commit to wu6u3tw/endpoints that referenced this pull request Jun 11, 2026
… PR)

Review Council (Claude, 2nd pass) on PR mlcommons#332: three §8 traceability rows
described the example YAMLs as future ('land/dropped at implementation',
'plan doc only'), but the PR now ships offline_wan22_submission.yaml and
single_stream_wan22_submission.yaml. Reword to reflect they're included.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
wu6u3tw and others added 4 commits June 11, 2026 13:55
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…erSpec

Adds SingleSampleOrder — always yields one fixed dataset index — and
updates create_sample_order to switch on SampleOrderSpec.fixed_index.
This is the generic load-gen seam for TEST04's fixed-sample audit phase;
the load generator has no test-specific knowledge.

Also fixes pre-commit hooks to use python3 (system has no 'python' symlink).
… block

Implements tasks 3-8 of the compliance-audit plan — the reviewer-requested
clean-architecture redesign replacing the bolted-on TEST04 integration:

- config/schema.py: AuditTestId enum + AuditConfig sub-model + BenchmarkConfig.audit.
  Replaces DatasetType.AUDIT and the per-Dataset audit fields. Audit params
  (ref/audit sample counts, sample_index, threshold) are co-located in a
  structured block, parallel to AccuracyConfig.
- compliance/__init__.py: generic AuditTest protocol (plan_runs + verify),
  RunSpec/RunStats/RunArtifacts types, and a registry resolved by AuditTestId.
- compliance/verdict.py: AuditVerdict + atomic write_verdict (tmp->fsync->rename->fsync).
- compliance/tests/test04.py: Test04Audit emits declarative RunSpecs and verifies
  via verify_test04; registered with the framework.
- commands/audit.py: _run_audit orchestrator runs each RunSpec phase back-to-back
  through the existing setup/run path, then verifies and writes the verdict.
- commands/benchmark/execute.py: typed run_spec seam in setup_benchmark.
- commands/benchmark/cli.py: audit dispatch + exit code.

Run modification flows through the generic SampleOrderSpec (tasks 1-2) so no
test-specific knowledge leaks into the load generator.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… single-stream)

Exercises the redesigned audit: config block → run_audit orchestrator →
AuditVerdict against the echo server, parametrized over max_throughput
(offline) and concurrency=1 (single-stream). Asserts both phase subdirs
(reference/, test04/) are created, the verdict file is written, and
run_benchmark returns an AuditVerdict.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@viraatc viraatc left a comment

Collaborator

lgtm, looking forward to impl!

Comment thread docs/compliance_audit_plan.md Outdated
> **LLM nuance.** MLPerf exempts variable-length-input LLMs from TEST04 because prefix
> caching legitimately speeds up identical prompts. On an LLM endpoint, TEST04 will see
> real prefix-cache gains; the tolerance (and whether the audit run disables prefix cache)
> is a deliberate knob. We build it faithfully to the reference (±10% / ±20%) and expose

Collaborator

We shouldn't enable TEST04 for LLM.

Collaborator Author

Modified.

Comment thread docs/compliance_audit_plan.md Outdated
Comment thread examples/09_Wan22_VideoGen_Example/single_stream_wan22_submission.yaml Outdated
Comment thread docs/compliance_audit_plan.md Outdated
│ 3. verify(runs) ; 4. write_verdict (atomic)
verify_TEST04.txt + audit_verdict.json

Collaborator

If possible, unify the output as JSON so it's easier to parse.

Also, do we need two files? Ideally one JSON file containing all the information should be enough. On the traditional MLPerf side we can always use scripts to make it pass.

Collaborator Author

MLCommons run_verification.py consumes a verify_<TEST>.txt file that requires the line "Performance check pass: True". We can remove the text file if the MLCommons run_verification.py changes.

Comment thread docs/compliance_audit_plan.md Outdated
│ back-to-back, same endpoint
├─ Phase 2 "test04" ─ 64 × sample[0] ────────► RunArtifacts[1] (qps_audit)
├─ verdict = Test04Audit.verify([ref, audit])

Collaborator

Not sure verdict is the right word to use (I know AI loves this word). Maybe auditResults will be more explicit and lower the entropy of understanding

Collaborator Author

Done

Comment thread docs/compliance_audit_plan.md
Comment thread docs/compliance_audit_plan.md Outdated

```python
class AuditTestId(str, Enum):
    TEST04 = "test04"
```

Collaborator

Again, recommend using an explicit name like output_caching_audit

Collaborator Author

Done renaming

wu6u3tw and others added 2 commits June 15, 2026 15:46
The 30s ready-check window was too tight under load: on a contended host the
metrics-aggregator/event-logger subprocesses intermittently miss it, raising
"Ready check failed: 1/2 signals received within 30.0s". Surfaced while driving
the GB300x16 TEST04 audit from a busy login node. 120s gives subprocess launch
ample headroom without changing steady-state behaviour.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Brings the design doc (docs/compliance_audit_plan.md, rebased on the
latest mlcommons#332 web edits) and the two WAN2.2 submission example configs
(offline + single-stream, using the audit: block) onto the implementation
branch so the redesign ships as one coherent PR: framework + tests + doc
+ runnable examples.

The doc's exit-code contract and module layout are corrected to match this
implementation: samples/audit_samples/sample_index/threshold, no standalone
verifier CLI, and errors via the repo-wide handler (SetupError → 3,
ExecutionError → 4) rather than a flat exit 2.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@wu6u3tw wu6u3tw force-pushed the feat/test04-compliance branch from 1806fde to ba09656 Compare June 16, 2026 16:19
@wu6u3tw wu6u3tw changed the title docs: compliance audit module redesign plan feat(compliance): MLPerf TEST04 caching audit — extensible AuditTest framework Jun 16, 2026
pip-audit flagged aiohttp 3.14.0 with CVE-2026-54273..54280 (all fixed in
3.14.1). aiohttp is a test-only dependency (mock HTTP server fixture).
pip-audit clean after the bump.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@wu6u3tw wu6u3tw left a comment

Collaborator Author

Review Council — Multi-AI Code Review

Reviewed by: Claude (Codex unavailable — bwrap sandbox blocked, no sudo to relax userns) · Depth: thorough

4 findings posted inline. Verdict-correctness finding (threshold) is the one to prioritize.

Comment thread src/inference_endpoint/compliance/tests/test04.py Outdated
Comment thread src/inference_endpoint/compliance/tests/test04.py Outdated
Comment thread src/inference_endpoint/compliance/__init__.py Outdated
wu6u3tw and others added 6 commits June 16, 2026 11:30
- threshold (high): verify() now receives the AuditConfig and passes the
  configured threshold to verify_test04. Previously the protocol verify(runs)
  had no config access, so a non-default audit.threshold was silently ignored
  (always 0.10). AuditTest.verify gains a cfg parameter; run_audit passes it.
- samples=None (medium): RunSpec.n_samples is now int | None and None
  propagates to RuntimeSettings as the full-dataset/duration-driven default
  (matching the documented "None → full dataset"), instead of mapping to 0
  (empty phase). n_requested is derived from the report's issued count when
  the spec didn't fix a count.
- verify() arity (low): explicit len(runs) == 2 guard with a clear message
  instead of an IndexError on runs[0]/runs[1].
- redundant except (low): except (OSError, PermissionError) → except OSError
  (PermissionError ⊂ OSError).

Doc updated for the new verify(runs, cfg) signature. Unit tests pass; the
e2e flake under heavy local load is the worker-init timeout, green in CI.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… arity

Covers the two review fixes at the protocol-method level (the existing tests
only exercised the verify_test04 pure function):
- verify() honors the configured cfg.threshold (audit 115 vs ref 100 → FAIL
  at 0.10, PASS at 0.20) — would have caught the dropped-threshold bug.
- verify() rejects a phase count != 2 with a clear ValueError.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
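
The two fixes exercised by these commits (the configured threshold reaching the check, and the explicit two-run guard) can be illustrated as follows. This is a simplified sketch: `RunStats` and `AuditConfig` here are minimal stand-ins with assumed fields, and `verify` returns a plain bool rather than the project's result type.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class RunStats:
    qps: float


@dataclass(frozen=True)
class AuditConfig:
    threshold: float = 0.10


def verify(runs: Sequence[RunStats], cfg: AuditConfig) -> bool:
    """Return True (PASS) when the audit run stays within cfg.threshold."""
    # Fail with a clear message instead of an IndexError on runs[0]/runs[1].
    if len(runs) != 2:
        raise ValueError(
            f"expected exactly 2 runs (reference, audit), got {len(runs)}"
        )
    ref, audit = runs
    # The threshold comes from the config, not from a hardcoded default.
    return audit.qps < ref.qps * (1.0 + cfg.threshold)


runs = [RunStats(qps=100.0), RunStats(qps=115.0)]
fails_at_default = not verify(runs, AuditConfig(threshold=0.10))
passes_when_relaxed = verify(runs, AuditConfig(threshold=0.20))
```

Both flags are True for the 115 vs 100 case, which is the behavior the new protocol-level tests pin down.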
…caching_audit_test

Rename the audit "verdict" concept to "audit results" and the opaque
"TEST04" test identifier to the descriptive "output_caching_audit_test"
across code, tests, examples, and the design doc.

- compliance/verdict.py → results.py; AuditVerdict → AuditResults;
  write_verdict → write_audit_results; output audit_verdict.json → audit_results.json
- compliance/tests/test04.py → output_caching_audit_test.py;
  Test04Audit → OutputCachingAuditTest; Test04Config → OutputCachingAuditTestConfig;
  verify_test04 → verify_output_caching_audit_test
- AuditTestId.TEST04="test04" → OUTPUT_CACHING_AUDIT_TEST="output_caching_audit_test";
  output verify_TEST04.txt → verify_OUTPUT_CACHING_AUDIT_TEST.txt; phase label/subdir renamed
- Genuine upstream MLPerf TEST04 references (test matrix, mechanism,
  compliance/nvidia/TEST04/audit.config) are preserved in the doc.

Also fix a latent bug surfaced by the rename: the results test_id used
str(AuditTestId.X), which for a (str, Enum) yields "AuditTestId.X" rather
than the value — corrupting the verify_<TEST>.txt filename and the JSON
"test" field. Use .value, and replace the circular test assertion that
masked it with one pinning the wire contract.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
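
The latent bug described in this commit is easy to reproduce in isolation. The enum below mirrors the renamed identifier from the commit message; `verify_filename` is a hypothetical helper, and the upper-casing is inferred from the `verify_OUTPUT_CACHING_AUDIT_TEST.txt` name mentioned above.

```python
from enum import Enum


class AuditTestId(str, Enum):
    OUTPUT_CACHING_AUDIT_TEST = "output_caching_audit_test"


def verify_filename(test_id: AuditTestId) -> str:
    # Use .value: str() on a (str, Enum) member gives "ClassName.MEMBER".
    return f"verify_{test_id.value.upper()}.txt"


member = AuditTestId.OUTPUT_CACHING_AUDIT_TEST
as_str = str(member)      # "AuditTestId.OUTPUT_CACHING_AUDIT_TEST"
as_value = member.value   # "output_caching_audit_test"
```

A test that compares against `str(member)` on both sides is circular and hides the problem, which is why pinning the literal wire value is the safer assertion.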
…robust results

Address review-council findings and reconcile the design doc with the code:

- audit.py: the load-pattern guard rejected only POISSON, but LoadPatternType
  also has MULTI_TURN/BURST/STEP — these slipped through despite the error
  message and design doc promising "max_throughput or concurrency" only.
  Switch to an allow-list so paced/turn-sequenced loads are rejected up front.
- compliance/__init__.py: the registry keyed on str(test.test_id), which for a
  (str, Enum) yields "AuditTestId.OUTPUT_CACHING_AUDIT_TEST" rather than the
  wire value. Key on .value (accepting a raw string for offline re-checks).
- results.py: write the fixed "test"/"passed" keys last so a stray
  details["test"]/["passed"] cannot shadow the authoritative fields.
- execute.py: a user-interrupted (Ctrl-C) main run no longer falls through and
  silently runs the audit phases.
- docs/compliance_audit_plan.md: align with the code — describe the dedicated
  probe-load for the sample-index bounds check (not "phase 1 setup") and mark
  RunSpec.n_samples as int | None.

Add regression tests for registry-by-value resolution and the
details-cannot-override-fixed-keys invariant.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
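
Two of the fixes in this commit lend themselves to a short sketch: the allow-list load-pattern guard and writing the fixed result keys last. The member names come from the commit message, but the string values of `LoadPatternType`, the helper names, and the payload layout are assumptions for illustration.

```python
from enum import Enum
from typing import Any, Mapping


class LoadPatternType(str, Enum):
    # Member names from the commit message; the string values are assumed.
    MAX_THROUGHPUT = "max_throughput"
    CONCURRENCY = "concurrency"
    POISSON = "poisson"
    MULTI_TURN = "multi_turn"
    BURST = "burst"
    STEP = "step"


AUDITABLE_PATTERNS = frozenset(
    {LoadPatternType.MAX_THROUGHPUT, LoadPatternType.CONCURRENCY}
)


def check_load_pattern(pattern: LoadPatternType) -> None:
    """Allow-list guard: patterns added later are rejected by default."""
    if pattern not in AUDITABLE_PATTERNS:
        raise ValueError(
            f"audit requires max_throughput or concurrency, got {pattern.value}"
        )


def build_results_payload(
    test: str, passed: bool, details: Mapping[str, Any]
) -> dict:
    """Copy details first, then set the authoritative keys so they win."""
    payload = dict(details)
    payload["test"] = test
    payload["passed"] = passed
    return payload
```

A deny-list that names only POISSON silently admits every pattern added afterwards, whereas the allow-list fails closed.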