test: categorize tests and select via pytest --unit/--plugin openai/ … flags #5945
Merged
STT Test Results
Status: ✗ Some tests failed
Failed Tests
Skipped Tests
Triggered by workflow run #2249
a83ccac to ad6a5ee
theomonnom approved these changes on Jun 3, 2026
… list
The `unit-tests` make target hardcoded a list of ~35 test files. That list
drifts as tests are added, and it existed only because `pytest tests/` cannot
be used directly: collecting the full tree imports every module, and some
modules (e.g. test_convert_html_docs -> bs4, provider plugin tests) fail to
import without optional/cloud dependencies, crashing collection.
Mark unit modules with `pytestmark = pytest.mark.unit` and add a `--unit`
pytest flag (tests/conftest.py) that filters collection *statically* via
`pytest_ignore_collect` -- it reads each file's text and skips non-unit
modules before they are imported, so their dependencies are never required.
- mark the 35 existing unit modules with `pytestmark = pytest.mark.unit`
- add `--unit` option + static `pytest_ignore_collect` hook in conftest
- register the `unit` marker in pyproject.toml
- `makefile` / `tests/Makefile`: `unit-tests` now runs `pytest --unit tests/`
New unit tests opt in with one line; selection no longer needs a maintained
list and can never import a module it didn't intend to.
https://claude.ai/code/session_01XvWkVuQVX9kJf3gn9cRXC5
Co-authored-by: Claude <noreply@anthropic.com>
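The static filter this commit describes could look roughly like the sketch below. It is an illustration only, not the code in `tests/conftest.py`: the constant `UNIT_MARKER`, the helper `is_unit_module`, and the option name `unit` are assumptions, and the hook signature assumes pytest 7 or later (where `pytest_ignore_collect` receives `collection_path`).

```python
from pathlib import Path

# Assumed marker text; the commit only says that each unit module sets
# `pytestmark = pytest.mark.unit`.
UNIT_MARKER = "pytest.mark.unit"


def is_unit_module(path: Path) -> bool:
    """Return True if the file's text mentions the unit marker.

    The file is read as text and never imported, so the optional
    dependencies of non-unit modules are not needed.
    """
    try:
        text = path.read_text(encoding="utf-8")
    except OSError:
        return False
    return UNIT_MARKER in text


def pytest_ignore_collect(collection_path: Path, config):
    """Skip non-unit test modules before import when --unit is given."""
    # Hypothetical option name; assumes --unit was registered as "unit".
    if not config.getoption("unit", default=False):
        return None  # flag not set: no opinion
    is_test_file = (
        collection_path.is_file()
        and collection_path.suffix == ".py"
        and collection_path.name.startswith("test_")
    )
    if is_test_file and not is_unit_module(collection_path):
        return True  # ignore: the module is never imported
    return None  # keep directories, conftest.py and unit modules
```

Because the decision is made from file text alone, a module that imports `bs4` or a provider SDK is dropped before Python ever tries to import it.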
Extends the static-collection scheme from --unit to all test categories,
consolidating the selection logic that was split across ./makefile and
./tests/Makefile into pytest marker flags.
Categories (one marker per module; provider-bearing ones take an arg):
unit | plugin("<provider>") | realtime | stt | tts | evals
- conftest: --unit/--plugin/--realtime/--stt/--tts/--evals flags.
* pre-import skip (substring scan, no regex) keeps each flag from importing
other categories' modules / optional deps; benchmarked ~2x faster than re.
* provider arg filtering via iter_markers() deselection, e.g. --plugin google.
* pytest_ignore_collect now returns None (not False) when keeping a module,
so --ignore and other plugins still compose.
- register all six markers in pyproject.toml; testpaths=["tests"] so the
category flags parse without a positional path eating the flag value.
- add beautifulsoup4 + markdownify to the dev group (makes test_convert_html_docs
importable as a unit test); uv.lock dev manifest updated minimally.
- test_llm.py left unmarked (fully commented out, 0 live tests).
NOTE: --unit membership across the 12 modules the old 35-list omitted is still
under review (two have pre-existing failures); not yet finalized.
https://claude.ai/code/session_01XvWkVuQVX9kJf3gn9cRXC5
Co-authored-by: Claude <noreply@anthropic.com>
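The provider filtering mentioned above (`iter_markers()` deselection, e.g. `--plugin google`) might be sketched as follows. The helper `split_by_provider` and the option name `plugin` are hypothetical, and the sketch assumes the option holds either a provider name or nothing; the actual conftest may differ.

```python
from typing import Iterable, List, Optional, Tuple


def split_by_provider(
    items: Iterable, marker_name: str, provider: Optional[str]
) -> Tuple[List, List]:
    """Partition collected items into (kept, deselected).

    An item is kept when it carries `marker_name` and, if a provider was
    requested, one of those markers has that provider as its first argument.
    """
    kept, deselected = [], []
    for item in items:
        markers = list(item.iter_markers(name=marker_name))
        matches = markers and (
            provider is None or any(m.args[:1] == (provider,) for m in markers)
        )
        (kept if matches else deselected).append(item)
    return kept, deselected


def pytest_collection_modifyitems(config, items):
    """Deselect plugin tests whose provider differs from `--plugin <name>`."""
    provider = config.getoption("plugin", default=None)  # assumed option name
    if not provider:
        return
    kept, deselected = split_by_provider(items, "plugin", provider)
    if deselected:
        config.hook.pytest_deselected(items=deselected)  # reported as deselected
        items[:] = kept  # pytest expects the list to be modified in place
```

Deselecting after collection (rather than ignoring files) is what lets a single module-level marker argument such as `plugin("google")` drive the selection.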
…ures)
Investigated all 48 unit-marked modules per-module (orig 35-list vs the 13
folded in). The sandbox is not a clean baseline — orig35 also has env-only
non-passes (test_room needs a livekit-server; test_agent_session needs
OPENAI_API_KEY) — so failures were classified by *nature*, not exit code:
NEW, green in CI (kept as unit): test_audio_emitter, test_cli_log_level,
test_convert_html_docs, test_drain_timeout, test_interruption_failover,
test_nested_agent_task, test_personaplex_realtime_model,
test_speech_start_time_persistence, test_stt_base, test_user_turn_exceeded,
test_vad (only fails here on an unfetched git-LFS silero model).
NEW, genuine deterministic bugs (quarantined — left unmarked, like test_llm):
- test_speaker_id_grouping: trailing-whitespace assertion + re.match(None)
TypeError.
- test_audio_recognition_aclose: aclose() hits self._stt_pipeline on a
partially-initialized object -> AttributeError.
--unit now selects 46 modules. The 2 quarantined modules are real pre-existing
failures (never gated by the old 35-list); they need a separate fix before
joining the gate.
https://claude.ai/code/session_01XvWkVuQVX9kJf3gn9cRXC5
Co-authored-by: Claude <noreply@anthropic.com>
Categorization (axis = what a test NEEDS to run) is now explicit and complete:
every test module carries a category marker and each category has a --<category>
selection flag. Health (pass/xfail) is kept as a separate axis.
- Mark all: fix test_personaplex_realtime_model (unit -> realtime("nvidia"), it
imports the nvidia/sphn plugin) and test_vad (unit -> plugin("silero"), it
loads the ONNX model at import so it isn't hermetic). Mark the two previously
unmarked modules as unit. test_llm is fully commented out (0 tests) and stays
exempt.
- xfail (strict) the genuine, environment-independent failures instead of
hiding them: test_speaker_id_grouping (trailing-whitespace expectation +
re.match(None) TypeError) and test_audio_recognition_aclose (stale mock
missing _stt_pipeline). strict=True flips CI red when the bug is fixed.
- Enforce categorization: collection fails with a clear fix hint if a module
with tests has no category marker. Escape hatch --allow-uncategorized for
local dev only (default = enforce, so CI stays strict).
- Add `pytest --list-categories` (import-free) listing modules per category.
- Translate file-path pytest invocations to category flags so runners and the
markers can't drift: tests/Makefile (--tts, --realtime), tests.yml evals
(--evals), test-stt.yml (--stt), test-realtime.yml (--realtime). Left the
blockguard package test and examples/ eval as intentionally separate.
- Document the category system in AGENTS.md.
https://claude.ai/code/session_01XvWkVuQVX9kJf3gn9cRXC5
Co-authored-by: Claude <noreply@anthropic.com>
Drop the "module(s)" placeholder for proper singular/plural via a small
_plural() helper, and narrow the category column.
https://claude.ai/code/session_01XvWkVuQVX9kJf3gn9cRXC5
Co-authored-by: Claude <noreply@anthropic.com>
Name promised an iterator but it materialized a list; yield lazily instead.
https://claude.ai/code/session_01XvWkVuQVX9kJf3gn9cRXC5
Co-authored-by: Claude <noreply@anthropic.com>
Use a triple-quoted block for the marker hint, and describe
--allow-uncategorized as temporarily disabling the rule (in the hint, the
option help, and AGENTS.md) instead of framing it as a never-on-CI escape
hatch.
https://claude.ai/code/session_01XvWkVuQVX9kJf3gn9cRXC5
Co-authored-by: Claude <noreply@anthropic.com>
Keep the marker comment factual and state the xfail reasons as the concrete
failure, not commentary.
https://claude.ai/code/session_01XvWkVuQVX9kJf3gn9cRXC5
Co-authored-by: Claude <noreply@anthropic.com>
https://claude.ai/code/session_01XvWkVuQVX9kJf3gn9cRXC5
Co-authored-by: Claude <noreply@anthropic.com>
test_convert_html_docs tests .github/convert_html_docs.py (a pdoc3
HTML->markdown script), not the library. Give it its own `docs` category
instead of `unit`. Its deps (beautifulsoup4, markdownify) already live in the
`docs` dependency group and in the lockfile, so the redundant `dev` entries are
dropped and uv.lock is restored to main (the earlier diff was only a lockfile
format-revision reformat).
https://claude.ai/code/session_01XvWkVuQVX9kJf3gn9cRXC5
Co-authored-by: Claude <noreply@anthropic.com>
It exercises the nvidia plugin's RealtimeModel with no credentials or network
(import + logic only), so it belongs with provider plugin tests, not the live
realtime suite. This also keeps --realtime scoped to tests/test_realtime/.
https://claude.ai/code/session_01XvWkVuQVX9kJf3gn9cRXC5
Co-authored-by: Claude <noreply@anthropic.com>
The target used to run everything except tests/test_realtime/, test_stt.py and
test_tts.py; switching it to --unit silently dropped the plugin and evals
tests. Select the same set via categories: --unit --plugin --evals --docs.
https://claude.ai/code/session_01XvWkVuQVX9kJf3gn9cRXC5
Co-authored-by: Claude <noreply@anthropic.com>
Read category names from the [tool.pytest.ini_options] markers in
pyproject.toml instead of duplicating them in conftest. config.getini() isn't
usable here because the --<category> options are registered in
pytest_addoption, which runs before pytest parses the ini; so the markers block
is read from the file (the same source pytest uses). The enforcement hint is
generated from the list too, so it can't drift.
https://claude.ai/code/session_01XvWkVuQVX9kJf3gn9cRXC5
Co-authored-by: Claude <noreply@anthropic.com>
https://claude.ai/code/session_01XvWkVuQVX9kJf3gn9cRXC5
Co-authored-by: Claude <noreply@anthropic.com>
cli.log.setup_logging() (run by the AgentServer/CLI tests now gated as --unit,
e.g. test_cli_log_level and test_drain_timeout) installs handlers on the root
and `livekit` loggers process-wide and never restores them. Leaked, that state
deadlocks test_ipc::test_slow_initialization, which streams spawned-worker logs
through those loggers. Snapshot the root + livekit loggers once per session and
revert them after every test via an autouse fixture. Cost is a few attribute
reads per test (no threads, no I/O).
https://claude.ai/code/session_01XvWkVuQVX9kJf3gn9cRXC5
Co-authored-by: Claude <noreply@anthropic.com>
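The snapshot-and-revert idea can be sketched with plain `logging` calls as below. The function names and the exact set of attributes captured are assumptions; the commit does not say which fields the real fixture saves.

```python
import logging
from typing import Tuple

# (handlers, filters, level, propagate, disabled)
LoggerState = Tuple[list, list, int, bool, bool]


def snapshot(logger: logging.Logger) -> LoggerState:
    """Copy the mutable state a logging setup call is likely to change."""
    return (
        list(logger.handlers),
        list(logger.filters),
        logger.level,
        logger.propagate,
        logger.disabled,
    )


def restore(logger: logging.Logger, state: LoggerState) -> None:
    """Put the logger back to a previously captured state."""
    handlers, filters, level, propagate, disabled = state
    logger.handlers[:] = handlers
    logger.filters[:] = filters
    logger.setLevel(level)
    logger.propagate = propagate
    logger.disabled = disabled
```

In a conftest, one would presumably call `snapshot()` once for `logging.getLogger()` and `logging.getLogger("livekit")`, then call `restore()` after the `yield` of an autouse fixture so every test starts from the same logger state.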
…nai")
New module from main; it builds an openai RealtimeModel and inspects the
session-update payload without ever connecting (fake key, no network), so it's
a provider plugin test like the other test_plugin_* modules.
https://claude.ai/code/session_01XvWkVuQVX9kJf3gn9cRXC5
Co-authored-by: Claude <noreply@anthropic.com>
Resolve the deterministic test bugs the strict-xfails flagged, so they pass:
- test_speaker_id_grouping: strip the fragment text before wrapping, and guard
  a None speaker_id before the ignore-pattern re.match (both in the test's own
  local helpers).
- test_audio_recognition_aclose: the hand-built mock bypassed __init__ and had
  gone stale w.r.t. aclose(); set the attributes aclose() actually touches
  (_stt_pipeline, _stt_consumer_atask, _interruption_atask,
  _backchannel_boundary_timer) and drop the unused _stt_atask.
Removes the three @pytest.mark.xfail(strict=True) markers.
https://claude.ai/code/session_01XvWkVuQVX9kJf3gn9cRXC5
Co-authored-by: Claude <noreply@anthropic.com>
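The two `test_speaker_id_grouping` fixes follow a common pattern, shown here in isolation. Both helpers and the `[tag] text` wrapping format are hypothetical stand-ins; the test's actual local helpers are not reproduced in this PR page.

```python
import re
from typing import Optional


def is_ignored_speaker(speaker_id: Optional[str], ignore_pattern: str) -> bool:
    """Return True when the speaker id matches the ignore pattern.

    re.match raises TypeError when the string argument is None, so a
    missing speaker id is treated as "not ignored" before the regex runs.
    """
    if speaker_id is None:
        return False
    return re.match(ignore_pattern, speaker_id) is not None


def wrap_fragment(text: str, tag: str) -> str:
    """Strip the fragment first so trailing whitespace cannot leak into the result."""
    return f"[{tag}] {text.strip()}"
```

With the guard in place, a `None` id takes the early return instead of reaching `re.match`.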
New module from main (#5947); it covers get_inference_headers with fakes (no
network or credentials), so it's a hermetic unit test.
https://claude.ai/code/session_01XvWkVuQVX9kJf3gn9cRXC5
Co-authored-by: Claude <noreply@anthropic.com>
b59db1d to b893ca2
Rebased on main and added |
longcw pushed a commit that referenced this pull request on Jun 4, 2026
…i`/ … flags (#5945) Co-authored-by: Claude <noreply@anthropic.com>
TL;DR: Add explicit source of truth for tests categorization and unified selection
Select and run tests as `pytest --<category>`
Categorize new test modules via `pytestmark = pytest.mark.<category>`
Motivation
Currently there is no clear indication of which tests to run where, or how.
- `./makefile` has a manually maintained list of unit-test modules to run: every new module must be added to it in order to be tested in CI, but nothing keeps this in check. This leads to stale/ignored tests.
- Plain `pytest ./tests` doesn't work in most environments, even with `-k` selection, because collection fails on missing dependencies.
This PR fixes both problems by defining a single source of truth and a mechanism for test isolation.
Changes
- Add `pytestmark = pytest.mark.<category>` to existing modules
- Add `pytest --<category>` flags that run the tests of that category (with dependency isolation)
- Add `pytest --list-categories` arg that lists all modules and their categories
- Enforce that every test module has a category (disable with `--allow-uncategorized`)
- Fix `test_audio_recognition_aclose.py` and `test_speaker_id_grouping.py`
- `test_cli_log_level.py`
Categories and dependencies:
- `--unit`
- `--plugin [name]`
- `--stt` / `--tts`
- `--realtime`
- `--evals`
- `--docs`: `docs` dependency group
Notes
Stale tests ignored in CI before this PR
- `test_audio_emitter.py`
- `test_audio_recognition_aclose.py`
- `test_cli_log_level.py`
- `test_drain_timeout.py`
- `test_interruption/test_interruption_failover.py`
- `test_nested_agent_task.py`
- `test_speaker_id_grouping.py`
- `test_speech_start_time_persistence.py`
- `test_stt_base.py`
- `test_user_turn_exceeded.py`
`pytest --list-categories` output
New `### Testing` section in AGENTS.md (authored by Claude)
Testing
Test categories
Every test module declares exactly one category via a module-level marker, and
each category has a matching `--<category>` selection flag. Selection happens
before import, so a category run never imports (or fails on) modules outside
it.
- `pytest.mark.unit`: `--unit`
- `pytest.mark.plugin("name")`: `--plugin [name]`
- `pytest.mark.stt`: `--stt` (`tests/test_stt.py`)
- `pytest.mark.tts`: `--tts` (`tests/test_tts.py`)
- `pytest.mark.realtime("name")`: `--realtime [name]`
- `pytest.mark.evals`: `--evals`
- `pytest.mark.docs`: `--docs` (`.github/`)
Adding a test: give the new module a category marker
(`pytestmark = pytest.mark.unit`, etc.); collection fails with a hint if it
lacks one. Run pytest with the `--allow-uncategorized` option to temporarily
disable this rule (CI keeps it on by default).
2 STT failures seem to be unrelated to the changes.