test: categorize tests and select via `pytest --unit`/`--plugin openai`/ … flags by Bobronium · Pull Request #5945 · livekit/agents

Bobronium · 2026-06-02T22:16:13Z

TL;DR: Add an explicit source of truth for test categorization and unified selection

Select and run tests as

uv run pytest --unit 
uv run pytest --realtime
uv run pytest --plugin openai

Categorize new test modules via

import pytest
...
pytestmark = pytest.mark.unit  # or pytest.mark.plugin("provider-name")

Motivation

Currently there is no clear indication of which tests to run where, or how. ./makefile has a manually maintained list of unit-test modules to run: every new module must be added to it in order to be tested in CI, but nothing enforces this. The result is stale or ignored tests.

Plain pytest ./tests doesn't work in most environments, even with -k selection, because collection fails on missing dependencies.

This PR fixes both problems by defining a single source of truth and a mechanism for test isolation.

Changes

Add pytestmark = pytest.mark.<category> to existing modules
Add pytest --<category> flags that run tests of that category (with dependency isolation)
Add pytest --list-categories arg that lists all modules and their categories
Require categorizing future tests to prevent orphaned tests (can be temporarily disabled via --allow-uncategorized)
Fix broken tests in test_audio_recognition_aclose.py and test_speaker_id_grouping.py
Fix deadlock caused by test_cli_log_level.py

Categories and dependencies:

Category	Needs
`--unit`	—
`--plugin [name]`	that provider's dependencies/keys
`--stt` / `--tts`	cross-provider STT/TTS suite + keys
`--realtime`	live realtime model + keys
`--evals`	LiveKit inference gateway
`--docs`	the `docs` dependency group

Notes

Stale tests ignored in CI before this PR

#	Module	Tests	Note
1	`test_audio_emitter.py`	10
2	`test_audio_recognition_aclose.py`	2	fixed 1 broken test
3	`test_cli_log_level.py`	25	fixed deadlock when it's enabled
4	`test_drain_timeout.py`	6
5	`test_interruption/test_interruption_failover.py`	6
6	`test_nested_agent_task.py`	1
7	`test_speaker_id_grouping.py`	7	fixed 2 broken tests
8	`test_speech_start_time_persistence.py`	3
9	`test_stt_base.py`	4
10	`test_user_turn_exceeded.py`	9

`pytest --list-categories` output

Test categories (select with --<category>):

  unit     45 modules
             - tests/test_agent_session.py
             - tests/test_aio.py
             - tests/test_aio_itertools.py
             - tests/test_amd_classifier.py
             - tests/test_audio_decoder.py
             - tests/test_audio_emitter.py
             - tests/test_audio_recognition_aclose.py
             - tests/test_audio_recognition_handoff.py
             - tests/test_audio_recognition_push_audio.py
             - tests/test_chat_ctx.py
             - tests/test_cli_log_level.py
             - tests/test_config.py
             - tests/test_connection_pool.py
             - tests/test_debounce.py
             - tests/test_drain_timeout.py
             - tests/test_endpointing.py
             - tests/test_google_thought_signatures.py
             - tests/test_http_context_helper.py
             - tests/test_inference_stt_fallback.py
             - tests/test_inference_tts_fallback.py
             - tests/test_interruption/test_interruption_failover.py
             - tests/test_interruption/test_overlapping_speech_event.py
             - tests/test_ipc.py
             - tests/test_ivr_activity.py
             - tests/test_langgraph.py
             - tests/test_language.py
             - tests/test_nested_agent_task.py
             - tests/test_recording.py
             - tests/test_room.py
             - tests/test_room_io.py
             - tests/test_schema_gemini.py
             - tests/test_session_host.py
             - tests/test_speaker_id_grouping.py
             - tests/test_speech_start_time_persistence.py
             - tests/test_stt_base.py
             - tests/test_stt_fallback.py
             - tests/test_tokenizer.py
             - tests/test_tool_proxy.py
             - tests/test_tool_search.py
             - tests/test_tools.py
             - tests/test_transcription_filter.py
             - tests/test_tts_fallback.py
             - tests/test_user_turn_exceeded.py
             - tests/test_utils/test_audio_array_buffer.py
             - tests/test_utils/test_bounded_dict.py
  plugin   15 modules
             - tests/test_personaplex_realtime_model.py
             - tests/test_plugin_anthropic.py
             - tests/test_plugin_assemblyai_stt.py
             - tests/test_plugin_cerebras.py
             - tests/test_plugin_elevenlabs_tts.py
             - tests/test_plugin_gnani_stt.py
             - tests/test_plugin_gnani_tts.py
             - tests/test_plugin_google_llm.py
             - tests/test_plugin_google_stt.py
             - tests/test_plugin_gradium_stt.py
             - tests/test_plugin_inworld_tts.py
             - tests/test_plugin_perplexity.py
             - tests/test_plugin_perplexity_responses.py
             - tests/test_plugin_soniox_stt.py
             - tests/test_vad.py
  realtime 1 module
             - tests/test_realtime/test_realtime.py
  stt      1 module
             - tests/test_stt.py
  tts      1 module
             - tests/test_tts.py
  evals    2 modules
             - tests/test_evals.py
             - tests/test_workflows.py
  docs     1 module
             - tests/test_convert_html_docs.py

New `### Testing` section in AGENTS.md (authored by Claude)

Testing

uv run pytest tests/test_tools.py                  # Run a single test file
uv run pytest tests/test_tools.py -k "test_name"   # Run specific test
uv run pytest --unit                               # Run unit tests that don't require cloud accounts

Test categories

Every test module declares exactly one category via a module-level marker, and
each category has a matching --<category> selection flag. Selection happens
before import, so a category run never imports (or fails on) modules outside
it.

Marker	Flag	Meaning
`pytest.mark.unit`	`--unit`	fast, hermetic, no external providers/credentials/network
`pytest.mark.plugin("name")`	`--plugin [name]`	provider integration test (needs that provider's deps/keys)
`pytest.mark.stt`	`--stt`	cross-provider speech-to-text suite (`tests/test_stt.py`)
`pytest.mark.tts`	`--tts`	cross-provider text-to-speech suite (`tests/test_tts.py`)
`pytest.mark.realtime("name")`	`--realtime [name]`	realtime-model test
`pytest.mark.evals`	`--evals`	behavioral evals against the LiveKit inference gateway
`pytest.mark.docs`	`--docs`	tests for the docs-build tooling under `.github/`

uv run pytest --unit                    # the CI unit gate (no cloud accounts)
uv run pytest --plugin openai           # only the openai provider tests
uv run pytest --list-categories         # list every module grouped by category, then exit

Adding a test: give the new module a category marker (pytestmark = pytest.mark.unit, etc.) — collection fails with a hint if it lacks one. Run
pytest with the --allow-uncategorized option to temporarily disable this rule
(CI keeps it on by default).

2 STT failures seem to be unrelated to the changes.

github-actions · 2026-06-02T22:21:01Z

STT Test Results

Status: ✗ Some tests failed

Metric	Count
✓ Passed	22
✗ Failed	2
× Errors	1
→ Skipped	16
▣ Total	41
⏱ Duration	285.0s

Failed Tests

tests.test_stt::test_stream[livekit.plugins.speechmatics]

def finalizer() -> None:
        """Yield again, to finalize."""
  
        async def async_finalizer() -> None:
            try:
                await gen_obj.__anext__()
            except StopAsyncIteration:
                pass
            else:
                msg = "Async generator fixture didn't stop."
                msg += "Yield only once."
                raise ValueError(msg)
  
>       runner.run(async_finalizer(), context=context)

.venv/lib/python3.12/site-packages/pytest_asyncio/plugin.py:424: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <asyncio.runners.Runner object at 0x7f97400a5b80>
coro = <coroutine object _wrap_asyncgen_fixture.<locals>._asyncgen_fixture_wrapper.<locals>.finalizer.<locals>.async_finalizer at 0x7f974007c930>

    def run(self, coro, *, context=None):
        """Run a coroutine inside the embedded event loop."""
        if not coroutines.iscoroutine(coro):
            raise ValueError("a coroutine was expected, got {!r}".format(coro))
  
        if events._get_running_loop() is not None:
            # fail fast with short traceback
            raise RuntimeError(
                "Runner.run() cannot be called from a running event loop")
  
        self._lazy_init()
  
        if context is None:
            context = self._context
        task = self._loop.create_task(coro, context=context)
  
        if (threading.current_thread() is threading.main_thread()
            and signal.getsignal(signal.SIGINT) is signal.default_int_handler
        ):
            sigint_handler = functools.partial(self._on_sigint, main_task=task)
            try:
                signal.signal(signal.SIGINT, sigint_handler)
            except ValueError:
                # `signal.signal` may throw if `threading.main_thread` does
                # not support signals (e.g. embedded interpreter with signals
                # not registered - see gh-91880)
                sigint_handler = None

tests.test_stt::test_stream[livekit.plugins.nvidia]

stt_factory = <function parameter_factory.<locals>.<lambda> at 0x7f9740049d00>
request = <FixtureRequest for <Coroutine test_stream[livekit.plugins.nvidia]>>

    @pytest.mark.usefixtures("job_process")
    @pytest.mark.parametrize("stt_factory", STTs)
    async def test_stream(stt_factory: Callable[[], STT], request):
        sample_rate = SAMPLE_RATE
        plugin_id = request.node.callspec.id.split("-")[0]
        frames, transcript, _ = await make_test_speech(chunk_duration_ms=10, sample_rate=sample_rate)
  
        # TODO: differentiate missing key vs other errors
        try:
            stt_instance: STT = stt_factory()
        except ValueError as e:
            pytest.skip(f"{plugin_id}: {e}")
  
        async with stt_instance as stt:
            label = f"{stt.model}@{stt.provider}"
            if not stt.capabilities.streaming:
                pytest.skip(f"{label} does not support streaming")
  
            for attempt in range(MAX_RETRIES):
                try:
                    state = {"closing": False}
  
                    async def _stream_input(
                        frames: list[rtc.AudioFrame], stream: RecognizeStream, state: dict = state
                    ):
                        for frame in frames:
                            stream.push_frame(frame)
                            await asyncio.sleep(0.005)
  
                        stream.end_input()
                        state["closing"] = True
  
                    async def _stream_output(stream: RecognizeStream, state: dict = state):
                        text = ""
                        # make sure the events are sent in the right order
                        recv_start, recv_end = False, True
                        start_time = time.time()
                        got_final_transcript = False
  
                        async for event in stream:
                            if event.type == agents.stt.SpeechEventType.START_OF_SPEECH:

tests.test_stt::test_stream[livekit.agents.inference]

stt_factory = <function parameter_factory.<locals>.<lambda> at 0x7f974004a020>
request = <FixtureRequest for <Coroutine test_stream[livekit.agents.inference]>>

    @pytest.mark.usefixtures("job_process")
    @pytest.mark.parametrize("stt_factory", STTs)
    async def test_stream(stt_factory: Callable[[], STT], request):
        sample_rate = SAMPLE_RATE
        plugin_id = request.node.callspec.id.split("-")[0]
        frames, transcript, _ = await make_test_speech(chunk_duration_ms=10, sample_rate=sample_rate)
  
        # TODO: differentiate missing key vs other errors
        try:
            stt_instance: STT = stt_factory()
        except ValueError as e:
            pytest.skip(f"{plugin_id}: {e}")
  
        async with stt_instance as stt:
            label = f"{stt.model}@{stt.provider}"
            if not stt.capabilities.streaming:
                pytest.skip(f"{label} does not support streaming")
  
            for attempt in range(MAX_RETRIES):
                try:
                    state = {"closing": False}
  
                    async def _stream_input(
                        frames: list[rtc.AudioFrame], stream: RecognizeStream, state: dict = state
                    ):
                        for frame in frames:
                            stream.push_frame(frame)
                            await asyncio.sleep(0.005)
  
                        stream.end_input()
                        state["closing"] = True
  
                    async def _stream_output(stream: RecognizeStream, state: dict = state):
                        text = ""
                        # make sure the events are sent in the right order
                        recv_start, recv_end = False, True
                        start_time = time.time()
                        got_final_transcript = False
  
                        async for event in stream:
                            if event.type == agents.stt.SpeechEventType.START_OF_SPEECH:

Skipped Tests

Test	Reason
`tests.test_stt::test_recognize[livekit.plugins.assemblyai]`	universal-streaming-english@AssemblyAI does not support batch recognition
`tests.test_stt::test_recognize[livekit.plugins.speechmatics]`	enhanced@Speechmatics does not support batch recognition
`tests.test_stt::test_recognize[livekit.plugins.fireworksai]`	unknown@FireworksAI does not support batch recognition
`tests.test_stt::test_recognize[livekit.plugins.cartesia]`	ink-2@Cartesia does not support batch recognition
`tests.test_stt::test_recognize[livekit.plugins.soniox]`	stt-rt-v4@Soniox does not support batch recognition
`tests.test_stt::test_recognize[livekit.agents.inference]`	unknown@livekit does not support batch recognition
`tests.test_stt::test_recognize[livekit.plugins.azure]`	unknown@Azure STT does not support batch recognition
`tests.test_stt::test_recognize[livekit.plugins.aws]`	unknown@Amazon Transcribe does not support batch recognition
`tests.test_stt::test_recognize[livekit.plugins.gradium.STT]`	unknown@Gradium does not support batch recognition
`tests.test_stt::test_recognize[livekit.plugins.cartesia._legacy]`	ink-whisper@Cartesia does not support batch recognition
`tests.test_stt::test_recognize[livekit.plugins.deepgram.STTv2]`	flux-general-en@Deepgram does not support batch recognition
`tests.test_stt::test_stream[livekit.plugins.elevenlabs]`	scribe_v1@ElevenLabs does not support streaming
`tests.test_stt::test_stream[livekit.plugins.mistralai]`	voxtral-mini-latest@MistralAI does not support streaming
`tests.test_stt::test_stream[livekit.plugins.fal]`	Wizper@Fal does not support streaming
`tests.test_stt::test_stream[livekit.plugins.openai]`	gpt-4o-mini-transcribe@api.openai.com does not support streaming
`tests.test_stt::test_recognize[livekit.plugins.nvidia]`	unknown@unknown does not support batch recognition

Triggered by workflow run #2249

devin-ai-integration

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 5 additional findings.

… list

The `unit-tests` make target hardcoded a list of ~35 test files. That list drifts as tests are added, and it existed only because `pytest tests/` cannot be used directly: collecting the full tree imports every module, and some modules (e.g. test_convert_html_docs -> bs4, provider plugin tests) fail to import without optional/cloud dependencies, crashing collection.

Mark unit modules with `pytestmark = pytest.mark.unit` and add a `--unit` pytest flag (tests/conftest.py) that filters collection *statically* via `pytest_ignore_collect`: it reads each file's text and skips non-unit modules before they are imported, so their dependencies are never required.

- mark the 35 existing unit modules with `pytestmark = pytest.mark.unit`
- add `--unit` option + static `pytest_ignore_collect` hook in conftest
- register the `unit` marker in pyproject.toml
- `makefile` / `tests/Makefile`: `unit-tests` now runs `pytest --unit tests/`

New unit tests opt in with one line; selection no longer needs a maintained list and can never import a module it didn't intend to.

https://claude.ai/code/session_01XvWkVuQVX9kJf3gn9cRXC5 Co-authored-by: Claude <noreply@anthropic.com>
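
The static filter this commit describes could look roughly like the sketch below. It is an illustration under assumptions, not the hook from tests/conftest.py: `module_is_unit` and `ignore_decision` are hypothetical names, and the marker is assumed to appear as a line starting with `pytestmark = pytest.mark.unit`.

```python
from __future__ import annotations

from pathlib import Path

# Assumed spelling of the module-level marker line.
UNIT_MARKER = "pytestmark = pytest.mark.unit"


def module_is_unit(source: str) -> bool:
    """Decide from the file text alone whether a module is marked as unit."""
    return any(line.strip().startswith(UNIT_MARKER) for line in source.splitlines())


def ignore_decision(path: Path, unit_only: bool) -> bool | None:
    """Mimic the return contract of a pytest_ignore_collect hook.

    True means "skip this path without importing it"; None means "no opinion",
    which leaves the decision to pytest and other plugins.
    """
    if not unit_only or path.suffix != ".py" or not path.name.startswith("test_"):
        return None
    source = path.read_text(encoding="utf-8")
    return None if module_is_unit(source) else True
```

Because the decision is made on text, a module that imports an optional dependency such as bs4 is skipped before Python ever tries to import it.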

Extends the static-collection scheme from --unit to all test categories, consolidating the selection logic that was split across ./makefile and ./tests/Makefile into pytest marker flags.

Categories (one marker per module; provider-bearing ones take an arg): unit | plugin("<provider>") | realtime | stt | tts | evals

- conftest: --unit/--plugin/--realtime/--stt/--tts/--evals flags.
  * pre-import skip (substring scan, no regex) keeps each flag from importing other categories' modules / optional deps; benchmarked ~2x faster than re.
  * provider arg filtering via iter_markers() deselection, e.g. --plugin google.
  * pytest_ignore_collect now returns None (not False) when keeping a module, so --ignore and other plugins still compose.
- register all six markers in pyproject.toml; testpaths=["tests"] so the category flags parse without a positional path eating the flag value.
- add beautifulsoup4 + markdownify to the dev group (makes test_convert_html_docs importable as a unit test); uv.lock dev manifest updated minimally.
- test_llm.py left unmarked (fully commented out, 0 live tests).

NOTE: --unit membership across the 12 modules the old 35-list omitted is still under review (two have pre-existing failures); not yet finalized.

https://claude.ai/code/session_01XvWkVuQVX9kJf3gn9cRXC5 Co-authored-by: Claude <noreply@anthropic.com>
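
As a rough illustration of the provider filtering mentioned in this commit, the sketch below partitions collected items by the argument of their category marker. `split_by_provider` is a hypothetical helper; it only relies on items exposing `iter_markers(name=...)` and marks exposing `.args`, as pytest items and marks do.

```python
from __future__ import annotations

from typing import Any, Iterable


def split_by_provider(
    items: Iterable[Any], category: str, provider: str | None
) -> tuple[list[Any], list[Any]]:
    """Partition items into (selected, deselected) for a flag like --plugin google.

    provider=None stands for the bare flag and keeps every item in the category.
    """
    selected: list[Any] = []
    deselected: list[Any] = []
    for item in items:
        marks = list(item.iter_markers(name=category))
        if marks and (provider is None or any(provider in mark.args for mark in marks)):
            selected.append(item)
        else:
            deselected.append(item)
    return selected, deselected
```

Inside a `pytest_collection_modifyitems(config, items)` hook, the usual idiom would be `items[:] = selected` followed by `config.hook.pytest_deselected(items=deselected)`; whether the PR wires it exactly this way is not shown here.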

…ures)

Investigated all 48 unit-marked modules per-module (orig 35-list vs the 13 folded in). The sandbox is not a clean baseline: orig35 also has env-only non-passes (test_room needs a livekit-server; test_agent_session needs OPENAI_API_KEY), so failures were classified by *nature*, not exit code:

NEW, green in CI (kept as unit): test_audio_emitter, test_cli_log_level, test_convert_html_docs, test_drain_timeout, test_interruption_failover, test_nested_agent_task, test_personaplex_realtime_model, test_speech_start_time_persistence, test_stt_base, test_user_turn_exceeded, test_vad (only fails here on an unfetched git-LFS silero model).

NEW, genuine deterministic bugs (quarantined: left unmarked, like test_llm):
- test_speaker_id_grouping: trailing-whitespace assertion + re.match(None) TypeError.
- test_audio_recognition_aclose: aclose() hits self._stt_pipeline on a partially-initialized object -> AttributeError.

--unit now selects 46 modules. The 2 quarantined modules are real pre-existing failures (never gated by the old 35-list); they need a separate fix before joining the gate.

https://claude.ai/code/session_01XvWkVuQVX9kJf3gn9cRXC5 Co-authored-by: Claude <noreply@anthropic.com>

Categorization (axis = what a test NEEDS to run) is now explicit and complete: every test module carries a category marker and each category has a --<category> selection flag. Health (pass/xfail) is kept as a separate axis.

- Mark all: fix test_personaplex_realtime_model (unit -> realtime("nvidia"), it imports the nvidia/sphn plugin) and test_vad (unit -> plugin("silero"), it loads the ONNX model at import so it isn't hermetic). Mark the two previously unmarked modules as unit. test_llm is fully commented out (0 tests) and stays exempt.
- xfail (strict) the genuine, environment-independent failures instead of hiding them: test_speaker_id_grouping (trailing-whitespace expectation + re.match(None) TypeError) and test_audio_recognition_aclose (stale mock missing _stt_pipeline). strict=True flips CI red when the bug is fixed.
- Enforce categorization: collection fails with a clear fix hint if a module with tests has no category marker. Escape hatch --allow-uncategorized for local dev only (default = enforce, so CI stays strict).
- Add `pytest --list-categories` (import-free) listing modules per category.
- Translate file-path pytest invocations to category flags so runners and the markers can't drift: tests/Makefile (--tts, --realtime), tests.yml evals (--evals), test-stt.yml (--stt), test-realtime.yml (--realtime). Left the blockguard package test and examples/ eval as intentionally separate.
- Document the category system in AGENTS.md.

https://claude.ai/code/session_01XvWkVuQVX9kJf3gn9cRXC5 Co-authored-by: Claude <noreply@anthropic.com>

Drop the "module(s)" placeholder for proper singular/plural via a small _plural() helper, and narrow the category column. https://claude.ai/code/session_01XvWkVuQVX9kJf3gn9cRXC5 Co-authored-by: Claude <noreply@anthropic.com>

Name promised an iterator but it materialized a list; yield lazily instead. https://claude.ai/code/session_01XvWkVuQVX9kJf3gn9cRXC5 Co-authored-by: Claude <noreply@anthropic.com>

Use a triple-quoted block for the marker hint, and describe --allow-uncategorized as temporarily disabling the rule (in the hint, the option help, and AGENTS.md) instead of framing it as a never-on-CI escape hatch. https://claude.ai/code/session_01XvWkVuQVX9kJf3gn9cRXC5 Co-authored-by: Claude <noreply@anthropic.com>

Keep the marker comment factual and state the xfail reasons as the concrete failure, not commentary. https://claude.ai/code/session_01XvWkVuQVX9kJf3gn9cRXC5 Co-authored-by: Claude <noreply@anthropic.com>

https://claude.ai/code/session_01XvWkVuQVX9kJf3gn9cRXC5 Co-authored-by: Claude <noreply@anthropic.com>

test_convert_html_docs tests .github/convert_html_docs.py (a pdoc3 HTML->markdown script), not the library. Give it its own `docs` category instead of `unit`. Its deps (beautifulsoup4, markdownify) already live in the `docs` dependency group and in the lockfile, so the redundant `dev` entries are dropped and uv.lock is restored to main (the earlier diff was only a lockfile format-revision reformat). https://claude.ai/code/session_01XvWkVuQVX9kJf3gn9cRXC5 Co-authored-by: Claude <noreply@anthropic.com>

It exercises the nvidia plugin's RealtimeModel with no credentials or network (import + logic only), so it belongs with provider plugin tests, not the live realtime suite. This also keeps --realtime scoped to tests/test_realtime/. https://claude.ai/code/session_01XvWkVuQVX9kJf3gn9cRXC5 Co-authored-by: Claude <noreply@anthropic.com>

The target used to run everything except tests/test_realtime/, test_stt.py and test_tts.py; switching it to --unit silently dropped the plugin and evals tests. Select the same set via categories: --unit --plugin --evals --docs. https://claude.ai/code/session_01XvWkVuQVX9kJf3gn9cRXC5 Co-authored-by: Claude <noreply@anthropic.com>

Read category names from the [tool.pytest.ini_options] markers in pyproject.toml instead of duplicating them in conftest. config.getini() isn't usable here because the --<category> options are registered in pytest_addoption, which runs before pytest parses the ini; so the markers block is read from the file (the same source pytest uses). The enforcement hint is generated from the list too, so it can't drift. https://claude.ai/code/session_01XvWkVuQVX9kJf3gn9cRXC5 Co-authored-by: Claude <noreply@anthropic.com>

https://claude.ai/code/session_01XvWkVuQVX9kJf3gn9cRXC5 Co-authored-by: Claude <noreply@anthropic.com>

cli.log.setup_logging() (run by the AgentServer/CLI tests now gated as --unit, e.g. test_cli_log_level and test_drain_timeout) installs handlers on the root and `livekit` loggers process-wide and never restores them. Leaked, that state deadlocks test_ipc::test_slow_initialization, which streams spawned-worker logs through those loggers. Snapshot the root + livekit loggers once per session and revert them after every test via an autouse fixture. Cost is a few attribute reads per test (no threads, no I/O). https://claude.ai/code/session_01XvWkVuQVX9kJf3gn9cRXC5 Co-authored-by: Claude <noreply@anthropic.com>
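
The snapshot-and-revert idea from this commit can be sketched with the standard `logging` module alone. The names `LoggerState`, `snapshot`, and `restore` are hypothetical, and the set of attributes captured here is an assumption about what needs restoring.

```python
from __future__ import annotations

import logging
from dataclasses import dataclass


@dataclass(frozen=True)
class LoggerState:
    level: int
    propagate: bool
    disabled: bool
    handlers: tuple[logging.Handler, ...]
    filters: tuple


def snapshot(logger: logging.Logger) -> LoggerState:
    """Capture the mutable configuration of a logger (attribute reads only)."""
    return LoggerState(
        level=logger.level,
        propagate=logger.propagate,
        disabled=logger.disabled,
        handlers=tuple(logger.handlers),
        filters=tuple(logger.filters),
    )


def restore(logger: logging.Logger, state: LoggerState) -> None:
    """Put a logger back into a previously captured state."""
    logger.handlers[:] = state.handlers
    logger.filters[:] = state.filters
    logger.setLevel(state.level)
    logger.propagate = state.propagate
    logger.disabled = state.disabled
```

An autouse fixture could take the snapshots of the root and `livekit` loggers once per session and call `restore` after each test's `yield`, so handlers installed by one test never leak into the next.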

…nai") New module from main; it builds an openai RealtimeModel and inspects the session-update payload without ever connecting (fake key, no network), so it's a provider plugin test like the other test_plugin_* modules. https://claude.ai/code/session_01XvWkVuQVX9kJf3gn9cRXC5 Co-authored-by: Claude <noreply@anthropic.com>

Resolve the deterministic test bugs the strict-xfails flagged, so they pass:

- test_speaker_id_grouping: strip the fragment text before wrapping, and guard a None speaker_id before the ignore-pattern re.match (both in the test's own local helpers).
- test_audio_recognition_aclose: the hand-built mock bypassed __init__ and had gone stale w.r.t. aclose(); set the attributes aclose() actually touches (_stt_pipeline, _stt_consumer_atask, _interruption_atask, _backchannel_boundary_timer) and drop the unused _stt_atask.

Removes the three @pytest.mark.xfail(strict=True) markers.

https://claude.ai/code/session_01XvWkVuQVX9kJf3gn9cRXC5 Co-authored-by: Claude <noreply@anthropic.com>
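
The None guard mentioned for test_speaker_id_grouping can be illustrated as follows. `is_ignored_speaker` is a hypothetical stand-in for the test's local helper, whose real name and signature are not shown in this PR.

```python
from __future__ import annotations

import re


def is_ignored_speaker(speaker_id: str | None, ignore_pattern: str) -> bool:
    """Return True if the speaker matches the ignore pattern.

    re.match raises TypeError when given None instead of a string,
    so a missing speaker_id is treated as "not ignored" up front.
    """
    if speaker_id is None:
        return False
    return re.match(ignore_pattern, speaker_id) is not None
```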

New module from main (#5947); it covers get_inference_headers with fakes (no network or credentials), so it's a hermetic unit test. https://claude.ai/code/session_01XvWkVuQVX9kJf3gn9cRXC5 Co-authored-by: Claude <noreply@anthropic.com>

Bobronium · 2026-06-03T11:08:41Z

Rebased on main and added pytest.mark.unit to test_inference_utils (b893ca2)

…i`/ … flags (#5945) Co-authored-by: Claude <noreply@anthropic.com>

chenghao-mou requested a review from a team June 2, 2026 22:16

Bobronium changed the title ~~Categorize tests and allow selecting by running pytest --unit/--stt/--plugin [openai]/etc.~~ test: categorize tests and select via pytest --unit/--plugin openai/ … flags Jun 2, 2026

devin-ai-integration Bot reviewed Jun 2, 2026

Bobronium marked this pull request as draft June 2, 2026 23:52

Bobronium force-pushed the arseny/categorize-tests branch from a83ccac to ad6a5ee Compare June 3, 2026 00:25

Bobronium marked this pull request as ready for review June 3, 2026 01:30

theomonnom approved these changes Jun 3, 2026

Bobronium and others added 18 commits June 3, 2026 10:44

test: make _iter_test_files a generator

f854f77

test: drop editorializing comments, tighten xfail reasons

d35a50d

docs: drop xfail-convention note from AGENTS.md

3bf6a05

docs: add the docs category to the AGENTS.md table

3ef524e

test: categorize test_inference_utils as unit

b893ca2

Bobronium force-pushed the arseny/categorize-tests branch from b59db1d to b893ca2 Compare June 3, 2026 10:45

Bobronium merged commit daeacfe into main Jun 3, 2026
26 checks passed

Bobronium deleted the arseny/categorize-tests branch June 3, 2026 11:09

detail-app Bot mentioned this pull request Jun 3, 2026

docs: update test commands to use pytest category flags #5958

Merged

longcw pushed a commit that referenced this pull request Jun 4, 2026

test: categorize tests and select via pytest --unit/`--plugin opena…

b855d20

…i`/ … flags (#5945) Co-authored-by: Claude <noreply@anthropic.com>
