feat(soniox): support stt-rt-v5 with endpoint_sensitivity option #6126
2000ms is the Soniox API's own default and works well in practice. The previous value of 500ms, the API minimum, is too aggressive and can cause word recognition issues when the model finalizes tokens too early.
  enable_language_identification: bool = True
- max_endpoint_delay_ms: int = 500
+ max_endpoint_delay_ms: int = 2000
🚩 Breaking default change: max_endpoint_delay_ms 500 → 2000
The default max_endpoint_delay_ms changed from 500 to 2000. This is a 4× increase in the maximum endpoint detection delay, meaning existing users who rely on the default will experience noticeably later speech finalization. While this appears intentional for the v5 model, it is a behavioral breaking change for any caller that constructs STTOptions() without explicitly setting this field. The livekit-plugins-inworld plugin references soniox/stt-rt-v4 in comments (livekit-plugins/livekit-plugins-inworld/livekit/plugins/inworld/stt.py:55) — that plugin may also need updating if it depends on these defaults.
tinalenguyen left a comment
lgtm, left a small comment
@@ -121,17 +119,25 @@ class STTOptions:
  enable_speaker_diarization: bool = False
  enable_language_identification: bool = True
- max_endpoint_delay_ms: int = 500
+ max_endpoint_delay_ms: int = 2000
@mihafabcic-soniox could you provide more context on why you changed the default here?
Some context is in the PR description. The change came from a real customer who hit transcription issues from endpoints firing too aggressively at 500ms (e.g. extra digits added to phone numbers). Bumping to 2000ms (the Soniox API's own default) resolved them.
Updates the LiveKit Soniox plugin for the v5 model.
- Adds endpoint_sensitivity to STTOptions (float | None, range -1.0 to 1.0). Controls how quickly the model commits endpoints. Higher values finalize sooner. Only supported by v5; earlier models reject the field. Skipped on the wire when None so the server uses its default.
- Support for stt-rt-v5.
- Default max_endpoint_delay_ms raised from 500 (the API minimum) to 2000. The old default was too aggressive on phone-call audio: short pauses between word or digit groups would cause Soniox to finalize a segment too early, before the model had enough context. 2000 matches the Soniox API's own default.
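
For reviewers who want to see the "skipped on the wire when None" behavior concretely, here is a minimal sketch. It is not the plugin's code: `SketchSTTOptions` and `build_config` are hypothetical names, the `stt-rt-v5` model default is an assumption, and the payload keys are assumed to mirror the option names. Only the field names, the -1.0 to 1.0 range, and the 2000 default come from this PR.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass
class SketchSTTOptions:
    """Hypothetical stand-in for the plugin's STTOptions; not the real class."""

    model: str = "stt-rt-v5"  # assumed default, for illustration only
    max_endpoint_delay_ms: int = 2000  # new default described in this PR
    endpoint_sensitivity: float | None = None  # v5 only; None = server default


def build_config(opts: SketchSTTOptions) -> dict[str, Any]:
    """Build a request payload, omitting endpoint_sensitivity when it is None."""
    sensitivity = opts.endpoint_sensitivity
    if sensitivity is not None and not -1.0 <= sensitivity <= 1.0:
        raise ValueError("endpoint_sensitivity must be between -1.0 and 1.0")

    # Payload keys are assumed to match the option names.
    config: dict[str, Any] = {
        "model": opts.model,
        "max_endpoint_delay_ms": opts.max_endpoint_delay_ms,
    }
    # Leave the key out entirely so the server applies its own default.
    if sensitivity is not None:
        config["endpoint_sensitivity"] = sensitivity
    return config
```

With this shape, a caller who depends on the previous behavior could pin it explicitly, e.g. `SketchSTTOptions(max_endpoint_delay_ms=500)`, instead of relying on the default.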