Voice Assistant uses Silero VAD to detect speech start/end, extract utterances, and optionally re-chunk output for downstream nodes.

Sub-modality profiles

| Profile | Use case |
| --- | --- |
| Real-Time Voice Assistant | Low-latency conversational / command endpoints |
| High-Noise / Industrial | Factory floors, vehicles, call centres |
| Batch Transcription / STT | Offline Whisper-style segmentation (~28 s max chunk) |
| Quiet Studio / Whisperer | Distant mic, soft speech, long pauses |
| Custom | Manual Silero parameters |

Input / output

Input: audio from Audio Track (any supported encoding; adapted to int16 @ 16 kHz mono).

Outputs (every chunk while listening):
{
  "speech_probability": 0.87,
  "is_speaking": true,
  "sample_rate_hz": 16000,
  "channels": 1
}
Outputs (when speech ends):
{
  "audio": "<numpy int16>",
  "speech_probability": 0.92,
  "is_speaking": false,
  "start_timestamp_sec": 1.2,
  "end_timestamp_sec": 3.8,
  "sample_rate_hz": 16000,
  "channels": 1
}
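
A downstream consumer may need to tell the two payload shapes apart. The sketch below is a minimal illustration, assuming payloads arrive as Python dicts with the fields shown above and that the presence of the `audio` key marks an end-of-speech payload. `summarize_vad_output` is a hypothetical helper, not part of the product.

```python
from typing import Any, Dict


def summarize_vad_output(payload: Dict[str, Any]) -> str:
    """Describe a VAD payload (hypothetical helper for illustration)."""
    if "audio" in payload:
        # End-of-speech payload: carries the utterance and its timestamps.
        duration = payload["end_timestamp_sec"] - payload["start_timestamp_sec"]
        return f"utterance {duration:.1f} s"
    # Per-chunk payload: only the probability and the speaking flag.
    state = "speaking" if payload["is_speaking"] else "silent"
    return f"{state} p={payload['speech_probability']:.2f}"
```

Checking for `audio` rather than `is_speaking == false` is a deliberate choice in this sketch, since per-chunk payloads during silence would also report `is_speaking` as false.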

Output buffer preset (optional)

After a full utterance is detected, audio can be re-chunked for downstream consumers:
| Preset | Duration | Samples @ 16 kHz |
| --- | --- | --- |
| None | Full segment | |
| Wake Word Engine | 80 ms | 1280 |
| Speech-To-Text | 4 s | 64000 |
| Custom | User-defined | |
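
The sample counts for the presets follow from duration × sample rate (80 ms × 16 kHz = 1280, 4 s × 16 kHz = 64000). The sketch below shows that arithmetic and one possible way to re-chunk an utterance. `samples_for` and `rechunk` are hypothetical names, and leaving the final chunk short is an assumption; how the node treats a trailing partial chunk is not stated here.

```python
from typing import List, Sequence

SAMPLE_RATE_HZ = 16000


def samples_for(duration_ms: int, sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Number of samples in a chunk of the given duration."""
    return duration_ms * sample_rate_hz // 1000


def rechunk(audio: Sequence[int], chunk_samples: int) -> List[Sequence[int]]:
    """Split an utterance into consecutive chunks of chunk_samples.

    Assumption: the last chunk is left short instead of being padded.
    """
    if chunk_samples <= 0:
        raise ValueError("chunk_samples must be positive")
    return [audio[i:i + chunk_samples] for i in range(0, len(audio), chunk_samples)]
```

Because `rechunk` relies only on `len` and slicing, it should work on a NumPy int16 array as well as on a plain list.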
Set the upstream Audio Track buffer preset to Voice Assistant (32 ms) so that chunks are 512 samples, the native Silero frame size. Larger chunks still work because the node reframes them internally, but they add latency.
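
The node's internal reframing is not documented here. As a rough illustration of the idea, the sketch below cuts arbitrary-size chunks into 512-sample frames (32 ms at 16 kHz) with a carry-over buffer. `Reframer` is a hypothetical class, not the node's implementation.

```python
from typing import Iterable, List

FRAME_SAMPLES = 512  # 32 ms at 16 kHz


class Reframer:
    """Buffers incoming chunks and returns fixed-size frames (illustrative only)."""

    def __init__(self, frame_samples: int = FRAME_SAMPLES) -> None:
        self.frame_samples = frame_samples
        self._pending: List[int] = []

    def push(self, chunk: Iterable[int]) -> List[List[int]]:
        """Add a chunk; return every complete frame now available."""
        self._pending.extend(chunk)
        frames = []
        while len(self._pending) >= self.frame_samples:
            frames.append(self._pending[:self.frame_samples])
            del self._pending[:self.frame_samples]
        return frames
```

In this sketch a 1280-sample chunk yields two frames and holds the remaining 256 samples until the next chunk arrives, which illustrates why chunk sizes that are not a multiple of 512 can delay detection.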