Voice Assistant (VA) - Cyberwave Docs

Voice Assistant uses Silero VAD to detect speech start/end, extract utterances, and optionally re-chunk output for downstream nodes.

Sub-modality profiles

Profile	Use case
Real-Time Voice Assistant	Low-latency conversational / command endpoints
High-Noise / Industrial	Factory floors, vehicles, call centres
Batch Transcription / STT	Offline Whisper-style segmentation (~28 s max chunk)
Quiet Studio / Whisperer	Distant mic, soft speech, long pauses
Custom	Manual Silero parameters

Input / output

Input: audio from Audio Track (any supported encoding; adapted to int16 @ 16 kHz mono).

Outputs (every chunk while listening):

{
  "speech_probability": 0.87,
  "is_speaking": true,
  "sample_rate_hz": 16000,
  "channels": 1
}

Outputs (when speech ends):

{
  "audio": "<numpy int16>",
  "speech_probability": 0.92,
  "is_speaking": false,
  "start_timestamp_sec": 1.2,
  "end_timestamp_sec": 3.8,
  "sample_rate_hz": 16000,
  "channels": 1
}
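
As a rough illustration of how a downstream consumer might tell the two message shapes apart, the sketch below treats the presence of the `audio` key as the end-of-speech signal and derives the utterance length from the timestamps. The function name `utterance_duration_sec` and the idea of receiving messages as plain dictionaries are assumptions made for this example, not part of the Cyberwave API.

```python
from typing import Any, Mapping, Optional


def utterance_duration_sec(message: Mapping[str, Any]) -> Optional[float]:
    """Return the utterance length in seconds, or None for per-chunk messages.

    Hypothetical helper: assumes messages arrive as dict-like objects with
    the fields shown in the samples above.
    """
    # Only end-of-speech messages carry "audio" and both timestamps.
    if "audio" not in message:
        return None
    return message["end_timestamp_sec"] - message["start_timestamp_sec"]


# Messages mirroring the two samples above (audio replaced by a placeholder).
per_chunk = {
    "speech_probability": 0.87,
    "is_speaking": True,
    "sample_rate_hz": 16000,
    "channels": 1,
}
end_of_speech = {
    "audio": "<numpy int16>",
    "speech_probability": 0.92,
    "is_speaking": False,
    "start_timestamp_sec": 1.2,
    "end_timestamp_sec": 3.8,
    "sample_rate_hz": 16000,
    "channels": 1,
}

print(utterance_duration_sec(per_chunk))                # None
print(round(utterance_duration_sec(end_of_speech), 2))  # 2.6
```

With the sample values, the utterance spans 3.8 s - 1.2 s = 2.6 s, which at 16 kHz corresponds to roughly 41,600 samples.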

Output buffer preset (optional)

After a full utterance is detected, audio can be re-chunked for downstream consumers:

Preset	Duration	Samples @ 16 kHz
None	Full segment	—
Wake Word Engine	80 ms	1280
Speech-To-Text	4 s	64000
Custom	User-defined	—
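
The sample counts in the table follow from duration × sample rate. The sketch below reproduces that arithmetic and shows one possible way to split a finished utterance into fixed-size chunks. The helper names `samples_for` and `rechunk` are hypothetical, and the choice to emit a shorter final chunk (rather than pad or drop it) is an assumption; the node's actual behaviour for the remainder is not documented here. It uses the standard-library `array` type with typecode `"h"` (int16) in place of a NumPy array to stay dependency-free.

```python
from array import array
from typing import Iterator

SAMPLE_RATE_HZ = 16000


def samples_for(duration_ms: int, sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Convert a chunk duration in milliseconds to a sample count."""
    return duration_ms * sample_rate_hz // 1000


def rechunk(utterance: array, chunk_samples: int) -> Iterator[array]:
    """Yield consecutive slices of at most chunk_samples samples.

    Assumption: the last chunk is emitted short if the utterance length
    is not a multiple of chunk_samples.
    """
    for start in range(0, len(utterance), chunk_samples):
        yield utterance[start:start + chunk_samples]


print(samples_for(80))    # 1280  (Wake Word Engine preset)
print(samples_for(4000))  # 64000 (Speech-To-Text preset)

# A 3000-sample utterance of silence split into 80 ms chunks.
utterance = array("h", [0] * 3000)
chunk_lengths = [len(c) for c in rechunk(utterance, samples_for(80))]
print(chunk_lengths)      # [1280, 1280, 440]
```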

Recommended Audio Track preset

Set the upstream Audio Track buffer preset to Voice Assistant (32 ms) so that chunks are 512 samples, the native Silero frame size (32 ms × 16 kHz = 512). Larger chunks still work because the node reframes them internally, but they add latency.
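
To make the reframing idea concrete, here is a minimal sketch of a buffer that accepts chunks of any size and emits 512-sample frames. It is an illustration of the general technique only, not the node's actual implementation; the class name `Reframer` and its `push` method are hypothetical.

```python
from array import array
from typing import List

FRAME_SAMPLES = 512  # 32 ms at 16 kHz


class Reframer:
    """Accumulate int16 samples and emit fixed-size frames (hypothetical helper)."""

    def __init__(self, frame_samples: int = FRAME_SAMPLES) -> None:
        self.frame_samples = frame_samples
        self._pending = array("h")  # samples not yet emitted as a full frame

    def push(self, chunk: array) -> List[array]:
        """Append a chunk and return every complete frame now available."""
        self._pending.extend(chunk)
        frames = []
        while len(self._pending) >= self.frame_samples:
            frames.append(self._pending[:self.frame_samples])
            del self._pending[:self.frame_samples]
        return frames


reframer = Reframer()

# An 80 ms chunk (1280 samples) yields two frames and leaves 256 samples pending.
first = reframer.push(array("h", [0] * 1280))
print(len(first))   # 2

# Another 256 samples complete the third frame.
second = reframer.push(array("h", [0] * 256))
print(len(second))  # 1
```

The example also hints at where the extra latency comes from: with 1280-sample input, no frame can be processed until the whole 80 ms chunk has arrived, and leftover samples wait for the next chunk.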