Stream audio from an edge microphone into your workflow — one chunk at a time.
The Audio Track trigger fires once for every audio chunk a twin’s microphone publishes on the edge data bus. Each chunk is typically 20 ms of PCM audio, so the workflow receives an unbounded stream of small frames rather than a single large buffer.
Audio Track does not limit the total stream length. A conference call, book reading, or any other continuous audio source runs indefinitely — each 20 ms chunk triggers the workflow independently.
Microphone (edge) → Zenoh data bus → Audio Track Trigger → [Audio Assistant / Call Model / ...]
The Audio Track trigger is edge-only. It hooks into the Cyberwave SDK’s @cw.on_audio(twin_uuid) decorator and receives decoded PCM samples from the local Zenoh data bus — no cloud round-trip.
FIFO buffer mode: vad (32 ms), wake-word (80 ms), stt (4 s), or custom. See Buffer Presets.
buffer_size_s | Buffer Size (s) | 1.0 | No | Custom buffer duration. Only used when preset is custom.
min_samples | Skip Empty Audio | 1 | No | Minimum samples a chunk must contain. Drops empty frames from device glitches. See Safety Guards.
max_chunk_seconds | Max Single Chunk Length (s) | 10 | No | Maximum duration of a single chunk in seconds. Drops oversized buffers from device stalls. See Safety Guards.
For most use cases — including long conference calls, book readings, and continuous monitoring — the defaults work out of the box. You do not need to change any advanced settings.
The microphone sends audio in small 20 ms chunks, but most downstream nodes need a larger buffer to work correctly. The Buffer Preset parameter controls how much audio is accumulated before a buffer is emitted:
Preset | Duration | Samples (@ 16 kHz) | Use Case
vad | 32 ms | 512 | Audio Assistant / Silero VAD (default)
wake-word | 80 ms | 1,280 | Wake Word Engine (OpenWakeWord internal frame size)
stt | 4 s | 64,000 | Whisper / STT models
custom | User-defined | int(16000 × buffer_size_s) | Any custom duration
Choose the wake-word preset (Wake Word Engine, 80 ms) when feeding directly into a Wake Word Engine node. It matches OpenWakeWord’s internal processing frame of 1,280 samples, so the engine does not need to re-buffer internally and detection latency stays minimal.
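The accumulation step can be pictured as a sample-count FIFO that turns small incoming chunks into fixed-size buffers. The following is a minimal sketch, not the node's actual implementation: `PRESET_MS`, `preset_samples`, and `FrameBuffer` are hypothetical names introduced for illustration, and samples are modelled as plain Python lists.

```python
from collections import deque

# Preset durations in milliseconds, taken from the Buffer Presets table.
PRESET_MS = {"vad": 32, "wake-word": 80, "stt": 4000}


def preset_samples(preset, sample_rate_hz=16000, buffer_size_s=1.0):
    """Return how many samples one emitted buffer holds (hypothetical helper)."""
    if preset == "custom":
        # Same formula as the table: int(sample_rate × buffer_size_s).
        return int(sample_rate_hz * buffer_size_s)
    # Integer arithmetic avoids float rounding for the fixed presets.
    return sample_rate_hz * PRESET_MS[preset] // 1000


class FrameBuffer:
    """Accumulate small chunks and emit fixed-size frames in FIFO order."""

    def __init__(self, frame_samples):
        self.frame_samples = frame_samples
        self._fifo = deque()

    def push(self, chunk):
        """Add one chunk and return every complete frame now available."""
        self._fifo.extend(chunk)
        frames = []
        while len(self._fifo) >= self.frame_samples:
            frames.append(
                [self._fifo.popleft() for _ in range(self.frame_samples)]
            )
        return frames
```

With the vad preset at 16 kHz, eight 320-sample chunks (2,560 samples) come out as five 512-sample frames. A frame is emitted only once enough samples have arrived, so frame boundaries do not have to line up with the 20 ms chunk boundaries.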
The Audio Track trigger is fully sample-rate-agnostic. The microphone driver sends 20 ms chunks regardless of the sample rate — only the number of samples per chunk changes:
Sample Rate | Samples per 20 ms chunk | Common Devices
48,000 Hz | 960 | USB microphones, most laptops
44,100 Hz | 882 | CD-quality audio interfaces
32,000 Hz | 640 | Some embedded boards, Bluetooth profiles
16,000 Hz | 320 | Telephony, low-bandwidth IoT devices
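The sample counts in the table follow directly from the chunk duration. As a small worked sketch (the helper names `samples_per_chunk` and `chunk_duration_s` are illustrative, not part of the SDK):

```python
CHUNK_MS = 20  # chunk duration stated in the text


def samples_per_chunk(sample_rate_hz, chunk_ms=CHUNK_MS):
    """Number of samples in one chunk at the given sample rate."""
    return sample_rate_hz * chunk_ms // 1000


def chunk_duration_s(num_samples, sample_rate_hz):
    """Duration of a chunk in seconds, derived from its sample count."""
    return num_samples / sample_rate_hz


for rate in (48000, 44100, 32000, 16000):
    print(rate, samples_per_chunk(rate))
```

The second helper shows the reverse direction: given a sample count and the rate reported on the wire, the duration can be recovered without assuming a fixed rate.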
At runtime, the node reads the actual sample rate from wire metadata and uses it for all duration calculations. If the wire rate differs from the configured sample_rate_hz, a warning is logged and the wire value takes precedence.
When the actual microphone sample rate or channel count differs from what you configured, the Audio Track node:
Logs a warning with both the configured and actual values
Publishes a twin alert (audio_track_mismatch, severity warning) so you can see the mismatch in the Cyberwave UI
Uses the actual wire value for all processing — the configured value is only a fallback when metadata is missing
This means you can leave sample_rate_hz at the default 16000 and the node will still work correctly with a 48 kHz microphone — it just generates a one-time mismatch warning.
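The precedence rule can be summarised in a few lines. This is a sketch of the described behaviour only, under the assumption that missing metadata is represented as `None`; `resolve_sample_rate` is a hypothetical name, and the twin alert is not modelled here, only the log warning.

```python
import logging

logger = logging.getLogger("audio_track_sketch")


def resolve_sample_rate(configured_hz, wire_hz=None):
    """Return (effective_hz, mismatch) following the wire-first rule.

    wire_hz is the rate read from wire metadata, or None when the
    metadata is missing (an assumption made for this sketch).
    """
    if wire_hz is None:
        # No metadata: the configured value is the fallback.
        return configured_hz, False
    mismatch = wire_hz != configured_hz
    if mismatch:
        logger.warning(
            "sample rate mismatch: configured=%d Hz, actual=%d Hz",
            configured_hz,
            wire_hz,
        )
    # The wire value always takes precedence when it is present.
    return wire_hz, mismatch
```

For example, `resolve_sample_rate(16000, 48000)` returns `(48000, True)`, while `resolve_sample_rate(16000)` falls back to `(16000, False)`.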
The Audio Track trigger includes two optional guards that filter out degenerate chunks before they reach downstream nodes. Both are under Advanced settings in the inspector.
Max Single Chunk Length (max_chunk_seconds). Default: 10 seconds. Any individual chunk longer than 10 s is rejected. Normal chunks are ~20 ms; an oversized chunk only appears when the device stalls and then flushes a large backlog at once. Sending such a chunk into downstream processing (VAD, STT) could cause memory spikes or crashes.
This does not limit total stream duration. The workflow receives chunks indefinitely — this guard only catches abnormally large individual buffers.
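Both guards amount to a per-chunk check on sample count and duration. The sketch below illustrates the idea using the documented defaults (min_samples = 1, max_chunk_seconds = 10); `passes_guards` is a hypothetical function, not the node's real code.

```python
def passes_guards(num_samples, sample_rate_hz,
                  min_samples=1, max_chunk_seconds=10.0):
    """Return True if a chunk should be forwarded downstream."""
    # Guard 1: drop empty or near-empty frames from device glitches.
    if num_samples < min_samples:
        return False
    # Guard 2: drop oversized buffers flushed after a device stall.
    duration_s = num_samples / sample_rate_hz
    return duration_s <= max_chunk_seconds
```

A normal 320-sample chunk at 16 kHz passes, an empty chunk is dropped, and an 11-second backlog is dropped. Because the check runs on each chunk independently, it places no bound on how long the stream as a whole can run.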
Audio Track → Audio Assistant (streaming) → Call Model (Whisper)
The Audio Assistant uses Silero VAD to detect speech boundaries in the continuous chunk stream and outputs clean utterances for STT. It handles sample rate conversion (resampling to 16 kHz) and stereo-to-mono downmixing automatically.
Audio Track (wake-word preset) → Wake Word Engine → Call Model (Whisper)
The Audio Track emits 80 ms chunks (1280 samples) which match OpenWakeWord’s internal frame size. The Wake Word Engine detects the trigger phrase and then streams buffered audio chunks to Whisper for transcription.
Future pipeline for detecting glass breaks, alarms, or distress calls using acoustic event detection (AED). SSG mode is currently a skeleton — audio is forwarded unchanged.
The worker is compiled by the workflow code assembler and synced to the edge via cyberwave workflow sync or automatic periodic sync. No audio data leaves the device unless a downstream node explicitly sends it (e.g. to a cloud STT API).