Audio Track Trigger

The Audio Track trigger fires once for every audio chunk a twin’s microphone publishes on the edge data bus. Each chunk is typically 20 ms of PCM audio, so the workflow receives a continuous, unbounded stream of small frames rather than a single large buffer.

Audio Track does not limit the total stream length. A conference call, book reading, or any other continuous audio source can run indefinitely, and each 20 ms chunk triggers the workflow independently.

Pipeline Position

Microphone (edge) → Zenoh data bus → Audio Track Trigger → [Audio Assistant / Call Model / ...]

The Audio Track trigger is edge-only. It hooks into the Cyberwave SDK’s `@cw.on_audio(twin_uuid)` decorator and receives decoded PCM samples from the local Zenoh data bus, with no cloud round-trip.

Quick Start

1. Add an Audio Track trigger node to your workflow.
2. Select the Twin whose microphone you want to stream.
3. (Optional) Set the Audio Track / Sensor ID if the twin has multiple microphones.
4. Connect to a downstream node, typically an Audio Assistant for voice activity detection (VAD) or a Call Model node for direct speech-to-text (STT).
5. Activate the workflow and sync to the edge device.

Inputs (Configuration)

Parameter	Label	Default	Required	Description
`twin_uuid`	Twin UUID	—	Yes	The twin whose audio stream triggers the workflow.
`audio_track_id`	Audio Track / Sensor ID	`"default"`	No	Sensor identifier on the twin. Use `"default"` for the primary microphone. Only needed when a twin has multiple audio sensors.
`sample_rate_hz`	Expected Sample Rate	`16000`	No	Expected sample rate in Hz. The actual rate from wire metadata overrides this at runtime, so you rarely need to change it.
`channels`	Channels	`1`	No	Expected channel count (1 = mono, 2 = stereo). Wire metadata overrides at runtime.
`buffer_preset`	Buffer Preset	`"vad"`	No	FIFO buffer mode: `vad` (32 ms), `wake-word` (80 ms), `stt` (4 s), or `custom`. See Buffer Presets.
`buffer_size_s`	Buffer Size (s)	`1.0`	No	Custom buffer duration. Only used when preset is `custom`.
`min_samples`	Skip Empty Audio	`1`	No	Minimum samples a chunk must contain. Drops empty frames from device glitches. See Safety Guards.
`max_chunk_seconds`	Max Single Chunk Length (s)	`10`	No	Maximum duration of a single chunk in seconds. Drops oversized buffers from device stalls. See Safety Guards.

For most use cases — including long conference calls, book readings, and continuous monitoring — the defaults work out of the box. You do not need to change any advanced settings.
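The defaults in the table above can be pictured as a small configuration object. The sketch below is illustrative only: `AudioTrackConfig` and `ALLOWED_PRESETS` are hypothetical names, not part of the Cyberwave SDK, and the validation rules are assumptions inferred from the parameter descriptions.

```python
from dataclasses import dataclass

# Preset names listed in the configuration table.
ALLOWED_PRESETS = ("vad", "wake-word", "stt", "custom")


@dataclass(frozen=True)
class AudioTrackConfig:
    """Hypothetical container mirroring the node's configuration table."""

    twin_uuid: str                    # required, no default
    audio_track_id: str = "default"   # primary microphone
    sample_rate_hz: int = 16000       # expected rate; wire metadata overrides it
    channels: int = 1                 # expected channel count; wire metadata overrides it
    buffer_preset: str = "vad"
    buffer_size_s: float = 1.0        # only read when buffer_preset == "custom"
    min_samples: int = 1
    max_chunk_seconds: float = 10.0

    def __post_init__(self) -> None:
        # Assumed validation: reject a missing twin or an unknown preset.
        if not self.twin_uuid:
            raise ValueError("twin_uuid is required")
        if self.buffer_preset not in ALLOWED_PRESETS:
            raise ValueError(f"unknown buffer_preset: {self.buffer_preset!r}")


# Only the twin has to be supplied; everything else falls back to the defaults.
cfg = AudioTrackConfig(twin_uuid="twin-uuid")
print(cfg)
```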

Outputs

Every time the trigger fires, it produces these outputs for downstream nodes:

Output	Type	Description
`audio`	AUDIO	PCM S16LE int16 numpy array, mono, 16 kHz. Standard format for zero-copy pass-by-reference between nodes.
`audio_ts`	NUMBER	Timestamp of the audio sample from the data bus (epoch seconds).
`sensor`	STRING	Name of the audio sensor that produced the sample.
`sample_rate_hz`	NUMBER	Always 16000 Hz after resampling.
`channels`	NUMBER	Always 1 (mono) after downmix.
`sample_count`	NUMBER	Number of audio samples in this chunk.
`duration_s`	NUMBER	Duration of this chunk in seconds (derived from `sample_count / sample_rate_hz`).
`metadata`	OBJECT	Full transport metadata from the data bus (`sample_rate_hz`, `channels`, `encoding`, `content_type`, etc.).

See Audio in Workflows for the full audio format specification.
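As a quick illustration of how the count and duration outputs relate, the sketch below derives `sample_count` and `duration_s` from a chunk of samples. `describe_chunk` is a hypothetical helper written for this example, and a plain Python list stands in for the numpy array the node actually emits.

```python
def describe_chunk(samples, sample_rate_hz=16000, channels=1):
    """Derive the count and duration outputs for one mono chunk."""
    sample_count = len(samples)
    return {
        "sample_rate_hz": sample_rate_hz,
        "channels": channels,
        "sample_count": sample_count,
        # Same formula as the table: sample_count / sample_rate_hz.
        "duration_s": sample_count / sample_rate_hz,
    }


# 512 samples of silence at 16 kHz correspond to 32 ms of audio.
info = describe_chunk([0] * 512)
print(info["sample_count"], info["duration_s"])
```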

Buffer Presets

The microphone sends audio in small 20 ms chunks. Most downstream nodes need a larger buffer to work correctly. The Buffer Preset parameter controls how much audio is accumulated before the trigger emits:

Preset	Duration	Samples (@ 16 kHz)	Use Case
`vad`	32 ms	512	Audio Assistant / Silero VAD (default)
`wake-word`	80 ms	1,280	Wake Word Engine (OpenWakeWord internal frame size)
`stt`	4 s	64,000	Whisper / STT models
`custom`	User-defined	`int(16000 × buffer_size_s)`	Any custom duration

Choose the `wake-word` preset (80 ms) when feeding directly into a Wake Word Engine node. This matches OpenWakeWord’s internal processing frame of 1,280 samples, so the engine does not need to re-buffer internally and detection latency stays low.
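The accumulation described in this section can be sketched as a simple FIFO. This is a minimal illustration, not the node's implementation: `ChunkAccumulator` and `PRESET_SAMPLES` are hypothetical names, and emitting non-overlapping, exactly sized buffers is an assumption. Sample counts come from the preset table at 16 kHz.

```python
# Samples per emitted buffer at 16 kHz, copied from the preset table.
PRESET_SAMPLES = {"vad": 512, "wake-word": 1280, "stt": 64000}


class ChunkAccumulator:
    """FIFO that collects small chunks and emits fixed-size buffers."""

    def __init__(self, buffer_samples):
        if buffer_samples <= 0:
            raise ValueError("buffer_samples must be positive")
        self.buffer_samples = buffer_samples
        self._pending = []

    def push(self, chunk):
        """Append one chunk and return every full buffer now available."""
        self._pending.extend(chunk)
        ready = []
        while len(self._pending) >= self.buffer_samples:
            ready.append(self._pending[: self.buffer_samples])
            del self._pending[: self.buffer_samples]
        return ready


# Feed four 20 ms chunks (320 samples each at 16 kHz) into a wake-word buffer.
acc = ChunkAccumulator(PRESET_SAMPLES["wake-word"])
for i in range(4):
    buffers = acc.push([0] * 320)
    print(f"chunk {i + 1}: emitted {len(buffers)} buffer(s)")
```

With these numbers, the first three chunks only fill the FIFO and the fourth completes one 1,280-sample buffer. For a `custom` preset, the buffer size would be `int(16000 * buffer_size_s)`, as in the table.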

Supported Sample Rates

The Audio Track trigger is fully sample-rate-agnostic. The microphone driver sends 20 ms chunks regardless of the sample rate — only the number of samples per chunk changes:

Sample Rate	Samples per 20 ms chunk	Common Devices
48,000 Hz	960	USB microphones, most laptops
44,100 Hz	882	CD-quality audio interfaces
32,000 Hz	640	Some embedded boards, Bluetooth profiles
16,000 Hz	320	Telephony, low-bandwidth IoT devices
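The per-chunk sample counts in the table follow directly from the sample rate and the 20 ms chunk duration. The helper below, `samples_per_chunk`, is a hypothetical name used only to show the arithmetic.

```python
def samples_per_chunk(sample_rate_hz, chunk_ms=20.0):
    """Number of samples in one chunk of chunk_ms milliseconds."""
    return round(sample_rate_hz * chunk_ms / 1000)


for rate in (48000, 44100, 32000, 16000):
    print(rate, samples_per_chunk(rate))
```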

At runtime, the node reads the actual sample rate from wire metadata and uses it for all duration calculations. If the wire rate differs from the configured sample_rate_hz, a warning is logged and the wire value takes precedence.

Wire Metadata Validation

When the actual microphone sample rate or channel count differs from what you configured, the Audio Track node:

- Logs a warning with both the configured and actual values
- Publishes a twin alert (`audio_track_mismatch`, severity `warning`) so you can see the mismatch in the Cyberwave UI
- Uses the actual wire value for all processing; the configured value is only a fallback when metadata is missing

This means you can leave sample_rate_hz at the default 16000 and the node will still work correctly with a 48 kHz microphone — it just generates a one-time mismatch warning.
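The fallback and one-time warning behaviour described above could look roughly like the sketch below. `FormatResolver` is a hypothetical class, the `alerts` list is a stand-in for publishing a twin alert, and the alert fields beyond the `audio_track_mismatch` name and `warning` severity are assumptions.

```python
import logging

logger = logging.getLogger("audio_track")


class FormatResolver:
    """Pick the effective sample rate and channel count for a stream."""

    def __init__(self, configured_rate=16000, configured_channels=1):
        self.configured = {
            "sample_rate_hz": configured_rate,
            "channels": configured_channels,
        }
        self.alerts = []       # stand-in for published twin alerts
        self._warned = set()   # fields already reported once

    def resolve(self, metadata):
        resolved = {}
        for field, configured in self.configured.items():
            actual = metadata.get(field)
            if actual is None:
                # Metadata missing: fall back to the configured value.
                resolved[field] = configured
                continue
            if actual != configured and field not in self._warned:
                self._warned.add(field)
                logger.warning(
                    "%s mismatch: configured=%s actual=%s",
                    field, configured, actual,
                )
                self.alerts.append({
                    "type": "audio_track_mismatch",
                    "severity": "warning",
                    "field": field,
                    "configured": configured,
                    "actual": actual,
                })
            # The wire value always wins when it is present.
            resolved[field] = actual
        return resolved


resolver = FormatResolver()
print(resolver.resolve({"sample_rate_hz": 48000, "channels": 1}))
print(resolver.resolve({}))
```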

Safety Guards

The Audio Track trigger includes two optional guards that filter out degenerate chunks before they reach downstream nodes. Both are under Advanced settings in the inspector.

Skip Empty Audio (`min_samples`)

Default: 1 — accept everything except completely empty (0-sample) frames. Empty frames are never intentional audio. They occur when:

- A USB microphone briefly disconnects and reconnects
- The audio driver has a buffer underrun (CPU spike, thermal throttling)
- A Bluetooth mic drops a packet
- The container starts before the hardware stream is fully open

A value of 1 filters these out without rejecting any real audio. You almost never need to change this.

Max Single Chunk Length (`max_chunk_seconds`)

Default: 10 seconds — reject any individual chunk longer than 10 s. Normal chunks are ~20 ms. An oversized chunk only appears when the device stalls and then flushes a large backlog at once. Sending such a chunk into downstream processing (VAD, STT) could cause memory spikes or crashes.

This does not limit total stream duration. The workflow receives chunks indefinitely — this guard only catches abnormally large individual buffers.
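Taken together, the two guards amount to a pair of per-chunk checks. The function below is a minimal sketch under that reading; `passes_guards` is a hypothetical name, and treating a chunk of exactly `max_chunk_seconds` as acceptable is an assumption.

```python
def passes_guards(sample_count, sample_rate_hz,
                  min_samples=1, max_chunk_seconds=10.0):
    """Return True when a single chunk should reach downstream nodes."""
    # Skip Empty Audio: drop chunks with fewer than min_samples samples.
    if sample_count < min_samples:
        return False
    # Max Single Chunk Length: drop chunks longer than max_chunk_seconds.
    if sample_count / sample_rate_hz > max_chunk_seconds:
        return False
    return True


print(passes_guards(0, 16000))        # empty frame
print(passes_guards(320, 16000))      # normal 20 ms chunk
print(passes_guards(320000, 16000))   # 20 s backlog flushed after a stall
```

Because each call looks at one chunk in isolation, the checks place no bound on how long the overall stream runs.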

Typical Pipelines

Voice Assistant (VAD + STT)

Audio Track → Audio Assistant (streaming) → Call Model (Whisper)

The Audio Assistant uses Silero VAD to detect speech boundaries in the continuous chunk stream and outputs clean utterances for STT. It handles sample rate conversion (resampling to 16 kHz) and stereo-to-mono downmixing automatically.
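To make the downmixing step concrete, the sketch below averages interleaved stereo samples into mono. It is not the Audio Assistant's code: `downmix_to_mono` is a hypothetical helper, the interleaved sample layout is an assumption, and resampling is left out because a faithful version needs a proper low-pass filter.

```python
def downmix_to_mono(interleaved, channels=2):
    """Average each frame of interleaved int16 samples into one mono sample."""
    if channels == 1:
        return list(interleaved)
    if len(interleaved) % channels:
        raise ValueError("sample count is not a multiple of the channel count")
    return [
        # Floor division keeps the result an integer within the int16 range.
        sum(interleaved[i:i + channels]) // channels
        for i in range(0, len(interleaved), channels)
    ]


# Two stereo frames: (left, right), (left, right).
print(downmix_to_mono([100, 300, -200, 200]))
```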

Direct STT (no VAD)

Audio Track → Call Model (Whisper)

Every chunk hits the STT model — including silence and noise. Only useful for short, command-style audio where you know speech is always present.

Wake Word + STT

Audio Track (wake-word preset) → Wake Word Engine → Call Model (Whisper)

The Audio Track emits 80 ms chunks (1,280 samples at 16 kHz), which match OpenWakeWord’s internal frame size. The Wake Word Engine detects the trigger phrase and then streams buffered audio chunks to Whisper for transcription.

Acoustic Monitoring

Audio Track → Audio Assistant (Sound Security Guard) → Send Alert

Future pipeline for detecting glass breaks, alarms, or distress calls using acoustic event detection (AED). Sound Security Guard (SSG) mode is currently a skeleton: audio is forwarded unchanged.

Edge-Only Execution

The Audio Track trigger generates a Python worker function that runs directly on the edge device:

@cw.on_audio("twin-uuid", sensor="audio")
def run(audio, ctx, client=None):
    # audio: numpy array (PCM samples)
    # ctx.metadata: wire transport metadata
    # ctx.timestamp: sample timestamp
    # ctx.sensor_name: sensor identifier
    ...

The worker is compiled by the workflow code assembler and synced to the edge via `cyberwave workflow sync` or automatic periodic sync. No audio data leaves the device unless a downstream node explicitly sends it (e.g. to a cloud STT API).

Next Steps

Audio Assistant

Audio Assistant Technical Reference
