The Audio Track trigger fires once for every audio chunk a twin’s microphone publishes on the edge data bus. Each chunk is typically 20 ms of PCM audio — the workflow receives an infinite stream of small frames, not a single large buffer.
Audio Track does not limit the total stream length. A conference call, book reading, or any other continuous audio source runs indefinitely — each 20 ms chunk triggers the workflow independently.

Pipeline Position

Microphone (edge) → Zenoh data bus → Audio Track Trigger → [Audio Assistant / Call Model / ...]
The Audio Track trigger is edge-only. It hooks into the Cyberwave SDK’s @cw.on_audio(twin_uuid) decorator and receives decoded PCM samples from the local Zenoh data bus — no cloud round-trip.

Quick Start

  1. Add an Audio Track trigger node to your workflow
  2. Select the Twin whose microphone you want to stream
  3. (Optional) Set the Audio Track / Sensor ID if the twin has multiple microphones
  4. Connect to a downstream node — typically an Audio Assistant for VAD, or a Call Model node for direct STT
  5. Activate the workflow and sync to the edge device

Inputs (Configuration)

| Parameter | Label | Default | Required | Description |
| --- | --- | --- | --- | --- |
| twin_uuid | Twin UUID | | Yes | The twin whose audio stream triggers the workflow. |
| audio_track_id | Audio Track / Sensor ID | "default" | No | Sensor identifier on the twin. Use "default" for the primary microphone. Only needed when a twin has multiple audio sensors. |
| sample_rate_hz | Expected Sample Rate | 16000 | No | Expected sample rate in Hz. The actual rate from wire metadata overrides this at runtime, so you rarely need to change it. |
| channels | Channels | 1 | No | Expected channel count (1 = mono, 2 = stereo). Wire metadata overrides at runtime. |
| buffer_preset | Buffer Preset | "vad" | No | FIFO buffer mode: vad (32 ms), wake-word (80 ms), stt (4 s), or custom. See Buffer Presets. |
| buffer_size_s | Buffer Size (s) | 1.0 | No | Custom buffer duration. Only used when preset is custom. |
| min_samples | Skip Empty Audio | 1 | No | Minimum samples a chunk must contain. Drops empty frames from device glitches. See Safety Guards. |
| max_chunk_seconds | Max Single Chunk Length (s) | 10 | No | Maximum duration of a single chunk in seconds. Drops oversized buffers from device stalls. See Safety Guards. |
For most use cases — including long conference calls, book readings, and continuous monitoring — the defaults work out of the box. You do not need to change any advanced settings.
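As a quick reference, the defaults from the table can be written down as a small Python container. This is only an illustrative sketch: `AudioTrackConfig` is a hypothetical name, not a class from the Cyberwave SDK, and the validation rules are assumptions drawn from the table.

```python
from dataclasses import dataclass

VALID_PRESETS = ("vad", "wake-word", "stt", "custom")


@dataclass(frozen=True)
class AudioTrackConfig:
    """Hypothetical mirror of the configuration table (not an SDK class)."""

    twin_uuid: str                    # required, no default
    audio_track_id: str = "default"   # primary microphone
    sample_rate_hz: int = 16000       # fallback; wire metadata overrides it
    channels: int = 1                 # fallback; wire metadata overrides it
    buffer_preset: str = "vad"
    buffer_size_s: float = 1.0        # only used when buffer_preset == "custom"
    min_samples: int = 1
    max_chunk_seconds: float = 10.0

    def __post_init__(self):
        # Assumed validation: the table only states which values exist.
        if not self.twin_uuid:
            raise ValueError("twin_uuid is required")
        if self.buffer_preset not in VALID_PRESETS:
            raise ValueError(f"unknown buffer_preset: {self.buffer_preset!r}")


# Only the twin needs to be specified; everything else uses the defaults.
cfg = AudioTrackConfig(twin_uuid="twin-uuid")
```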

Outputs

Every time the trigger fires, it produces these outputs for downstream nodes:
| Output | Type | Description |
| --- | --- | --- |
| audio | AUDIO | PCM S16LE int16 numpy array, mono, 16 kHz. Standard format for zero-copy pass-by-reference between nodes. |
| audio_ts | NUMBER | Timestamp of the audio sample from the data bus (epoch seconds). |
| sensor | STRING | Name of the audio sensor that produced the sample. |
| sample_rate_hz | NUMBER | Always 16000 Hz after resampling. |
| channels | NUMBER | Always 1 (mono) after downmix. |
| sample_count | NUMBER | Number of audio samples in this chunk. |
| duration_s | NUMBER | Duration of this chunk in seconds (derived from sample_count / sample_rate_hz). |
| metadata | OBJECT | Full transport metadata from the data bus (sample_rate_hz, channels, encoding, content_type, etc.). |
See Audio in Workflows for the full audio format specification.
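The relationship between `sample_count`, `sample_rate_hz`, and `duration_s` can be illustrated with a few lines of standard-library Python. This is a sketch under assumptions: `describe_chunk` is a hypothetical helper, and it works on raw S16LE bytes rather than the numpy array the node actually passes.

```python
def describe_chunk(pcm_bytes: bytes, sample_rate_hz: int = 16000, channels: int = 1) -> dict:
    """Derive sample_count and duration_s from raw PCM S16LE bytes (illustrative only)."""
    bytes_per_sample = 2                      # S16LE: 16-bit signed, little-endian
    frame_bytes = bytes_per_sample * channels
    if len(pcm_bytes) % frame_bytes != 0:
        raise ValueError("buffer does not contain a whole number of samples")
    sample_count = len(pcm_bytes) // frame_bytes
    return {
        "sample_rate_hz": sample_rate_hz,
        "channels": channels,
        "sample_count": sample_count,
        "duration_s": sample_count / sample_rate_hz,
    }


# A 20 ms mono chunk at 16 kHz holds 320 samples (640 bytes).
info = describe_chunk(b"\x00\x00" * 320)
```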

Buffer Presets

The microphone sends audio in 20 ms chunks. Most downstream nodes need a larger buffer to work correctly. The Buffer Preset parameter controls how much audio is accumulated before the trigger emits a buffer downstream:
| Preset | Duration | Samples (@ 16 kHz) | Use Case |
| --- | --- | --- | --- |
| vad | 32 ms | 512 | Audio Assistant / Silero VAD (default) |
| wake-word | 80 ms | 1,280 | Wake Word Engine (OpenWakeWord internal frame size) |
| stt | 4 s | 64,000 | Whisper / STT models |
| custom | User-defined | int(16000 × buffer_size_s) | Any custom duration |
Choose the wake-word preset (80 ms) when feeding directly into a Wake Word Engine node. This matches OpenWakeWord’s internal processing frame of 1280 samples, so the engine does not need to re-buffer internally and detection latency stays low.
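The accumulation idea can be sketched as a simple FIFO that collects incoming chunks and emits fixed-size frames. This is a minimal illustration, not the node's actual implementation: `frame_size` and `ChunkAccumulator` are hypothetical names, and emitting non-overlapping frames is an assumption.

```python
PRESET_SAMPLES = {"vad": 512, "wake-word": 1280, "stt": 64000}  # at 16 kHz


def frame_size(preset: str, buffer_size_s: float = 1.0, rate_hz: int = 16000) -> int:
    """Return the number of samples per emitted frame for a preset."""
    if preset == "custom":
        return int(rate_hz * buffer_size_s)
    return PRESET_SAMPLES[preset]


class ChunkAccumulator:
    """Collects small chunks and emits non-overlapping fixed-size frames (assumed behavior)."""

    def __init__(self, frame_samples: int):
        self.frame_samples = frame_samples
        self._fifo = []

    def push(self, chunk) -> list:
        """Append a chunk; return every complete frame now available."""
        self._fifo.extend(chunk)
        frames = []
        while len(self._fifo) >= self.frame_samples:
            frames.append(self._fifo[: self.frame_samples])
            del self._fifo[: self.frame_samples]
        return frames


acc = ChunkAccumulator(frame_size("vad"))   # 512 samples per frame
first = acc.push([0] * 320)                 # 320 buffered, nothing emitted yet
second = acc.push([0] * 320)                # 640 buffered, one 512-sample frame emitted
```

Because 512 samples is not a whole multiple of a 320-sample chunk, the leftover 128 samples stay in the FIFO and become the start of the next frame.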

Supported Sample Rates

The Audio Track trigger is fully sample-rate-agnostic. The microphone driver sends 20 ms chunks regardless of the sample rate — only the number of samples per chunk changes:
| Sample Rate | Samples per 20 ms chunk | Common Devices |
| --- | --- | --- |
| 48,000 Hz | 960 | USB microphones, most laptops |
| 44,100 Hz | 882 | CD-quality audio interfaces |
| 32,000 Hz | 640 | Some embedded boards, Bluetooth profiles |
| 16,000 Hz | 320 | Telephony, low-bandwidth IoT devices |
At runtime, the node reads the actual sample rate from wire metadata and uses it for all duration calculations. If the wire rate differs from the configured sample_rate_hz, a warning is logged and the wire value takes precedence.
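The per-chunk sample counts follow directly from the sample rate and the 20 ms chunk duration. The arithmetic can be checked with a short helper (`samples_per_chunk` is a hypothetical name used only for this illustration):

```python
def samples_per_chunk(sample_rate_hz: int, chunk_ms: int = 20) -> int:
    """Samples in one chunk: rate (samples/s) times duration (s), using integer math."""
    return sample_rate_hz * chunk_ms // 1000


counts = {rate: samples_per_chunk(rate) for rate in (48000, 44100, 32000, 16000)}
```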

Wire Metadata Validation

When the actual microphone sample rate or channel count differs from what you configured, the Audio Track node:
  1. Logs a warning with both the configured and actual values
  2. Publishes a twin alert (audio_track_mismatch, severity warning) so you can see the mismatch in the Cyberwave UI
  3. Uses the actual wire value for all processing — the configured value is only a fallback when metadata is missing
This means you can leave sample_rate_hz at the default 16000 and the node will still work correctly with a 48 kHz microphone — it just generates a one-time mismatch warning.
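The precedence rule (wire value first, configured value as fallback, warn once on mismatch) might look roughly like the sketch below. `WireFormatResolver` is a hypothetical class, the metadata is assumed to be a plain dict with `sample_rate_hz` and `channels` keys, and the twin alert is left as a comment because its publishing API is not described here.

```python
import logging
from typing import Tuple

log = logging.getLogger("audio_track_sketch")


class WireFormatResolver:
    """Picks the effective format: wire metadata wins, config is the fallback."""

    def __init__(self, configured_rate_hz: int = 16000, configured_channels: int = 1):
        self.configured = (configured_rate_hz, configured_channels)
        self.warned = False

    def resolve(self, metadata: dict) -> Tuple[int, int]:
        rate = metadata.get("sample_rate_hz", self.configured[0])
        channels = metadata.get("channels", self.configured[1])
        if (rate, channels) != self.configured and not self.warned:
            # Warn only once; the real node also publishes an
            # audio_track_mismatch twin alert at this point.
            log.warning(
                "audio format mismatch: configured=%s actual=%s",
                self.configured, (rate, channels),
            )
            self.warned = True
        return rate, channels


resolver = WireFormatResolver()
effective = resolver.resolve({"sample_rate_hz": 48000, "channels": 2})
fallback = resolver.resolve({})  # metadata missing: configured values are used
```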

Safety Guards

The Audio Track trigger includes two optional guards that filter out degenerate chunks before they reach downstream nodes. Both are under Advanced settings in the inspector.

Skip Empty Audio (min_samples)

Default: 1 — accept everything except completely empty (0-sample) frames. Empty frames are never intentional audio. They occur when:
  • A USB microphone briefly disconnects and reconnects
  • The audio driver has a buffer underrun (CPU spike, thermal throttling)
  • A Bluetooth mic drops a packet
  • The container starts before the hardware stream is fully open
A value of 1 filters these out without rejecting any real audio. You almost never need to change this.

Max Single Chunk Length (max_chunk_seconds)

Default: 10 seconds — reject any individual chunk longer than 10 s. Normal chunks are ~20 ms. An oversized chunk only appears when the device stalls and then flushes a large backlog at once. Sending such a chunk into downstream processing (VAD, STT) could cause memory spikes or crashes.
This does not limit total stream duration. The workflow receives chunks indefinitely — this guard only catches abnormally large individual buffers.
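Both guards amount to a per-chunk check on sample count and duration. A minimal sketch, assuming the duration is computed from the wire sample rate (`passes_guards` is a hypothetical helper, not part of the SDK):

```python
def passes_guards(
    sample_count: int,
    sample_rate_hz: int,
    min_samples: int = 1,
    max_chunk_seconds: float = 10.0,
) -> bool:
    """Return True if a single chunk should be forwarded downstream."""
    if sample_count < min_samples:
        return False  # empty (or too short) frame from a device glitch
    if sample_count / sample_rate_hz > max_chunk_seconds:
        return False  # oversized backlog flushed after a device stall
    return True


normal = passes_guards(320, 16000)          # ordinary 20 ms chunk
empty = passes_guards(0, 16000)             # dropped by min_samples
oversized = passes_guards(160001, 16000)    # just over 10 s, dropped
```

The check is stateless, which is why it never limits the total length of the stream: each chunk is judged on its own.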

Typical Pipelines

Voice Assistant (VAD + STT)

Audio Track → Audio Assistant (streaming) → Call Model (Whisper)
The Audio Assistant uses Silero VAD to detect speech boundaries in the continuous chunk stream and outputs clean utterances for STT. It handles sample rate conversion (resampling to 16 kHz) and stereo-to-mono downmixing automatically.

Direct STT (no VAD)

Audio Track → Call Model (Whisper)
Every chunk hits the STT model — including silence and noise. Only useful for short, command-style audio where you know speech is always present.

Wake Word + STT

Audio Track (wake-word preset) → Wake Word Engine → Call Model (Whisper)
The Audio Track emits 80 ms chunks (1280 samples) which match OpenWakeWord’s internal frame size. The Wake Word Engine detects the trigger phrase and then streams buffered audio chunks to Whisper for transcription.

Acoustic Monitoring

Audio Track → Audio Assistant (Sound Security Guard) → Send Alert
Future pipeline for detecting glass breaks, alarms, or distress calls using acoustic event detection (AED). Sound Security Guard (SSG) mode is currently a skeleton: audio is forwarded unchanged.

Edge-Only Execution

The Audio Track trigger generates a Python worker function that runs directly on the edge device:
```python
@cw.on_audio("twin-uuid", sensor="audio")
def run(audio, ctx, client=None):
    # audio: numpy array (PCM samples)
    # ctx.metadata: wire transport metadata
    # ctx.timestamp: sample timestamp
    # ctx.sensor_name: sensor identifier
    ...
```
The worker is compiled by the workflow code assembler and synced to the edge via cyberwave workflow sync or automatic periodic sync. No audio data leaves the device unless a downstream node explicitly sends it (e.g. to a cloud STT API).
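To make the handler signature concrete, here is a sketch of what a function body could compute from `audio` and `ctx`. It runs without the SDK: the decorator is left as a comment, `ctx` is replaced by a stand-in object, `ctx.metadata` is assumed to behave like a dict, and the returned dictionary is purely illustrative (how the SDK uses a return value is not specified here).

```python
import math
from types import SimpleNamespace


def rms_level(audio) -> float:
    """Root-mean-square amplitude; int() avoids int16 overflow on numpy samples."""
    n = len(audio)
    if n == 0:
        return 0.0
    return math.sqrt(sum(int(s) * int(s) for s in audio) / n)


# @cw.on_audio("twin-uuid", sensor="audio")  # registration requires the SDK on an edge device
def run(audio, ctx, client=None):
    rate = ctx.metadata.get("sample_rate_hz", 16000)  # assumed dict-like metadata
    return {
        "sensor": ctx.sensor_name,
        "audio_ts": ctx.timestamp,
        "duration_s": len(audio) / rate,
        "rms": rms_level(audio),
    }


# Stand-in context for a local dry run; the real ctx comes from the SDK.
fake_ctx = SimpleNamespace(
    metadata={"sample_rate_hz": 16000}, timestamp=0.0, sensor_name="audio"
)
result = run([1000, -1000] * 160, fake_ctx)  # one synthetic 20 ms chunk
```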

Next Steps

Audio Assistant

VAD-powered speech segmentation for the audio stream

Audio Assistant Technical Reference

Output schema, resampling, and architecture details