All Cyberwave audio workflow nodes share a standard audio format for inter-node communication. This page explains the format, how nodes adapt incoming audio, and best practices for building audio pipelines.
Standard Audio Format
Every audio node in a workflow produces and consumes audio using:

| Property | Value |
|---|---|
| Container | NumPy array (dtype=int16) |
| Encoding | PCM S16LE (signed 16-bit little-endian) |
| Sample Rate | 16,000 Hz |
| Channels | 1 (mono) |
| Key name | audio |
Why This Format
- Zero-copy between nodes: NumPy int16 arrays are passed by reference. A 10-second buffer (320,000 bytes) moves between nodes as a lightweight pointer — no data duplication.
- Universal compatibility: The speech models used in these pipelines (Whisper, Silero VAD, OpenWakeWord) all operate on 16 kHz mono audio, so nothing has to be resampled before inference.
- Minimal latency: No encoding/decoding step between nodes.
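The size arithmetic and the pass-by-reference behaviour in the list above can be checked with a few lines of NumPy. This is only an illustrative sketch: `downstream_node` is a hypothetical stand-in for a workflow node, not part of the Cyberwave API.

```python
import numpy as np

SAMPLE_RATE_HZ = 16_000

# Ten seconds of silence in the standard format (int16, mono).
buffer = np.zeros(10 * SAMPLE_RATE_HZ, dtype=np.int16)

# 10 s * 16,000 samples/s * 2 bytes per int16 sample = 320,000 bytes.
buffer_bytes = buffer.nbytes


def downstream_node(payload: dict) -> np.ndarray:
    """Hypothetical consumer: hands back the array it received, untouched."""
    return payload["audio"]


received = downstream_node({"audio": buffer})

# Only a reference crossed the call boundary, so both names point at one array.
same_object = received is buffer
```

Because both sides hold the same array, a node that modifies the audio in place also changes what every other holder sees. A node that needs to alter samples should work on a copy.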
Automatic Format Adaptation
Every audio node includes an ingress adapter that automatically handles incoming audio regardless of format:

| Incoming Format | What Happens | Performance |
|---|---|---|
| int16 numpy, 16 kHz, mono | Zero-copy passthrough (same object) | Fastest |
| int16 numpy, other rate | Resampled to 16 kHz | Fast |
| int16 numpy, stereo | Downmixed to mono | Fast |
| float32 numpy | Scaled to int16 (× 32768) | Fast |
| WAV bytes (RIFF header) | Parsed, decoded, resampled if needed | Moderate |
| Raw PCM bytes (no header) | Interpreted as int16, resampled if needed | Fast |
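The NumPy rows of the table can be sketched roughly as follows. The function name `adapt_array`, the `(frames, channels)` layout for stereo input, the linear-interpolation resampler, and the clipping step are all assumptions made for illustration. The real ingress adapter is not shown on this page and may use a different resampling method.

```python
import numpy as np

TARGET_RATE_HZ = 16_000


def adapt_array(audio: np.ndarray, sample_rate_hz: int = TARGET_RATE_HZ) -> np.ndarray:
    """Hypothetical adapter: returns int16, 16 kHz, mono audio."""
    # Fast path: already in the standard format, so return the very same object.
    if audio.dtype == np.int16 and audio.ndim == 1 and sample_rate_hz == TARGET_RATE_HZ:
        return audio

    # Work in floating point; int16 input is mapped to [-1.0, 1.0) first.
    # Any non-int16 input is assumed to be float audio already in that range.
    samples = audio.astype(np.float32)
    if audio.dtype == np.int16:
        samples /= 32768.0

    # Downmix: assume shape (frames, channels) and average the channels.
    if samples.ndim == 2:
        samples = samples.mean(axis=1)

    # Resample by linear interpolation (a simple stand-in with no anti-alias filter).
    if sample_rate_hz != TARGET_RATE_HZ:
        n_out = int(round(len(samples) * TARGET_RATE_HZ / sample_rate_hz))
        src_t = np.arange(len(samples)) / sample_rate_hz
        dst_t = np.arange(n_out) / TARGET_RATE_HZ
        samples = np.interp(dst_t, src_t, samples)

    # Scale by 32768 as in the table, clipping so +1.0 cannot overflow int16.
    return np.clip(samples * 32768.0, -32768, 32767).astype(np.int16)


standard = np.zeros(512, dtype=np.int16)
passthrough = adapt_array(standard)            # fast path, no copy

stereo_48k = np.zeros((4800, 2), dtype=np.float32)
converted = adapt_array(stereo_48k, 48_000)   # 0.1 s of stereo at 48 kHz

full_scale = adapt_array(np.array([1.0, -1.0, 0.5], dtype=np.float32))
```

The clip matters for float input: a sample of exactly +1.0 multiplied by 32768 does not fit in int16, so without clipping it would wrap around.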
Metadata Alongside Audio
Audio is never passed as a naked array. Every node outputs a dictionary that holds the audio under the `audio` key together with its metadata, such as `sample_rate_hz` and `channels`.
The Audio Pipeline
A typical audio-to-text pipeline is built from the nodes below. What each node does:
- Audio Track: Receives raw 20 ms chunks from the edge microphone via Zenoh, accumulates them in a FIFO buffer, and emits when the buffer is full. Normalizes to int16 @ 16 kHz mono. Buffer presets:
  - `vad`: 512 samples (32 ms), for Audio Assistant / Silero VAD
  - `wake-word`: 1,280 samples (80 ms), matches OpenWakeWord's internal frame size
  - `stt`: 64,000 samples (4 s), for direct STT input
  - `custom`: user-defined duration in seconds

  Emits the `audio` key plus metadata (`sample_rate_hz`, `channels`, `duration_s`, `sample_count`).
- Audio Assistant: Runs Silero VAD on the continuous stream. Produces two signals:
  - `speech_probability` (float 0.0-1.0): The raw VAD confidence, emitted on every chunk. Use it as a continuous trigger or logic gate.
  - `audio` (int16 numpy): Only emitted when a complete speech segment is detected (speech end).
- Wake Word Engine: Listens for a configurable trigger phrase using OpenWakeWord. Users can select from pre-trained models (alexa, hey_mycroft, hey_jarvis, etc.) or provide a custom `.onnx` model. After detection, streams fixed-size audio chunks to downstream nodes until silence is detected. Output chunk size is configurable:
  - `vad` preset: 512 samples (32 ms), for feeding another VAD stage
  - `stt` preset: 64,000 samples (4 s), optimized for Whisper/STT (default)
  - `custom` preset: user-defined buffer size in seconds

  Chunks are emitted under the `audio` key. The final chunk before silence timeout is marked with `is_final_chunk: true` so downstream STT can finalize transcription.
- Call Model (Whisper): Receives int16 audio and passes it (with `sample_rate_hz` and `channels` metadata) to the model's `predict()` function. Outputs the transcription as a string.
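The Audio Track buffering described above, and the dictionary it emits, can be sketched as a small FIFO. The class name `ChunkBuffer` and its `push` method are hypothetical. The preset sizes and metadata keys are the ones listed in this section, while the `custom` preset is omitted for brevity.

```python
import numpy as np

SAMPLE_RATE_HZ = 16_000

# Preset sizes in samples, taken from the Audio Track description.
BUFFER_PRESETS = {"vad": 512, "wake-word": 1_280, "stt": 64_000}


class ChunkBuffer:
    """Hypothetical FIFO that collects small chunks and emits fixed-size payloads."""

    def __init__(self, preset: str = "vad"):
        self.size = BUFFER_PRESETS[preset]
        self.pending = np.empty(0, dtype=np.int16)

    def push(self, chunk: np.ndarray) -> list[dict]:
        """Append one chunk and return every full payload that is now ready."""
        self.pending = np.concatenate([self.pending, chunk])
        ready = []
        while len(self.pending) >= self.size:
            audio = self.pending[:self.size]
            self.pending = self.pending[self.size:]
            ready.append({
                "audio": audio,
                "sample_rate_hz": SAMPLE_RATE_HZ,
                "channels": 1,
                "sample_count": len(audio),
                "duration_s": len(audio) / SAMPLE_RATE_HZ,
            })
        return ready


buf = ChunkBuffer("vad")
chunk_20ms = np.zeros(320, dtype=np.int16)  # 20 ms at 16 kHz

first = buf.push(chunk_20ms)    # 320 samples buffered, below the 512 threshold
second = buf.push(chunk_20ms)   # 640 buffered: one 512-sample payload, 128 kept
```

Since 20 ms chunks (320 samples) do not divide evenly into 512-sample buffers, the leftover samples stay queued and become the start of the next payload.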
ML model environment
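As a rough illustration of how a model's `predict()` function might consume the audio together with its metadata, consider the sketch below. The signature, the validation step, and the placeholder return value are assumptions. The actual Call Model interface is not reproduced on this page.

```python
import numpy as np


def predict(audio: np.ndarray, sample_rate_hz: int, channels: int) -> str:
    """Hypothetical model entry point; the real signature may differ."""
    # The metadata lets the model confirm the format instead of guessing it.
    if audio.dtype != np.int16 or sample_rate_hz != 16_000 or channels != 1:
        raise ValueError("expected int16 mono audio at 16 kHz")

    # Many speech models take float32 in [-1.0, 1.0), so convert once here.
    waveform = audio.astype(np.float32) / 32768.0

    # Placeholder for the real model call (for example a Whisper transcription).
    duration_s = len(waveform) / sample_rate_hz
    return f"<transcript of {duration_s:.1f} s of audio>"


payload = {
    "audio": np.zeros(64_000, dtype=np.int16),  # one `stt` preset chunk (4 s)
    "sample_rate_hz": 16_000,
    "channels": 1,
}
text = predict(payload["audio"], payload["sample_rate_hz"], payload["channels"])
```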
Call Model nodes receive both the audio array and its metadata. The model's predict call always knows the format.
Connecting Nodes
All audio nodes use the same `audio` key for input and output. In the workflow editor, you simply connect one node's output to the next node's input; no format configuration is needed.
Compile-Time Validation
When you compile a workflow, the assembler validates that:
- Audio-consuming nodes (Audio Assistant, Wake Word Engine) have an audio-producing upstream
- The chain is topologically valid
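Both checks can be approximated with the standard library's `graphlib`. This is a sketch under stated assumptions: the node-type labels, the `validate` function, and the choice to treat Audio Assistant and Wake Word Engine as audio producers as well as consumers are illustrative, and "topologically valid" is interpreted here as "contains no cycle". The real assembler may check more.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical node-type labels; the assembler's real identifiers are not shown here.
AUDIO_PRODUCERS = {"audio_track", "audio_assistant", "wake_word_engine"}
AUDIO_CONSUMERS = {"audio_assistant", "wake_word_engine"}


def validate(node_types: dict[str, str], upstream: dict[str, set[str]]) -> list[str]:
    """Return a list of problems; an empty list means the workflow passes."""
    problems = []
    for node, kind in node_types.items():
        sources = upstream.get(node, set())
        has_audio_source = any(node_types.get(s) in AUDIO_PRODUCERS for s in sources)
        if kind in AUDIO_CONSUMERS and not has_audio_source:
            problems.append(f"{node}: no audio-producing upstream")
    try:
        # Consuming static_order() raises CycleError if the graph has a cycle.
        tuple(TopologicalSorter(upstream).static_order())
    except CycleError:
        problems.append("workflow contains a cycle")
    return problems


# A valid chain: microphone track feeding a wake word engine.
clean = validate(
    {"mic": "audio_track", "wake": "wake_word_engine"},
    {"wake": {"mic"}},
)

# An Audio Assistant with nothing connected to its input.
orphan = validate({"vad": "audio_assistant"}, {})

# Two nodes feeding each other.
looped = validate(
    {"a": "audio_assistant", "b": "wake_word_engine"},
    {"a": {"b"}, "b": {"a"}},
)
```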
Supported Input Formats (per node)
Every audio node accepts these formats as input:

| Format | Example Source |
|---|---|
| NumPy int16 | Audio Track output, other audio nodes |
| NumPy float32 | Custom code nodes, ML model outputs |
| WAV bytes | File uploads, cloud API responses |
| Raw PCM bytes | WebSocket streams, raw microphone drivers |
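For the two byte-based rows, a decoder might tell WAV from raw PCM by looking for the RIFF header. The sketch below uses Python's `wave` module and `np.frombuffer`. The function name `decode_bytes` and the `raw_rate_hz` parameter are hypothetical, and only 16-bit PCM WAV is handled.

```python
import io
import wave

import numpy as np


def decode_bytes(data: bytes, raw_rate_hz: int = 16_000) -> tuple[np.ndarray, int, int]:
    """Hypothetical decoder: returns (int16 samples, sample rate, channel count)."""
    if data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        # WAV container: let the standard library parse the header.
        with wave.open(io.BytesIO(data), "rb") as wav:
            if wav.getsampwidth() != 2:
                raise ValueError("this sketch only handles 16-bit PCM WAV")
            rate, channels = wav.getframerate(), wav.getnchannels()
            frames = wav.readframes(wav.getnframes())
        samples = np.frombuffer(frames, dtype="<i2")
    else:
        # No header: interpret as mono PCM S16LE; the rate must come from the caller.
        rate, channels = raw_rate_hz, 1
        samples = np.frombuffer(data, dtype="<i2")
    if channels > 1:
        samples = samples.reshape(-1, channels)  # interleaved -> (frames, channels)
    return samples, rate, channels


# Build a short 8 kHz WAV in memory, then decode it and the equivalent raw bytes.
tone = (np.arange(160) * 100).astype("<i2")
wav_io = io.BytesIO()
with wave.open(wav_io, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(8_000)
    w.writeframes(tone.tobytes())

from_wav, wav_rate, wav_channels = decode_bytes(wav_io.getvalue())
from_raw, raw_rate, raw_channels = decode_bytes(tone.tobytes())
```

Headerless PCM carries no sample rate of its own, which is why the raw branch has to rely on a value supplied by the caller. Either result would still need resampling if its rate is not 16 kHz.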
Error Handling
When a node receives audio it cannot process, it:
- Logs an error with format details (detected type, rate, channels)
- Publishes an alert to the twin (`alert_type='audio_format_incompatible'`) visible in the Cyberwave UI
- Skips the frame gracefully, so the workflow continues with the next chunk
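The log, alert, skip sequence could look something like the wrapper below. Everything except the `alert_type` value is an assumption: `process_frame`, the `publish_alert` callback, and the simple type check stand in for whatever the nodes do internally.

```python
import logging

import numpy as np

log = logging.getLogger("audio_node")


def process_frame(payload: dict, handle, publish_alert):
    """Hypothetical wrapper: run `handle` on usable audio, otherwise log, alert, skip."""
    audio = payload.get("audio")
    if not isinstance(audio, (np.ndarray, bytes)):
        details = {
            "detected_type": type(audio).__name__,
            "sample_rate_hz": payload.get("sample_rate_hz"),
            "channels": payload.get("channels"),
        }
        log.error("Cannot process audio: %s", details)
        publish_alert(alert_type="audio_format_incompatible", details=details)
        return None  # skip this frame; the caller moves on to the next chunk
    return handle(audio)


alerts = []
frames = [
    {"audio": "not audio"},                     # unsupported type
    {"audio": np.zeros(512, dtype=np.int16)},   # standard chunk
]
results = [
    process_frame(f, handle=len, publish_alert=lambda **kw: alerts.append(kw))
    for f in frames
]
```

Returning `None` instead of raising keeps one bad frame from stopping the stream, which matches the "skip and continue" behaviour described above.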
Performance Notes
- Zero-copy is the fast path: When the upstream node outputs int16 @ 16 kHz mono (the standard), downstream nodes receive the exact same memory — no allocation, no copy.
- Avoid unnecessary WAV wrapping: If your pipeline is fully edge-side, keep audio as int16 numpy throughout. Only convert to WAV when sending to external APIs.
- NumPy slicing is zero-copy too: `audio[start:end]` creates a view, not a copy. The Wake Word Engine uses this for its lookback buffer.
- Buffered streaming: The Wake Word Engine outputs fixed-size chunks (not one giant array). This keeps memory bounded and allows downstream STT to start processing immediately as chunks arrive.
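The view behaviour behind these notes is easy to confirm with NumPy itself. The lookback length and chunk size below are illustrative values, not the Wake Word Engine's actual settings.

```python
import numpy as np

SAMPLE_RATE_HZ = 16_000
CHUNK = 512

stream = np.arange(SAMPLE_RATE_HZ, dtype=np.int16)  # one second of audio

# Keep the last 0.5 s as a lookback window: basic slicing returns a view.
lookback_len = SAMPLE_RATE_HZ // 2
lookback = stream[-lookback_len:]

is_view = np.shares_memory(lookback, stream)
owns_data = lookback.flags.owndata  # a view borrows its parent's memory

# Fixed-size chunking, also as views: no sample data is duplicated.
chunks = [stream[i:i + CHUNK] for i in range(0, len(stream) - CHUNK + 1, CHUNK)]

# An explicit copy is only needed when data must outlive or diverge from the buffer.
snapshot = lookback.copy()
```

Views stay valid only as long as the parent buffer is not overwritten, so a node that holds audio across buffer reuse should take a `.copy()` as in the last line.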
Related Pages
- Audio Track: Edge microphone trigger node
- Audio Assistant: VAD-powered speech segmentation
- Wake Word Engine: Voice-activated trigger gate