Standard Audio Format
Every audio node in a workflow produces and consumes audio in the following format:

| Property | Value |
|---|---|
| Container | NumPy array (dtype=int16) |
| Encoding | PCM S16LE (signed 16-bit little-endian) |
| Sample Rate | 16,000 Hz |
| Channels | 1 (mono) |
| Key name | audio |
Why This Format
- Zero-copy between nodes: NumPy int16 arrays are passed by reference. A 10-second buffer (320,000 bytes) moves between nodes as a lightweight pointer — no data duplication.
- Universal compatibility: The speech models used in these pipelines (Whisper, Silero VAD, openWakeWord) all operate on 16 kHz mono audio, so no resampling is needed before inference.
- Minimal latency: No encoding/decoding step between nodes.
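The size and pass-by-reference claims above can be checked with a few lines of NumPy. This is a minimal sketch: `passthrough` is a hypothetical stand-in for a node that forwards audio unchanged, not part of any real API.

```python
import numpy as np

SAMPLE_RATE_HZ = 16_000  # standard rate from the format table


def passthrough(audio: np.ndarray) -> np.ndarray:
    """Hypothetical node that forwards audio without touching it."""
    return audio


# Ten seconds of silence in the standard format (int16, mono).
buffer = np.zeros(10 * SAMPLE_RATE_HZ, dtype=np.int16)
forwarded = passthrough(buffer)

buffer_nbytes = buffer.nbytes      # 160,000 samples * 2 bytes each
same_object = forwarded is buffer  # only a reference was handed over
```

Here `buffer_nbytes` is 320,000 and `same_object` is `True`: the receiving side holds the same array object, so no sample data was duplicated.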
Automatic Format Adaptation
Every audio node includes an ingress adapter that automatically handles incoming audio regardless of format:

| Incoming Format | What Happens | Performance |
|---|---|---|
| int16 numpy, 16 kHz, mono | Zero-copy passthrough (same object) | Fastest |
| int16 numpy, other rate | Resampled to 16 kHz | Fast |
| int16 numpy, stereo | Downmixed to mono | Fast |
| float32 numpy | Scaled to int16 (× 32768) | Fast |
| WAV bytes (RIFF header) | Parsed, decoded, resampled if needed | Moderate |
| Raw PCM bytes (no header) | Interpreted as int16, resampled if needed | Fast |
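
The NumPy rows of the table can be sketched as a single conversion function. This is an illustration, not the adapter's actual code: the name `to_standard` is hypothetical, stereo input is assumed to be shaped `(frames, channels)`, linear interpolation stands in for whatever resampler the real adapter uses, and the clipping step is an assumption added so that a float sample of +1.0 does not overflow int16.

```python
import numpy as np

TARGET_RATE_HZ = 16_000


def to_standard(audio: np.ndarray, sample_rate_hz: int) -> np.ndarray:
    """Convert a NumPy buffer to int16 / 16 kHz / mono (illustrative only)."""
    # Fast path: already standard, so return the same object (zero-copy).
    if audio.dtype == np.int16 and audio.ndim == 1 and sample_rate_hz == TARGET_RATE_HZ:
        return audio

    samples = audio.astype(np.float32)
    if audio.dtype == np.int16:
        samples /= 32768.0  # bring int16 into the [-1.0, 1.0) float range

    # Stereo (frames, channels) -> mono by averaging the channels.
    if samples.ndim == 2:
        samples = samples.mean(axis=1)

    # Resample with linear interpolation (a simple stand-in for a real resampler).
    if sample_rate_hz != TARGET_RATE_HZ:
        n_out = int(round(len(samples) * TARGET_RATE_HZ / sample_rate_hz))
        src_t = np.arange(len(samples)) / sample_rate_hz
        dst_t = np.arange(n_out) / TARGET_RATE_HZ
        samples = np.interp(dst_t, src_t, samples)

    # Scale by 32768 and clip so that +1.0 does not overflow int16.
    return np.clip(samples * 32768.0, -32768, 32767).astype(np.int16)
```

Only the first branch is zero-copy; every other path allocates a new array, which matches the "Fastest" versus "Fast" distinction in the table.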
Metadata Alongside Audio
Audio is never passed as a naked array. Every node outputs a dictionary with the audio and its metadata.
The Audio Pipeline
A typical audio-to-text pipeline chains the nodes below. What each node does:
- Audio Track: Receives raw 20 ms chunks from the edge microphone via Zenoh. Accumulates them in a FIFO buffer and emits when the buffer is full. Normalizes to int16 @ 16 kHz mono. Buffer presets:
  - `voice-assistant`: 512 samples (32 ms), for Audio Assistant / Silero VAD
  - `wake-word`: 1,280 samples (80 ms), matches openWakeWord's internal frame size
  - `speech-to-text`: 64,000 samples (4 s), for direct STT input
  - `custom`: user-defined duration in seconds

  Outputs the `audio` key plus metadata (`sample_rate_hz`, `channels`, `duration_s`, `sample_count`).
- Audio Assistant: Runs Silero VAD on the continuous stream. Produces two signals on every chunk:
  - `speech_probability` (float 0.0-1.0): The raw VAD confidence, usable as a continuous trigger or logic gate
  - `audio` (int16 numpy): Only emitted when a complete speech segment is detected (speech end)

  Output buffer presets:
  - `wake-word`: 1,280 samples (80 ms), for feeding a Wake Word Engine
  - `speech-to-text`: 64,000 samples (4 s), for Whisper/STT models
  - `custom`: user-defined buffer size in seconds
  - `none` (default): pass the full speech segment as-is
- Wake Word Engine: Listens for a configurable trigger phrase using openWakeWord. Users select one or more pre-trained models (`alexa`, `hey_mycroft`, `hey_jarvis`, `hey_rhasspy`, `weather`, `timer`) from a searchable multi-select; the engine activates on any of them when that model's score ≥ a single Detection Threshold (default 0.5 for all models). Custom `.onnx` model support is coming soon. Pre-trained models are downloaded automatically on the edge device via `openwakeword.utils.download_models()`. Accepts PCM arrays, raw bytes, WAV bytes, or WAV file paths as input. Internally buffers input into exact 80 ms (1,280-sample) frames for optimal openWakeWord prediction accuracy. After detection, streams fixed-size audio chunks to downstream nodes until silence is detected. Output chunk size is configurable:
  - `voice-assistant` preset: 512 samples (32 ms), for feeding another VAD stage
  - `wake-word` preset: 1,280 samples (80 ms), openWakeWord native frame
  - `speech-to-text` preset: 64,000 samples (4 s), optimized for Whisper/STT
  - `custom` preset: user-defined buffer size in seconds

  The default output preset is `wake-word` (80 ms); use `speech-to-text` when Whisper follows. An `audio` buffer is emitted per wake session (wake phrase trimmed via Wake Word Trim). Wire Call Model STT to the `audio` output.

  Input validation: The Wake Word Engine validates that its audio input meets openWakeWord requirements (16 kHz, int16/float32, mono, raw PCM). If the upstream audio cannot be adapted to meet these specs, the node raises an error with a detailed explanation.
- Call Model (Whisper): Receives int16 audio, passes it (with `sample_rate_hz` and `channels` metadata) to the model's `predict()` function. Outputs the transcription as a string.
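The buffering behaviour described for Audio Track (accumulate small chunks, emit when a preset-sized buffer is full, attach metadata) can be sketched as follows. `ChunkBuffer` is a hypothetical helper, and the flat dictionary layout is an assumption; only the key names and preset sizes come from the list above.

```python
import numpy as np

SAMPLE_RATE_HZ = 16_000
# Preset sizes in samples, taken from the node descriptions above.
PRESETS = {"voice-assistant": 512, "wake-word": 1_280, "speech-to-text": 64_000}


class ChunkBuffer:
    """Accumulates small chunks and emits fixed-size payloads (hypothetical helper)."""

    def __init__(self, preset: str) -> None:
        self.size = PRESETS[preset]
        self._pending = np.empty(0, dtype=np.int16)

    def push(self, chunk: np.ndarray) -> list[dict]:
        """Append a chunk and return every full buffer that became available."""
        self._pending = np.concatenate([self._pending, chunk])
        payloads = []
        while len(self._pending) >= self.size:
            audio = self._pending[: self.size]
            self._pending = self._pending[self.size :]
            payloads.append({
                "audio": audio,
                "sample_rate_hz": SAMPLE_RATE_HZ,
                "channels": 1,
                "duration_s": len(audio) / SAMPLE_RATE_HZ,
                "sample_count": len(audio),
            })
        return payloads
```

With the `wake-word` preset, four 20 ms microphone chunks (320 samples each) fill exactly one 1,280-sample buffer, so the fourth `push` returns one payload with `duration_s` equal to 0.08. The same framing idea applies wherever a node needs exact 80 ms frames.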
Voice → robot command (optional tail)
After STT, add controller-aware matching and MQTT dispatch:

| Node | Role |
|---|---|
| Twin | Same twin as Virtual Controller; exposes valid commands from assigned policy |
| Fuzzy Matcher | Maps the uncertain string to the best source-of-truth entry (match, match_string, score) |
| Virtual Controller | Resolves label/actuation and publishes cyberwave/twin/{uuid}/command |

Gate `fuzzy_matcher.match` with a Conditional node before Virtual Controller so unrelated speech does not dispatch commands. See Fuzzy Matcher and Virtual Controller.
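The match-then-gate idea can be sketched with the standard library alone. This is not the Fuzzy Matcher's implementation: `difflib.SequenceMatcher` stands in for the real matching library, the 0.6 threshold and the 0 to 1 score scale are illustrative assumptions, and treating `match` as a boolean is an assumption about the output shape.

```python
from difflib import SequenceMatcher


def fuzzy_match(text: str, commands: list[str], threshold: float = 0.6) -> dict:
    """Pick the closest command label; difflib stands in for the real matcher."""
    best, best_score = "", 0.0
    for command in commands:
        score = SequenceMatcher(None, text.lower(), command.lower()).ratio()
        if score > best_score:
            best, best_score = command, score
    return {"match": best_score >= threshold, "match_string": best, "score": best_score}


def should_dispatch(result: dict) -> bool:
    """Conditional gate: only matched speech reaches the controller."""
    return bool(result["match"])
```

For example, with the commands `["move forward", "turn left", "stop"]`, the transcript "move forwards" passes the gate with `match_string` set to "move forward", while "what is the weather today" scores below the threshold and is dropped before it can reach the controller.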
ML model environment
Call Model nodes receive both the audio array and its metadata. The model's predict call always knows the format.
Connecting Nodes
All audio nodes use the same `audio` key for input and output. In the workflow editor, you simply connect one node's output to the next node's input; no format configuration is needed.
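Because every node reads and writes the same `audio` key, a chain reduces to feeding one node's output dictionary into the next. The sketch below illustrates that idea outside the workflow editor. All function names are hypothetical, and the `predict` signature is an assumption based on the metadata names used on this page.

```python
from typing import Callable

import numpy as np

Payload = dict  # {"audio": np.ndarray, "sample_rate_hz": int, "channels": int, ...}
Node = Callable[[Payload], Payload]


def gain_node(payload: Payload) -> Payload:
    """Hypothetical processing node: halves the amplitude, keeps the metadata."""
    return {**payload, "audio": payload["audio"] // 2}


def predict(audio: np.ndarray, sample_rate_hz: int, channels: int) -> str:
    """Hypothetical model entry point; a real model would transcribe here."""
    return f"{len(audio)} samples @ {sample_rate_hz} Hz, {channels} ch"


def call_model_node(payload: Payload) -> Payload:
    """Hypothetical Call Model wrapper: forwards audio plus metadata to predict()."""
    text = predict(payload["audio"], payload["sample_rate_hz"], payload["channels"])
    return {**payload, "text": text}


def run_chain(payload: Payload, nodes: list[Node]) -> Payload:
    """Feed each node's output dictionary into the next node."""
    for node in nodes:
        payload = node(payload)
    return payload
```

Calling `run_chain(source, [gain_node, call_model_node])` with a standard-format payload needs no per-connection configuration, because each stage finds the array and its metadata under the same keys.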
Compile-Time Validation
When you compile a workflow for edge (run_on_edge: true), the Django compile server imports the libraries each node needs and fails with an actionable message if a package is missing (same pattern as Wake Word Engine).
| Node / model | Compile-server packages |
|---|---|
| Wake Word Engine | openwakeword, onnxruntime |
| Fuzzy Matcher | rapidfuzz |
| Audio Assistant | silero-vad (+ torch transitive) |
| Call Model — whisper.cpp | pywhispercpp |
| Call Model — faster-whisper | faster-whisper |
| Twin / Virtual Controller | httpx, paho-mqtt (base deps) |
| Audio Track | no ML import at compile |

Compile-server packages are declared in `cyberwave-backend/requirements/base.txt`. Rebuild the Django image after changes. Full tables, SDK extras, and slim worker builds: see Edge workflow dependencies.
The assembler also validates that audio-consuming nodes have an audio-producing upstream and that the chain is topologically valid.
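The checks described in this section can be approximated with the standard library. This is a sketch of the pattern, not the compile server's code: the function names and the `REQUIRED_IMPORTS` mapping are hypothetical, the module names are assumed to match the package names, and the upstream check only looks at direct edges.

```python
from graphlib import CycleError, TopologicalSorter
from importlib.util import find_spec

# Top-level import names per node type (hypothetical mapping).
REQUIRED_IMPORTS = {
    "wake_word_engine": ["openwakeword", "onnxruntime"],
    "fuzzy_matcher": ["rapidfuzz"],
    "audio_track": [],
}


def missing_packages(import_names: list[str]) -> list[str]:
    """Return the top-level modules that cannot be found in this environment."""
    return [name for name in import_names if find_spec(name) is None]


def check_node(node_type: str, import_names: list[str]) -> None:
    """Fail with an actionable message when a node's packages are missing."""
    missing = missing_packages(import_names)
    if missing:
        raise RuntimeError(
            f"{node_type}: missing packages {missing}; "
            "add them to the compile server requirements and rebuild the image"
        )


def check_audio_upstream(edges: list[tuple[str, str]], producers: set[str],
                         consumers: set[str]) -> list[str]:
    """Return audio consumers that no audio producer feeds directly."""
    fed = {dst for src, dst in edges if src in producers}
    return sorted(consumers - fed)


def is_acyclic(edges: list[tuple[str, str]]) -> bool:
    """True when the workflow graph has a valid topological order."""
    graph: dict[str, set[str]] = {}
    for src, dst in edges:
        graph.setdefault(dst, set()).add(src)
    try:
        tuple(TopologicalSorter(graph).static_order())
        return True
    except CycleError:
        return False
```

`find_spec` only locates a module without importing it, which is cheaper than the real import the compile server performs but will not catch packages that are installed yet broken.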
Supported Input Formats (per node)
Every audio node accepts these formats as input:

| Format | Example Source |
|---|---|
| NumPy int16 | Audio Track output, other audio nodes |
| NumPy float32 | Custom code nodes, ML model outputs |
| WAV bytes | File uploads, cloud API responses |
| Raw PCM bytes | WebSocket streams, raw microphone drivers |
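
Telling the four input formats apart mostly comes down to a type check and a look at the first bytes. The sketch below is one way to do it, under stated assumptions: the function names are hypothetical, WAV data is assumed to be 16-bit mono, and raw PCM carries no sample rate, so the caller has to supply it separately.

```python
import io
import wave
from typing import Optional

import numpy as np


def describe_input(data) -> str:
    """Classify an incoming buffer into one of the four supported formats."""
    if isinstance(data, np.ndarray):
        if data.dtype == np.int16:
            return "numpy-int16"
        if data.dtype == np.float32:
            return "numpy-float32"
        raise TypeError(f"unsupported dtype: {data.dtype}")
    if isinstance(data, (bytes, bytearray)):
        # WAV files start with a RIFF header whose form type is WAVE.
        if data[:4] == b"RIFF" and data[8:12] == b"WAVE":
            return "wav-bytes"
        return "raw-pcm-bytes"
    raise TypeError(f"unsupported input type: {type(data).__name__}")


def decode_bytes(data: bytes) -> tuple[np.ndarray, Optional[int]]:
    """Decode WAV or raw PCM bytes to int16 samples; rate is None for raw PCM."""
    if describe_input(data) == "wav-bytes":
        with wave.open(io.BytesIO(data), "rb") as wav:
            frames = wav.readframes(wav.getnframes())
            return np.frombuffer(frames, dtype="<i2"), wav.getframerate()
    # Headerless bytes are interpreted directly as little-endian int16.
    return np.frombuffer(data, dtype="<i2"), None
```

`np.frombuffer` wraps the byte string without copying it, so the resulting array is read-only; a node that needs to modify samples would have to copy first.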
Error Handling
When a node receives audio it cannot process, it:
- Logs an error with format details (detected type, rate, channels)
- Publishes an alert to the twin (`alert_type='audio_format_incompatible'`) visible in the Cyberwave UI
- Skips the frame gracefully; the workflow continues with the next chunk
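The log, alert, skip sequence can be sketched as a small wrapper around a node's processing function. Everything here is illustrative: `process_frame` and the `publish_alert` callback are hypothetical, the alert payload beyond `alert_type` is an assumption, and for brevity the sketch rejects anything that is not already in the standard format instead of trying to adapt it first.

```python
import logging
from typing import Callable, Optional

import numpy as np

logger = logging.getLogger("audio_node")


def process_frame(
    payload: dict,
    handler: Callable[[np.ndarray], str],
    publish_alert: Callable[[dict], None],
) -> Optional[str]:
    """Run handler on one frame; on bad input, log, alert, and skip the frame."""
    audio = payload.get("audio")
    rate = payload.get("sample_rate_hz")
    channels = payload.get("channels")
    ok = (
        isinstance(audio, np.ndarray)
        and audio.dtype == np.int16
        and rate == 16_000
        and channels == 1
    )
    if not ok:
        detected = getattr(audio, "dtype", type(audio).__name__)
        logger.error("incompatible audio: type=%s rate=%s channels=%s", detected, rate, channels)
        publish_alert({"alert_type": "audio_format_incompatible", "detected": str(detected)})
        return None  # skip this frame; the caller moves on to the next chunk
    return handler(audio)
```

Returning `None` instead of raising keeps the stream alive: a loop that calls `process_frame` for each chunk simply continues with the next one.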
Performance Notes
- Zero-copy is the fast path: When the upstream node outputs int16 @ 16 kHz mono (the standard), downstream nodes receive the exact same memory — no allocation, no copy.
- Avoid unnecessary WAV wrapping: If your pipeline is fully edge-side, keep audio as int16 numpy throughout. Only convert to WAV when sending to external APIs.
- NumPy slicing is zero-copy too: `audio[start:end]` creates a view, not a copy. The Wake Word Engine uses this for its lookback buffer.
- Buffered streaming: The Wake Word Engine outputs fixed-size chunks (not one giant array). This keeps memory bounded and allows downstream STT to start processing immediately as chunks arrive.
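The view behaviour of slicing is easy to verify, and it has one consequence worth keeping in mind: writes through a view reach the original buffer. A short demonstration follows; the 1,280-sample lookback length is only an illustrative choice.

```python
import numpy as np

audio = np.arange(16_000, dtype=np.int16)   # one second at 16 kHz
lookback = audio[-1_280:]                   # last 80 ms, as a view

shares = np.shares_memory(audio, lookback)  # the slice reuses audio's memory
owns_data = lookback.flags["OWNDATA"]       # the view does not own its buffer

# Writing through the view changes the original array as well.
lookback[0] = -1
first_changed = bool(audio[-1_280] == -1)

# An explicit copy is needed when a consumer wants an independent buffer.
independent = audio[-1_280:].copy()
independent_shares = np.shares_memory(audio, independent)
```

Here `shares` is `True`, `owns_data` is `False`, and `independent_shares` is `False`. Nodes that mutate audio in place should therefore copy first, so that upstream buffers shared by reference stay intact.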
Edge dependencies
Install SDK extras on the edge host (or `cyberwave-edge-core[...]` passthroughs). edge-sync returns `model_requirements` from node emitters and Call Model catalog metadata.
| Node / model | edge-sync extra | Install |
|---|---|---|
| Audio Track | — | cyberwave[zenoh] on worker; cyberwave[microphone] on capture host |
| Audio Assistant | ml-audio | cyberwave[ml-audio] |
| Wake Word Engine | ml-wakeword | cyberwave[ml-wakeword] |
| Fuzzy Matcher | fuzzy-match | cyberwave[fuzzy-match] |
| Call Model STT | ml-stt or ml-stt-faster | Per catalog model |
| Twin / Virtual Controller | — | Base cyberwave (API + MQTT) |

To install every ML extra at once, use `ml-all`. See Edge workflow dependencies.
Related Pages
- Edge dependencies: Compile server, SDK extras, STT catalog
- Call Model STT: Whisper.cpp and Faster Whisper on edge
- Audio Track: Edge microphone trigger node
- Audio Assistant: VAD-powered speech segmentation
- Wake Word Engine: Voice-activated trigger gate
- Twin: Read twin + controller policy for command candidates
- Fuzzy Matcher: Map noisy STT text to command labels
- Virtual Controller: Dispatch matched commands to the twin over MQTT