All Cyberwave audio workflow nodes share a standard audio format for inter-node communication. This page explains the format, how nodes adapt incoming audio, and best practices for building audio pipelines.

Standard Audio Format

Every audio node in a workflow produces and consumes audio using:
Property | Value
Container | NumPy array (dtype=int16)
Encoding | PCM S16LE (signed 16-bit little-endian)
Sample Rate | 16,000 Hz
Channels | 1 (mono)
Key name | audio
This is raw 16-bit PCM — the same encoding used by telephony systems, Whisper, Silero VAD, and OpenWakeWord. No compression, no container overhead.

Why This Format

  • Zero-copy between nodes: NumPy int16 arrays are passed by reference. A 10-second buffer (320,000 bytes) moves between nodes as a lightweight pointer — no data duplication.
  • Broad compatibility: The speech models used in these workflows (Whisper, Silero VAD, OpenWakeWord) all operate on 16 kHz mono audio, so no resampling is needed before inference.
  • Minimal latency: No encoding/decoding step between nodes.
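The size and pass-by-reference claims above can be checked with a few lines of NumPy. This is a minimal sketch: `make_silence` and `passthrough` are hypothetical helpers invented for illustration, not Cyberwave APIs.

```python
import numpy as np

SAMPLE_RATE_HZ = 16_000  # standard rate from the table above


def make_silence(duration_s: float) -> np.ndarray:
    """Return a mono int16 buffer of the given duration at 16 kHz."""
    return np.zeros(int(duration_s * SAMPLE_RATE_HZ), dtype=np.int16)


def passthrough(audio: np.ndarray) -> np.ndarray:
    """Stand-in for a node that forwards audio unchanged."""
    return audio


buf = make_silence(10.0)
out = passthrough(buf)

print(buf.nbytes)  # 160,000 samples x 2 bytes = 320000
print(out is buf)  # True: the same object, nothing was copied
```

Passing the array between Python functions only hands over a reference, which is why the byte count of the buffer does not affect the cost of the hand-off.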

Automatic Format Adaptation

Every audio node includes an ingress adapter that automatically handles incoming audio regardless of format:
Incoming Format | What Happens | Performance
int16 numpy, 16 kHz, mono | Zero-copy passthrough (same object) | Fastest
int16 numpy, other rate | Resampled to 16 kHz | Fast
int16 numpy, stereo | Downmixed to mono | Fast
float32 numpy | Scaled to int16 (× 32768) | Fast
WAV bytes (RIFF header) | Parsed, decoded, resampled if needed | Moderate
Raw PCM bytes (no header) | Interpreted as int16, resampled if needed | Fast
If the format cannot be adapted (e.g. compressed MP3 bytes without a decoder), the node publishes a human-readable alert explaining what was received and what was expected.
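To make the adaptation rules concrete, here is a rough sketch of what such an ingress adapter could look like. It is not the Cyberwave implementation: the function name `to_standard` is hypothetical, stereo input is assumed to have shape (frames, channels), and resampling uses plain linear interpolation with no anti-aliasing filter, which a production adapter would likely replace with a proper resampler.

```python
import io
import wave

import numpy as np

TARGET_RATE_HZ = 16_000


def to_standard(audio, sample_rate_hz=TARGET_RATE_HZ):
    """Hypothetical ingress adapter: return mono int16 samples at 16 kHz."""
    if isinstance(audio, (bytes, bytearray)):
        data = bytes(audio)
        if data[:4] == b"RIFF":
            # WAV bytes: take the real rate and channel count from the header.
            with wave.open(io.BytesIO(data), "rb") as wf:
                if wf.getsampwidth() != 2:
                    raise ValueError("this sketch only handles 16-bit WAV")
                sample_rate_hz = wf.getframerate()
                channels = wf.getnchannels()
                frames = wf.readframes(wf.getnframes())
            audio = np.frombuffer(frames, dtype="<i2").astype(np.int16)
            if channels > 1:
                audio = audio.reshape(-1, channels)
        else:
            # Headerless bytes: interpret as little-endian int16 PCM.
            audio = np.frombuffer(data, dtype="<i2").astype(np.int16)

    audio = np.asarray(audio)
    if audio.dtype == np.int16 and audio.ndim == 1 and sample_rate_hz == TARGET_RATE_HZ:
        return audio  # fast path: already standard, same object returned

    if audio.dtype == np.int16:
        samples = audio.astype(np.float32) / 32768.0
    elif np.issubdtype(audio.dtype, np.floating):
        samples = audio.astype(np.float32)
    else:
        raise TypeError(f"unsupported audio dtype: {audio.dtype}")

    if samples.ndim == 2:
        samples = samples.mean(axis=1)  # downmix, assuming (frames, channels)

    if sample_rate_hz != TARGET_RATE_HZ and len(samples) > 0:
        n_out = int(round(len(samples) * TARGET_RATE_HZ / sample_rate_hz))
        t_in = np.arange(len(samples)) / sample_rate_hz
        t_out = np.arange(n_out) / TARGET_RATE_HZ
        samples = np.interp(t_out, t_in, samples)  # linear interpolation only

    # Scale by 32768 and clip so that +1.0 does not overflow int16.
    scaled = np.clip(np.round(samples * 32768.0), -32768, 32767)
    return scaled.astype(np.int16)
```

Note that the clipping step is an addition in this sketch: scaling a float32 sample of exactly 1.0 by 32768 would otherwise fall outside the int16 range.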

Metadata Alongside Audio

Audio is never passed as a naked array. Every node outputs a dictionary with the audio and its metadata:
{
    "audio": numpy_int16_array,      # the PCM data (zero-copy reference)
    "sample_rate_hz": 16000,         # always 16 kHz after normalization
    "channels": 1,                   # always mono after downmix
    "speech_probability": 0.87,      # VAD confidence (Audio Assistant only)
    "is_speaking": True,             # speech state flag
    "start_timestamp_sec": 1.2,      # segment boundary (when available)
    "end_timestamp_sec": 3.8,        # segment boundary (when available)
}
This means downstream nodes always know the sample rate, channel count, and timing of what they receive — even in buffer/chunk streaming scenarios.
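As an illustration of how a downstream custom node might use this metadata, the sketch below reads the documented keys and derives the buffer duration. `describe_payload` is a hypothetical helper, and it assumes the payload has already been normalized to a one-dimensional int16 array.

```python
import numpy as np


def describe_payload(payload: dict) -> str:
    """Summarize an audio payload using the keys documented above."""
    audio = payload["audio"]
    rate = payload["sample_rate_hz"]
    channels = payload["channels"]
    if audio.dtype != np.int16:
        raise TypeError(f"expected int16 audio, got {audio.dtype}")
    duration_s = len(audio) / rate  # valid for mono, 1-D buffers
    return f"{duration_s:.2f} s, {rate} Hz, {channels} ch"


payload = {
    "audio": np.zeros(16_000, dtype=np.int16),
    "sample_rate_hz": 16000,
    "channels": 1,
}
print(describe_payload(payload))  # 1.00 s, 16000 Hz, 1 ch
```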

The Audio Pipeline

A typical audio-to-text pipeline:
Microphone → Audio Track → Audio Assistant (VAD) → Wake Word Engine → Call Model (Whisper) → text

What each node does:

  1. Audio Track — Receives raw 20 ms chunks from the edge microphone via Zenoh. Accumulates them in a FIFO buffer and emits when the buffer is full. Normalizes to int16 @ 16 kHz mono. Buffer presets:
    • voice-assistant: 512 samples (32 ms) — for Audio Assistant / Silero VAD
    • wake-word: 1,280 samples (80 ms) — matches OpenWakeWord’s internal frame size
    • speech-to-text: 64,000 samples (4 s) — for direct STT input
    • custom: user-defined duration in seconds
    Outputs the buffered audio via the audio key plus metadata (sample_rate_hz, channels, duration_s, sample_count).
  2. Audio Assistant — Runs Silero VAD on the continuous stream. Produces two signals on every chunk:
    • speech_probability (float 0.0-1.0): The raw VAD confidence — a continuous trigger/logic gate
    • audio (int16 numpy): Only emitted when a complete speech segment is detected (speech end)
    This dual-output design means downstream nodes can react to the VAD trigger immediately, without waiting for the full segment extraction. Optionally, the Output Buffer Preset re-chunks the extracted speech into fixed-size frames for downstream consumption:
    • wake-word: 1,280 samples (80 ms) — for feeding a Wake Word Engine
    • speech-to-text: 64,000 samples (4 s) — for Whisper/STT models
    • custom: user-defined buffer size in seconds
    • none (default): pass the full speech segment as-is
  3. Wake Word Engine — Listens for a configurable trigger phrase using OpenWakeWord.
    Users select one or more pre-trained models (alexa, hey_mycroft, hey_jarvis, hey_rhasspy, weather, timer) from a searchable multi-select. The engine activates when any selected model's score is ≥ the Detection Threshold, a single value shared by all models (default 0.5). Custom .onnx model support is coming soon.
    Pre-trained models are downloaded automatically on the edge device via openwakeword.utils.download_models().
    Accepts PCM arrays, raw bytes, WAV bytes, or WAV file paths as input. Internally buffers input into exact 80 ms (1,280-sample) frames for optimal openWakeWord prediction accuracy.
    After detection, streams fixed-size audio chunks to downstream nodes until silence is detected. Output chunk size is configurable:
    • voice-assistant preset: 512 samples (32 ms) — for feeding another VAD stage
    • wake-word preset: 1,280 samples (80 ms) — openWakeWord native frame
    • speech-to-text preset: 64,000 samples (4 s) — optimized for Whisper/STT
    • custom preset: user-defined buffer size in seconds
    The default output preset is wake-word (80 ms); use speech-to-text when Whisper follows.
    One command-only audio buffer is emitted per wake session (wake phrase trimmed via Wake Word Trim). Wire Call Model STT to the audio output.
    Input validation: The Wake Word Engine validates that its audio input meets openWakeWord requirements (16 kHz, int16/float32, mono, raw PCM). If the upstream audio cannot be adapted to meet these specs, the node raises an error with a detailed explanation.
  4. Call Model (Whisper) — Receives int16 audio, passes it (with sample_rate_hz and channels metadata) to the model’s predict() function. Outputs the transcription as a string.
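The buffering behaviour described for the Audio Track and the Wake Word Engine (accumulate small chunks, emit fixed-size frames) can be sketched as a simple FIFO re-chunker. The class name `FrameBuffer` and its methods are hypothetical; only the preset sizes come from the list above.

```python
import numpy as np

# Preset sizes in samples at 16 kHz, taken from the presets listed above.
PRESET_SAMPLES = {
    "voice-assistant": 512,    # 32 ms
    "wake-word": 1280,         # 80 ms
    "speech-to-text": 64000,   # 4 s
}


class FrameBuffer:
    """Hypothetical FIFO that re-chunks a stream into fixed-size frames."""

    def __init__(self, frame_samples: int):
        self.frame_samples = frame_samples
        self._pending = np.zeros(0, dtype=np.int16)

    @property
    def pending(self) -> int:
        """Number of samples waiting for the next full frame."""
        return len(self._pending)

    def push(self, chunk: np.ndarray) -> list:
        """Append a chunk and return every complete frame now available."""
        self._pending = np.concatenate([self._pending, chunk])
        frames = []
        while len(self._pending) >= self.frame_samples:
            frames.append(self._pending[: self.frame_samples])
            self._pending = self._pending[self.frame_samples:]
        return frames


fb = FrameBuffer(PRESET_SAMPLES["wake-word"])
for i in range(5):
    chunk = np.full(320, i, dtype=np.int16)  # one 20 ms chunk at 16 kHz
    for frame in fb.push(chunk):
        print(len(frame))  # 1280, emitted once the fourth chunk arrives
```

Four 20 ms chunks fill exactly one 80 ms frame, so the fifth chunk stays in the buffer until more audio arrives.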

Voice → robot command (optional tail)

After STT, add controller-aware matching and MQTT dispatch:
call_model.result
  → fuzzy_matcher.query           (Uncertain String)
twin.control_actuations (or control_labels)
  → fuzzy_matcher.candidates      (Source of Truth: string or array)
fuzzy_matcher.match_string → virtual_controller.command
Node | Role
Twin | Same twin as Virtual Controller; exposes valid commands from assigned policy
Fuzzy Matcher | Maps the uncertain string to the best source-of-truth entry (match, match_string, score)
Virtual Controller | Resolves label/actuation and publishes cyberwave/twin/{uuid}/command
Branch on fuzzy_matcher.match with Conditional before Virtual Controller so unrelated speech does not dispatch commands. See Fuzzy Matcher and Virtual Controller.
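The matching and branching logic can be illustrated with a small stand-in. The Fuzzy Matcher node depends on rapidfuzz; this sketch deliberately swaps in the standard library's `difflib.SequenceMatcher` so that it runs without extra packages, which means its scores will differ from the node's. The function name `fuzzy_match` and the `threshold` parameter are assumptions for illustration; only the output keys (match, match_string, score) come from the table above.

```python
from difflib import SequenceMatcher


def fuzzy_match(query: str, candidates, threshold: float = 0.8) -> dict:
    """Pick the candidate most similar to the query (difflib stand-in)."""
    best, best_score = None, 0.0
    normalized = query.strip().lower()
    for candidate in candidates:
        score = SequenceMatcher(None, normalized, candidate.lower()).ratio()
        if score > best_score:
            best, best_score = candidate, score
    matched = best is not None and best_score >= threshold
    return {
        "match": matched,
        "match_string": best if matched else None,
        "score": best_score,
    }


commands = ["move forward", "turn left", "stop"]

result = fuzzy_match("move forwrd", commands)
if result["match"]:
    # In a workflow, this is where the Conditional would let the
    # command through to the Virtual Controller.
    print("dispatch:", result["match_string"])

ignored = fuzzy_match("what is the weather today", commands)
print(ignored["match"])  # False: unrelated speech is not dispatched
```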

ML model environment

Call Model nodes receive both the audio array and its metadata. The model’s predict call always knows the format:
model.predict(audio_int16, sample_rate_hz=16000, channels=1, twin_uuid="...")
For cloud APIs that require WAV, the ingress adapter automatically converts int16 to WAV bytes before upload.
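For reference, wrapping a standard int16 buffer in a WAV container takes only the standard library's `wave` module. This is a minimal sketch of the conversion, not the adapter's actual code; `int16_to_wav_bytes` is a hypothetical helper name.

```python
import io
import wave

import numpy as np


def int16_to_wav_bytes(audio: np.ndarray, sample_rate_hz: int = 16000,
                       channels: int = 1) -> bytes:
    """Wrap raw int16 PCM samples in an in-memory WAV (RIFF) container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wf.setframerate(sample_rate_hz)
        wf.writeframes(audio.astype("<i2").tobytes())
    return buf.getvalue()


wav_bytes = int16_to_wav_bytes(np.zeros(16_000, dtype=np.int16))
print(wav_bytes[:4])   # b'RIFF'
print(len(wav_bytes))  # 44-byte header + 32,000 bytes of PCM = 32044
```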

Connecting Nodes

All audio nodes use the same audio key for input and output. In the workflow editor, you simply connect one node’s output to the next node’s input — no format configuration needed.
Node A [audio] ──→ [audio] Node B [audio] ──→ [audio] Node C
The nodes handle format adaptation internally. You never need to insert format conversion nodes or configure sample rates manually.

Compile-Time Validation

When you compile a workflow for edge (run_on_edge: true), the Django compile server imports the libraries each node needs and fails with an actionable message if a package is missing (same pattern as Wake Word Engine).
Node / model | Compile-server packages
Wake Word Engine | openwakeword, onnxruntime
Fuzzy Matcher | rapidfuzz
Audio Assistant | silero-vad (plus torch as a transitive dependency)
Call Model (whisper.cpp) | pywhispercpp
Call Model (faster-whisper) | faster-whisper
Twin / Virtual Controller | httpx, paho-mqtt (base deps)
Audio Track | no ML import at compile
Packages live in cyberwave-backend/requirements/base.txt; rebuild the Django image after changing them. For full tables, SDK extras, and slim worker builds, see Edge workflow dependencies. The assembler also validates that every audio-consuming node has an audio-producing upstream and that the chain is topologically valid.
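The dependency check can be approximated as follows. This is a sketch under stated assumptions, not the compile server's code: the node identifiers and the import-name mapping are hypothetical (import names such as `silero_vad` or `faster_whisper` may differ from the distribution names in the table), and it uses `importlib.util.find_spec` to test availability rather than actually importing each library as the compile server does.

```python
from importlib.util import find_spec

# Hypothetical mapping from node type to top-level import names.
NODE_IMPORTS = {
    "wake_word_engine": ["openwakeword", "onnxruntime"],
    "fuzzy_matcher": ["rapidfuzz"],
    "audio_assistant": ["silero_vad"],
    "audio_track": [],  # no ML import at compile
}


def missing_packages(node_types, node_imports=NODE_IMPORTS) -> dict:
    """Return {node_type: [modules that cannot be found]} for the workflow."""
    missing = {}
    for node in node_types:
        absent = [m for m in node_imports.get(node, []) if find_spec(m) is None]
        if absent:
            missing[node] = absent
    return missing


problems = missing_packages(["audio_track", "fuzzy_matcher"])
for node, modules in problems.items():
    # An actionable message names both the node and the missing modules.
    print(f"{node}: missing {', '.join(modules)}")
```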

Supported Input Formats (per node)

Every audio node accepts these formats as input:
Format | Example Source
NumPy int16 | Audio Track output, other audio nodes
NumPy float32 | Custom code nodes, ML model outputs
WAV bytes | File uploads, cloud API responses
Raw PCM bytes | WebSocket streams, raw microphone drivers

Error Handling

When a node receives audio it cannot process:
  1. Logs an error with format details (detected type, rate, channels)
  2. Publishes an alert to the twin (alert_type='audio_format_incompatible') visible in the Cyberwave UI
  3. Skips the frame gracefully — the workflow continues with the next chunk
No crashes. No silent failures.
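The three steps above follow a familiar log, alert, and skip pattern. The sketch below shows the shape of that loop; `process_stream`, `handle`, and `publish_alert` are hypothetical names, and the alert payload layout is an assumption apart from the documented alert_type value.

```python
import logging

import numpy as np

logger = logging.getLogger("audio_node")


def process_stream(chunks, handle, publish_alert) -> int:
    """Process chunks one by one, skipping any that cannot be handled."""
    processed = 0
    for chunk in chunks:
        try:
            handle(chunk)
            processed += 1
        except (TypeError, ValueError) as exc:
            # 1. Log the details, 2. publish an alert, 3. skip this frame.
            logger.error("incompatible audio (%s): %s", type(chunk).__name__, exc)
            publish_alert({
                "alert_type": "audio_format_incompatible",
                "message": str(exc),
            })
    return processed


def require_int16(chunk):
    """Example handler that rejects anything but int16 NumPy arrays."""
    if not isinstance(chunk, np.ndarray) or chunk.dtype != np.int16:
        raise TypeError("expected an int16 NumPy array")


alerts = []
good = np.zeros(512, dtype=np.int16)
count = process_stream([good, "not audio", good], require_int16, alerts.append)
print(count, len(alerts))  # 2 1
```

The bad chunk produces exactly one alert, and the chunks on either side of it are still processed.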

Performance Notes

  • Zero-copy is the fast path: When the upstream node outputs int16 @ 16 kHz mono (the standard), downstream nodes receive the exact same memory — no allocation, no copy.
  • Avoid unnecessary WAV wrapping: If your pipeline is fully edge-side, keep audio as int16 numpy throughout. Only convert to WAV when sending to external APIs.
  • NumPy slicing is zero-copy too: audio[start:end] creates a view, not a copy. The Wake Word Engine uses this for its lookback buffer.
  • Buffered streaming: The Wake Word Engine outputs fixed-size chunks (not one giant array). This keeps memory bounded and allows downstream STT to start processing immediately as chunks arrive.
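The view-versus-copy behaviour of NumPy slicing is easy to verify directly. The snippet below uses only standard NumPy calls; the variable names are illustrative.

```python
import numpy as np

audio = np.arange(16_000, dtype=np.int16)  # 1 s of dummy samples

segment = audio[4_000:8_000]         # basic slice: a view, no data copied
copied = audio[4_000:8_000].copy()   # explicit copy: new memory

print(np.shares_memory(audio, segment))  # True
print(np.shares_memory(audio, copied))   # False

# Because a view shares memory, writing through it changes the original.
segment[0] = -1
print(audio[4_000])  # -1
```

The last lines show the trade-off: a node that modifies a view in place also modifies the buffer that upstream and sibling nodes hold, so in-place edits should be made on an explicit copy.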

Edge dependencies

Install SDK extras on the edge host (or cyberwave-edge-core[...] passthroughs). edge-sync returns model_requirements from node emitters and Call Model catalog metadata.
Node / model | edge-sync extra | Install
Audio Track | - | cyberwave[zenoh] on worker; cyberwave[microphone] on capture host
Audio Assistant | ml-audio | cyberwave[ml-audio]
Wake Word Engine | ml-wakeword | cyberwave[ml-wakeword]
Fuzzy Matcher | fuzzy-match | cyberwave[fuzzy-match]
Call Model STT | ml-stt or ml-stt-faster | Per catalog model
Twin / Virtual Controller | - | Base cyberwave (API + MQTT)
On Raspberry Pi, install only the extras your workflow uses — not ml-all. See Edge workflow dependencies.

Related pages

  • Edge dependencies: Compile server, SDK extras, STT catalog
  • Call Model STT: Whisper.cpp and Faster Whisper on edge
  • Audio Track: Edge microphone trigger node
  • Audio Assistant: VAD-powered speech segmentation
  • Wake Word Engine: Voice-activated trigger gate
  • Twin: Read twin + controller policy for command candidates
  • Fuzzy Matcher: Map noisy STT text to command labels
  • Virtual Controller: Dispatch matched commands to the twin over MQTT