
# Audio in Workflows

> How audio transmission, encoding, and format adaptation work across workflow nodes

All Cyberwave audio workflow nodes share a **standard audio format** for inter-node communication. This page explains the format, how nodes adapt incoming audio, and best practices for building audio pipelines.

## Standard Audio Format

Every audio node in a workflow produces and consumes audio using:

| Property        | Value                                   |
| --------------- | --------------------------------------- |
| **Container**   | NumPy array (`dtype=int16`)             |
| **Encoding**    | PCM S16LE (signed 16-bit little-endian) |
| **Sample Rate** | 16,000 Hz                               |
| **Channels**    | 1 (mono)                                |
| **Key name**    | `audio`                                 |

This is raw 16-bit PCM — the same encoding used by telephony systems, Whisper, Silero VAD, and OpenWakeWord. No compression, no container overhead.

## Why This Format

* **Zero-copy between nodes**: NumPy int16 arrays are passed by reference. A 10-second buffer (320,000 bytes) moves between nodes as a lightweight pointer — no data duplication.
* **Broad compatibility**: The speech models used in these pipelines (Whisper, Silero VAD, OpenWakeWord) all operate on 16 kHz mono audio, so int16 PCM at that rate needs no resampling or downmixing before inference.
* **Minimal latency**: No encoding/decoding step between nodes.
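
The size and pass-by-reference claims above can be checked with a few lines of NumPy. This is a minimal sketch: `downstream_node` is a hypothetical stand-in for a workflow node, not part of the Cyberwave API.

```python
import numpy as np

SAMPLE_RATE_HZ = 16_000

# 10 seconds of silence in the standard format: int16, 16 kHz, mono.
audio = np.zeros(10 * SAMPLE_RATE_HZ, dtype=np.int16)


def downstream_node(payload: dict) -> np.ndarray:
    """Hypothetical consumer: returns the array it received, untouched."""
    return payload["audio"]


received = downstream_node({"audio": audio})

print(audio.nbytes)       # 320000 (160,000 samples x 2 bytes each)
print(received is audio)  # True: the same object, nothing was copied
```

Because the dictionary holds a reference to the array, handing it to another function in the same process costs the same regardless of buffer length.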

## Automatic Format Adaptation

Every audio node includes an **ingress adapter** that automatically handles incoming audio regardless of format:

| Incoming Format           | What Happens                              | Performance |
| ------------------------- | ----------------------------------------- | ----------- |
| int16 numpy, 16 kHz, mono | **Zero-copy passthrough** (same object)   | Fastest     |
| int16 numpy, other rate   | Resampled to 16 kHz                       | Fast        |
| int16 numpy, stereo       | Downmixed to mono                         | Fast        |
| float32 numpy             | Scaled to int16 (`× 32768`)               | Fast        |
| WAV bytes (RIFF header)   | Parsed, decoded, resampled if needed      | Moderate    |
| Raw PCM bytes (no header) | Interpreted as int16, resampled if needed | Fast        |

If the format cannot be adapted (e.g. compressed MP3 bytes without a decoder), the node publishes a **human-readable alert** explaining what was received and what was expected.
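
The adaptation table can be sketched as a single function. This is an illustration under stated assumptions, not the actual ingress adapter: `to_standard_audio` and `_resample_linear` are hypothetical names, 2-D arrays are assumed to be laid out as `(frames, channels)`, only 16-bit WAV is handled, and linear interpolation stands in for whatever resampler the real nodes use.

```python
import io
import wave

import numpy as np

TARGET_RATE_HZ = 16_000


def _resample_linear(samples: np.ndarray, src_rate: int) -> np.ndarray:
    """Linear-interpolation resampler (a stand-in for a proper filter)."""
    if src_rate == TARGET_RATE_HZ or samples.size == 0:
        return samples
    n_out = int(round(samples.size * TARGET_RATE_HZ / src_rate))
    src_pos = np.linspace(0, samples.size - 1, num=n_out)
    resampled = np.interp(src_pos, np.arange(samples.size), samples)
    return np.round(resampled).astype(np.int16)


def to_standard_audio(data, sample_rate_hz=TARGET_RATE_HZ, channels=1):
    """Return int16 / 16 kHz / mono audio, or raise ValueError."""
    if isinstance(data, (bytes, bytearray)):
        if data[:4] == b"RIFF":
            # WAV bytes: the header tells us the rate and channel count.
            with wave.open(io.BytesIO(bytes(data)), "rb") as wav:
                if wav.getsampwidth() != 2:
                    raise ValueError("only 16-bit WAV is handled in this sketch")
                sample_rate_hz = wav.getframerate()
                channels = wav.getnchannels()
                data = wav.readframes(wav.getnframes())
        if len(data) % 2:
            raise ValueError("raw PCM byte length must be a multiple of 2")
        # Raw PCM: interpret as little-endian int16 without copying.
        samples = np.frombuffer(data, dtype="<i2").astype(np.int16, copy=False)
        if channels > 1:
            samples = samples.reshape(-1, channels)
    elif isinstance(data, np.ndarray):
        samples = data
    else:
        raise ValueError(f"unsupported audio type: {type(data).__name__}")

    # Fast path: already standard, hand back the same object.
    if (samples.dtype == np.int16 and samples.ndim == 1
            and sample_rate_hz == TARGET_RATE_HZ):
        return samples

    if samples.dtype.kind == "f":
        # Float audio in [-1.0, 1.0]: scale by 32768, clip to the int16 range.
        samples = np.clip(samples * 32768.0, -32768, 32767)
    elif samples.dtype != np.int16:
        raise ValueError(f"unsupported dtype: {samples.dtype}")

    if samples.ndim == 2:
        # (frames, channels) -> mono by averaging the channels.
        samples = samples.mean(axis=1)
    return _resample_linear(np.round(samples).astype(np.int16), sample_rate_hz)
```

Note the clip after scaling: a float sample of exactly `1.0` would otherwise map to 32768, which does not fit in int16.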

## Metadata Alongside Audio

Audio is never passed as a naked array. Every node outputs a **dictionary** with the audio and its metadata:

```python
{
    "audio": numpy_int16_array,      # the PCM data (zero-copy reference)
    "sample_rate_hz": 16000,         # always 16 kHz after normalization
    "channels": 1,                   # always mono after downmix
    "speech_probability": 0.87,      # VAD confidence (Audio Assistant only)
    "is_speaking": True,             # speech state flag
    "start_timestamp_sec": 1.2,      # segment boundary (when available)
    "end_timestamp_sec": 3.8,        # segment boundary (when available)
}
```

This means downstream nodes always know the sample rate, channel count, and timing of what they receive — even in buffer/chunk streaming scenarios.
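
As a rough illustration of how a custom node might build and read such a dictionary, the sketch below uses two hypothetical helpers, `make_audio_payload` and `duration_seconds`. Only the key names come from the example above; the helpers themselves are assumptions.

```python
import numpy as np


def make_audio_payload(audio: np.ndarray, **extra) -> dict:
    """Wrap a standard-format array with its metadata (hypothetical helper)."""
    if audio.dtype != np.int16 or audio.ndim != 1:
        raise ValueError("expected a 1-D int16 array")
    payload = {"audio": audio, "sample_rate_hz": 16000, "channels": 1}
    payload.update(extra)  # optional keys such as speech_probability
    return payload


def duration_seconds(payload: dict) -> float:
    """Derive the duration from the array length and the declared sample rate."""
    return len(payload["audio"]) / payload["sample_rate_hz"]


segment = np.zeros(41_600, dtype=np.int16)
payload = make_audio_payload(
    segment,
    speech_probability=0.87,
    is_speaking=True,
    start_timestamp_sec=1.2,
    end_timestamp_sec=3.8,
)
print(duration_seconds(payload))  # 2.6
```

The array is stored by reference, so wrapping it in a dictionary does not copy the samples.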

## The Audio Pipeline

A typical audio-to-text pipeline:

```
Microphone → Audio Track → Audio Assistant (VAD) → Wake Word Engine → Call Model (Whisper) → text
```

### What each node does:

1. **Audio Track** — Receives raw 20 ms chunks from the edge microphone via Zenoh. Accumulates them in a FIFO buffer and emits when the buffer is full. Normalizes to int16 @ 16 kHz mono. Buffer presets:

   * `voice-assistant`: 512 samples (32 ms) — for Audio Assistant / Silero VAD
   * `wake-word`: 1,280 samples (80 ms) — matches OpenWakeWord's internal frame size
   * `speech-to-text`: 64,000 samples (4 s) — for direct STT input
   * `custom`: user-defined duration in seconds

   Outputs the buffered audio via the `audio` key plus metadata (`sample_rate_hz`, `channels`, `duration_s`, `sample_count`).

2. **Audio Assistant** runs Silero VAD on the continuous stream and produces **two signals**:

   * `speech_probability` (float, 0.0 to 1.0): The raw VAD confidence, emitted on every chunk as a continuous trigger/logic gate
   * `audio` (int16 numpy): Emitted only when a complete speech segment has been detected (at speech end)

   This dual-output design means downstream nodes can react to the VAD trigger immediately, without waiting for the full segment extraction.

   Optionally, the **Output Buffer Preset** re-chunks the extracted speech into fixed-size frames for downstream consumption:

   * `wake-word`: 1,280 samples (80 ms) — for feeding a Wake Word Engine
   * `speech-to-text`: 64,000 samples (4 s) — for Whisper/STT models
   * `custom`: user-defined buffer size in seconds
   * `none` (default): pass the full speech segment as-is

3. **Wake Word Engine** listens for a configurable trigger phrase using [OpenWakeWord](https://github.com/dscripka/openWakeWord). Users select one or more pre-trained models (`alexa`, `hey_mycroft`, `hey_jarvis`, `hey_rhasspy`, `weather`, `timer`) from a searchable multi-select. The engine activates when the score of any selected model reaches or exceeds a single **Detection Threshold** (default **0.5**, shared by all models). Custom `.onnx` model support is coming soon.

   Pre-trained models are downloaded automatically on the edge device via `openwakeword.utils.download_models()`. The node accepts PCM arrays, raw bytes, WAV bytes, or WAV file paths as input, and internally buffers input into exact 80 ms (1,280-sample) frames for optimal openWakeWord prediction accuracy.

   After detection, it streams **fixed-size audio chunks** to downstream nodes until silence is detected. Output chunk size is configurable:

   * `voice-assistant` preset: 512 samples (32 ms) — for feeding another VAD stage
   * `wake-word` preset: 1,280 samples (80 ms) — openWakeWord native frame
   * `speech-to-text` preset: 64,000 samples (4 s) — optimized for Whisper/STT
   * `custom` preset: user-defined buffer size in seconds

   The default output preset is `wake-word` (80 ms); use `speech-to-text` when Whisper follows.

   Each wake session emits one `audio` buffer that contains only the command; the wake phrase itself is trimmed according to **Wake Word Trim**. Wire the Call Model STT input to this `audio` output.

   **Input validation**: The Wake Word Engine validates that its audio input meets openWakeWord requirements (16 kHz, int16/float32, mono, raw PCM). If the upstream audio cannot be adapted to meet these specs, the node raises an error with a detailed explanation.

4. **Call Model (Whisper)** — Receives int16 audio, passes it (with `sample_rate_hz` and `channels` metadata) to the model's `predict()` function. Outputs the transcription as a string.
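
The buffering that Audio Track and the Wake Word Engine perform (collect small chunks, emit fixed-size frames) can be sketched as a simple FIFO. `FrameBuffer` and `PRESET_SAMPLES` are hypothetical names for illustration; the sample counts are the preset sizes listed above, and the 320-sample input corresponds to a 20 ms chunk at 16 kHz.

```python
import numpy as np

PRESET_SAMPLES = {
    "voice-assistant": 512,    # 32 ms at 16 kHz
    "wake-word": 1_280,        # 80 ms
    "speech-to-text": 64_000,  # 4 s
}


class FrameBuffer:
    """Accumulates int16 chunks and emits fixed-size frames in FIFO order."""

    def __init__(self, frame_samples: int):
        self.frame_samples = frame_samples
        self._pending = np.empty(0, dtype=np.int16)

    def push(self, chunk: np.ndarray) -> list:
        """Append a chunk; return every complete frame now available."""
        self._pending = np.concatenate([self._pending, chunk])
        frames = []
        while self._pending.size >= self.frame_samples:
            frames.append(self._pending[: self.frame_samples])
            self._pending = self._pending[self.frame_samples:]
        return frames


buffer = FrameBuffer(PRESET_SAMPLES["wake-word"])
frames = []
for _ in range(8):                            # 8 chunks x 20 ms = 160 ms
    chunk = np.arange(320, dtype=np.int16)    # one 20 ms microphone chunk
    frames.extend(buffer.push(chunk))

print(len(frames), frames[0].size)  # 2 1280
```

This version copies on every `push` via `np.concatenate`, which keeps the sketch short. A production buffer would more likely preallocate a ring buffer to avoid the repeated allocation.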

### Voice → robot command (optional tail)

After STT, add controller-aware matching and MQTT dispatch:

```
call_model.result
  → fuzzy_matcher.query           (Uncertain String)
twin.control_actuations (or control_labels)
  → fuzzy_matcher.candidates      (Source of Truth: string or array)
fuzzy_matcher.match_string → virtual_controller.command
```

| Node                   | Role                                                                                           |
| ---------------------- | ---------------------------------------------------------------------------------------------- |
| **Twin**               | Same twin as Virtual Controller; exposes valid commands from assigned policy                   |
| **Fuzzy Matcher**      | Maps the uncertain string to the best source-of-truth entry (`match`, `match_string`, `score`) |
| **Virtual Controller** | Resolves label/actuation and publishes `cyberwave/twin/{uuid}/command`                         |

Branch on `fuzzy_matcher.match` with **Conditional** before Virtual Controller so unrelated speech does not dispatch commands. See [Fuzzy Matcher](/feature-reference/workflows/fuzzy-matcher) and [Virtual Controller](/feature-reference/workflows/virtual-controller).
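
The match-then-gate step can be sketched in plain Python. This is an illustration only: the Fuzzy Matcher node depends on `rapidfuzz` according to the dependency tables on this page, whereas the sketch substitutes the standard library's `difflib` so it runs anywhere. The `fuzzy_match` function, its `threshold` parameter, and the assumption that `match` is a boolean are all hypothetical.

```python
from difflib import SequenceMatcher


def fuzzy_match(query: str, candidates: list, threshold: float = 0.6) -> dict:
    """Pick the candidate most similar to the query (difflib stand-in)."""
    best, best_score = "", 0.0
    normalized = query.lower().strip()
    for candidate in candidates:
        score = SequenceMatcher(None, normalized, candidate.lower()).ratio()
        if score > best_score:
            best, best_score = candidate, score
    matched = best_score >= threshold
    return {
        "match": matched,                          # assumed boolean gate
        "match_string": best if matched else "",
        "score": best_score,
    }


commands = ["move forward", "turn left", "stop"]

for transcript in ["Move forwards", "what is the weather like today"]:
    result = fuzzy_match(transcript, commands)
    if result["match"]:   # the Conditional branch before Virtual Controller
        print("dispatch:", result["match_string"])
    else:
        print("ignored:", transcript)
```

Gating on `match` is what prevents the second transcript from reaching the controller: its best score stays below the threshold, so nothing is dispatched.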

### ML model environment

Call Model nodes receive both the audio array and its metadata. The model's predict call always knows the format:

```python
model.predict(audio_int16, sample_rate_hz=16000, channels=1, twin_uuid="...")
```

For cloud APIs that require WAV, the ingress adapter automatically converts int16 to WAV bytes before upload.
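
The int16-to-WAV conversion can be done with the standard library's `wave` module. The helper name `int16_to_wav_bytes` below is hypothetical; it is a sketch of one way to do the conversion, not necessarily how the nodes implement it.

```python
import io
import wave

import numpy as np


def int16_to_wav_bytes(audio: np.ndarray, sample_rate_hz: int = 16000,
                       channels: int = 1) -> bytes:
    """Wrap raw int16 PCM samples in a RIFF/WAV container, in memory."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)               # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate_hz)
        # WAV stores PCM little-endian; "<i2" makes that explicit.
        wav.writeframes(audio.astype("<i2", copy=False).tobytes())
    return buffer.getvalue()


audio = np.array([0, 1000, -1000, 32767], dtype=np.int16)
wav_bytes = int16_to_wav_bytes(audio)
print(wav_bytes[:4])  # b'RIFF'
```

Unlike passing the NumPy array between nodes, this step copies the samples, which is why it is best reserved for the boundary with an external API.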

## Connecting Nodes

All audio nodes use the **same `audio` key** for input and output. In the workflow editor, you simply connect one node's output to the next node's input — no format configuration needed.

```
Node A [audio] ──→ [audio] Node B [audio] ──→ [audio] Node C
```

The nodes handle format adaptation internally. You never need to insert format conversion nodes or configure sample rates manually.

## Compile-Time Validation

When you compile a workflow for edge (`run_on_edge: true`), the Django **compile server** imports the libraries each node needs and fails with an actionable message if a package is missing (same pattern as Wake Word Engine).

| Node / model                | Compile-server packages           |
| --------------------------- | --------------------------------- |
| Wake Word Engine            | `openwakeword`, `onnxruntime`     |
| Fuzzy Matcher               | `rapidfuzz`                       |
| Audio Assistant             | `silero-vad` (+ torch transitive) |
| Call Model — whisper.cpp    | `pywhispercpp`                    |
| Call Model — faster-whisper | `faster-whisper`                  |
| Twin / Virtual Controller   | `httpx`, `paho-mqtt` (base deps)  |
| Audio Track                 | no ML import at compile           |

Packages live in `cyberwave-backend/requirements/base.txt`. Rebuild the Django image after changes. Full tables, SDK extras, and slim worker builds: **[Edge workflow dependencies](/feature-reference/workflows/edge-dependencies)**.

The assembler also validates that audio-consuming nodes have an audio-producing upstream and that the chain is topologically valid.
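
Both checks can be approximated with the standard library. In this sketch, `missing_modules` and `validate_chain` are hypothetical names, and `importlib.util.find_spec` only locates a top-level module without importing it, which is lighter than the full import the compile server is described as doing. The graph check assumes the workflow is given as a mapping from each node to its direct upstream nodes.

```python
import importlib.util
from graphlib import CycleError, TopologicalSorter


def missing_modules(required: dict) -> dict:
    """Map node name -> top-level modules that cannot be found."""
    missing = {}
    for node, modules in required.items():
        absent = [m for m in modules if importlib.util.find_spec(m) is None]
        if absent:
            missing[node] = absent
    return missing


def validate_chain(upstreams: dict, audio_producers: set,
                   audio_consumers: set) -> list:
    """Return human-readable problems; an empty list means the chain is valid."""
    errors = []
    try:
        # static_order() raises CycleError if no topological order exists.
        tuple(TopologicalSorter(upstreams).static_order())
    except CycleError:
        errors.append("workflow graph contains a cycle")
    for node in sorted(audio_consumers):
        if not any(up in audio_producers for up in upstreams.get(node, ())):
            errors.append(f"{node} has no audio-producing upstream")
    return errors


workflow = {
    "audio_track": [],
    "audio_assistant": ["audio_track"],
    "call_model": [],                      # not wired to any audio source
}
print(validate_chain(
    workflow,
    audio_producers={"audio_track", "audio_assistant"},
    audio_consumers={"audio_assistant", "call_model"},
))
```

Note that import names can differ from package names (for example, a package published with a hyphen is usually imported with an underscore), so a real check needs the import name for each dependency.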

## Supported Input Formats (per node)

Every audio node accepts these formats as input:

| Format        | Example Source                            |
| ------------- | ----------------------------------------- |
| NumPy int16   | Audio Track output, other audio nodes     |
| NumPy float32 | Custom code nodes, ML model outputs       |
| WAV bytes     | File uploads, cloud API responses         |
| Raw PCM bytes | WebSocket streams, raw microphone drivers |

## Error Handling

When a node receives audio it cannot process:

1. **Logs an error** with format details (detected type, rate, channels)
2. **Publishes an alert** to the twin (`alert_type='audio_format_incompatible'`) visible in the Cyberwave UI
3. **Skips the frame** gracefully — the workflow continues with the next chunk

No crashes. No silent failures.
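
The log, alert, skip sequence can be sketched as a processing loop. `process_frames`, `handle`, and `publish_alert` are hypothetical names: in a real node, the alert would go to the twin rather than to a local callback, and only the `alert_type` value is taken from the list above.

```python
import logging

import numpy as np

logger = logging.getLogger("audio_node")


def process_frames(frames, handle, publish_alert) -> int:
    """Process each frame; on a bad frame, log, alert, and move on."""
    processed = 0
    for frame in frames:
        ok = (isinstance(frame, np.ndarray)
              and frame.dtype == np.int16 and frame.ndim == 1)
        if not ok:
            detail = f"expected 1-D int16 numpy array, got {type(frame).__name__}"
            logger.error("audio format incompatible: %s", detail)     # 1. log
            publish_alert({"alert_type": "audio_format_incompatible",
                           "message": detail})                        # 2. alert
            continue                                                  # 3. skip
        handle(frame)
        processed += 1
    return processed


alerts = []
frames = [np.zeros(4, dtype=np.int16), "mp3?", np.zeros(4, dtype=np.int16)]
count = process_frames(frames, handle=lambda f: None, publish_alert=alerts.append)
print(count, len(alerts))  # 2 1
```

The bad frame in the middle produces one alert, and the frames on either side of it are still processed.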

## Performance Notes

* **Zero-copy is the fast path**: When the upstream node outputs int16 @ 16 kHz mono (the standard), downstream nodes receive the exact same memory — no allocation, no copy.
* **Avoid unnecessary WAV wrapping**: If your pipeline is fully edge-side, keep audio as int16 numpy throughout. Only convert to WAV when sending to external APIs.
* **NumPy slicing is zero-copy too**: `audio[start:end]` creates a view, not a copy. The Wake Word Engine uses this for its lookback buffer.
* **Buffered streaming**: The Wake Word Engine outputs fixed-size chunks (not one giant array). This keeps memory bounded and allows downstream STT to start processing immediately as chunks arrive.
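
The slicing behavior is standard NumPy and easy to verify. In the sketch below, the variable names are illustrative; the 1,280-sample slice mirrors one 80 ms wake-word frame.

```python
import numpy as np

audio = np.arange(16_000, dtype=np.int16)   # one second at 16 kHz
lookback = audio[-1_280:]                   # last 80 ms, as a view

print(np.shares_memory(audio, lookback))        # True: same underlying buffer
print(np.shares_memory(audio, lookback.copy())) # False: an explicit copy

# A view is not a snapshot: writes through it are visible in the original.
lookback[0] = -1
print(audio[-1_280])                            # -1
```

Call `.copy()` when a node needs to keep a slice after the upstream buffer may be overwritten or reused.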

## Edge dependencies

Install SDK extras on the edge host (or `cyberwave-edge-core[...]` passthroughs). `edge-sync` returns `model_requirements` from node emitters and Call Model catalog metadata.

| Node / model              | `edge-sync` extra           | Install                                                               |
| ------------------------- | --------------------------- | --------------------------------------------------------------------- |
| Audio Track               | —                           | `cyberwave[zenoh]` on worker; `cyberwave[microphone]` on capture host |
| Audio Assistant           | `ml-audio`                  | `cyberwave[ml-audio]`                                                 |
| Wake Word Engine          | `ml-wakeword`               | `cyberwave[ml-wakeword]`                                              |
| Fuzzy Matcher             | `fuzzy-match`               | `cyberwave[fuzzy-match]`                                              |
| Call Model STT            | `ml-stt` or `ml-stt-faster` | Per [catalog model](/feature-reference/workflows/call-model-stt-edge) |
| Twin / Virtual Controller | —                           | Base `cyberwave` (API + MQTT)                                         |

On Raspberry Pi, install **only** the extras your workflow uses — not `ml-all`. See **[Edge workflow dependencies](/feature-reference/workflows/edge-dependencies)**.

## Related Pages

<CardGroup cols={3}>
  <Card title="Edge dependencies" icon="puzzle-piece" href="/feature-reference/workflows/edge-dependencies">
    Compile server, SDK extras, STT catalog
  </Card>

  <Card title="Call Model STT" icon="microphone-lines" href="/feature-reference/workflows/call-model-stt-edge">
    Whisper.cpp and Faster Whisper on edge
  </Card>

  <Card title="Audio Track" icon="microphone" href="/feature-reference/workflows/audio-track/overview">
    Edge microphone trigger node
  </Card>

  <Card title="Audio Assistant" icon="volume-high" href="/feature-reference/workflows/audio-assistant/overview">
    VAD-powered speech segmentation
  </Card>

  <Card title="Wake Word Engine" icon="microphone" href="/feature-reference/workflows/wake-word-engine/overview">
    Voice-activated trigger gate
  </Card>

  <Card title="Twin" icon="robot" href="/feature-reference/workflows/twin">
    Read twin + controller policy for command candidates
  </Card>

  <Card title="Fuzzy Matcher" icon="bullseye" href="/feature-reference/workflows/fuzzy-matcher">
    Map noisy STT text to command labels
  </Card>

  <Card title="Virtual Controller" icon="gamepad" href="/feature-reference/workflows/virtual-controller">
    Dispatch matched commands to the twin over MQTT
  </Card>
</CardGroup>
