# Technical Reference

> Audio Assistant I/O schema, buffering, and runtime architecture.

## Node inputs (configuration)

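Fields are stored in `node.parameters` unless wired from an upstream node. In the table, VA is the `voice_assistant` modality and SSG is the `sound_security_guard` modality: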

| Field                  | VA | SSG | Description                                  |
| ---------------------- | -- | --- | -------------------------------------------- |
| `modality`             | ✓  | ✓   | `voice_assistant` or `sound_security_guard`  |
| `sub_modality`         | ✓  | ✓   | VAD profile or security scenario             |
| `vad_*`                | ✓  | —   | Silero thresholds (advanced)                 |
| `output_buffer_preset` | ✓  | ✓   | VA: output re-chunking; SSG: analysis window |
| `output_buffer_size_s` | ✓  | ✓   | Custom buffer/window seconds                 |
| `confidence_threshold` | —  | ✓   | SSG alert threshold (0–1)                    |
| `event_cooldown_s`     | —  | ✓   | SSG seconds between alerts                   |
| `custom_event_labels`  | —  | ✓   | SSG custom scenario label list               |

Runtime audio is resolved automatically from upstream `audio` via `_CwAudioIngress` (no manual `sample_rate_hz` input).
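As a rough illustration of the table above, the sketch below checks which keys in a parameter dictionary are not used by the selected modality. The helper name `unsupported_fields`, the `vad_threshold` key, the scenario value `glass_break` for `sub_modality`, and all numeric values are assumptions made for the example, not documented defaults or part of the node's API.

```python
# Field applicability transcribed from the configuration table.
SHARED_FIELDS = {
    "modality",
    "sub_modality",
    "output_buffer_preset",
    "output_buffer_size_s",
}
SSG_ONLY_FIELDS = {"confidence_threshold", "event_cooldown_s", "custom_event_labels"}


def unsupported_fields(params: dict) -> set:
    """Return the keys in params that the chosen modality does not use.

    Hypothetical helper for illustration only.
    """
    modality = params.get("modality")
    if modality not in ("voice_assistant", "sound_security_guard"):
        raise ValueError(f"unknown modality: {modality!r}")

    allowed = set(SHARED_FIELDS)
    if modality == "sound_security_guard":
        allowed |= SSG_ONLY_FIELDS

    unused = set()
    for key in params:
        if key in allowed:
            continue
        # vad_* fields apply to the voice assistant only.
        if modality == "voice_assistant" and key.startswith("vad_"):
            continue
        unused.add(key)
    return unused


# Example SSG configuration; the values are illustrative, not defaults.
ssg_params = {
    "modality": "sound_security_guard",
    "sub_modality": "glass_break",
    "output_buffer_preset": "speech-to-text",
    "confidence_threshold": 0.6,
    "event_cooldown_s": 10,
}
print(unsupported_fields(ssg_params))  # empty set: every key applies to SSG
```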

## VA outputs

| Output                | Type                        | When emitted            |
| --------------------- | --------------------------- | ----------------------- |
| `audio`               | AUDIO (int16 @ 16 kHz mono) | Speech segment complete |
| `speech_probability`  | NUMBER                      | Every chunk             |
| `is_speaking`         | BOOLEAN                     | Every chunk             |
| `start_timestamp_sec` | NUMBER                      | With `audio`            |
| `end_timestamp_sec`   | NUMBER                      | With `audio`            |
| `sample_rate_hz`      | NUMBER                      | Always `16000`          |
| `channels`            | NUMBER                      | Always `1`              |

## SSG outputs

| Output                | Type    | When emitted                      |
| --------------------- | ------- | --------------------------------- |
| `audio`               | AUDIO   | Alert fired (analysis window)     |
| `event_detected`      | BOOLEAN | Every chunk                       |
| `event_confidence`    | NUMBER  | Every chunk (best scenario match) |
| `event_label`         | STRING  | Best matching AudioSet label      |
| `event_type`          | STRING  | Active scenario key               |
| `active_scenario`     | STRING  | Same as `event_type`              |
| `start_timestamp_sec` | NUMBER  | With alert                        |
| `end_timestamp_sec`   | NUMBER  | With alert                        |
| `sample_rate_hz`      | NUMBER  | Always `16000`                    |
| `channels`            | NUMBER  | Always `1`                        |

## Audio ingress (shared)

All modalities use `_CwAudioIngress.adapt_safe()`:

* **Canonical format:** numpy `int16`, 16000 Hz, mono, key `audio`
* **Accepted inputs:** int16/float32 numpy, raw PCM bytes, WAV bytes
* **Passthrough:** int16 @ 16 kHz mono is zero-copy
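To make the passthrough condition concrete, the sketch below inspects a WAV header with the standard library and reports whether the payload already matches the canonical format. It does not reproduce `_CwAudioIngress`; the function names `wav_needs_conversion` and `make_silent_wav` are hypothetical, and the real ingress may apply further checks.

```python
import io
import wave

CANONICAL_RATE_HZ = 16000
CANONICAL_CHANNELS = 1
CANONICAL_SAMPLE_WIDTH = 2  # bytes per int16 sample


def wav_needs_conversion(wav_bytes: bytes) -> bool:
    """Return True if the WAV payload is not int16, 16 kHz, mono."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        is_canonical = (
            wav.getframerate() == CANONICAL_RATE_HZ
            and wav.getnchannels() == CANONICAL_CHANNELS
            and wav.getsampwidth() == CANONICAL_SAMPLE_WIDTH
        )
    return not is_canonical


def make_silent_wav(rate_hz: int, channels: int, n_frames: int) -> bytes:
    """Build a small 16-bit silent WAV in memory, used only for the demo."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)
        wav.setframerate(rate_hz)
        wav.writeframes(b"\x00\x00" * n_frames * channels)
    return buf.getvalue()


print(wav_needs_conversion(make_silent_wav(16000, 1, 160)))  # False
print(wav_needs_conversion(make_silent_wav(44100, 2, 441)))  # True
```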

## VA architecture

`_CwAudioStreamProcessor` + Silero `VADIterator`:

1. Incoming chunks reframed to **512 samples** (32 ms @ 16 kHz)
2. Ring buffer stores float32 audio; indices from VAD mark utterance bounds
3. On speech end: slice buffer → int16 segment → emit
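Step 1 can be pictured with a small reframer that accepts chunks of any length and returns fixed 512-sample frames, carrying the remainder over to the next call. This is a minimal sketch using plain Python lists; the class name `Reframer` is hypothetical and the actual `_CwAudioStreamProcessor` is not shown in this reference.

```python
FRAME_SAMPLES = 512      # 512 / 16000 Hz = 0.032 s = 32 ms
SAMPLE_RATE_HZ = 16000


class Reframer:
    """Re-chunk arbitrary-length input into fixed-size frames (illustrative)."""

    def __init__(self, frame_samples: int = FRAME_SAMPLES):
        self.frame_samples = frame_samples
        self._pending = []  # samples waiting for a full frame

    def push(self, chunk):
        """Add a chunk and return every complete frame now available."""
        self._pending.extend(chunk)
        frames = []
        while len(self._pending) >= self.frame_samples:
            frames.append(self._pending[: self.frame_samples])
            del self._pending[: self.frame_samples]
        return frames


reframer = Reframer()
first = reframer.push([0] * 1300)   # two full frames, 276 samples left over
second = reframer.push([0] * 236)   # 276 + 236 = 512, so one more frame
print(len(first), len(second))      # 2 1
```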

## SSG architecture

`_CwAaAstGuard` + Hugging Face AST:

1. Accumulate int16 chunks until `window_samples` (from buffer preset, default 4 s)
2. Convert to float32, run `ASTFeatureExtractor` + `ASTForAudioClassification`
3. Multi-label sigmoid over 527 classes; filter by active scenario labels only
4. If `max(matching_probs) >= confidence_threshold` → emit alert + audio window
5. Slide buffer by `hop_samples` (50% overlap); enforce `event_cooldown_s`
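The windowing and cooldown logic in steps 1, 4, and 5 can be sketched without the model by injecting a scoring function. Everything named here (`SlidingGuard`, `score_fn`, the explicit `now_s` argument) is hypothetical. The sketch also assumes the buffer keeps sliding while analysis is skipped during cooldown, which this reference does not state.

```python
class SlidingGuard:
    """Windowed scoring with 50% overlap and an alert cooldown (illustrative)."""

    def __init__(self, window_samples, confidence_threshold, event_cooldown_s, score_fn):
        self.window_samples = window_samples
        self.hop_samples = window_samples // 2  # 50% overlap
        self.confidence_threshold = confidence_threshold
        self.event_cooldown_s = event_cooldown_s
        self.score_fn = score_fn  # stand-in for AST inference plus label filtering
        self._buffer = []
        self._last_alert_at = None

    def push(self, chunk, now_s):
        """Add samples; return a list of (score, window) alerts fired."""
        self._buffer.extend(chunk)
        alerts = []
        while len(self._buffer) >= self.window_samples:
            window = self._buffer[: self.window_samples]
            del self._buffer[: self.hop_samples]  # slide by one hop

            in_cooldown = (
                self._last_alert_at is not None
                and now_s - self._last_alert_at < self.event_cooldown_s
            )
            if in_cooldown:
                continue  # analysis is skipped until the cooldown expires

            score = self.score_fn(window)
            if score >= self.confidence_threshold:
                self._last_alert_at = now_s
                alerts.append((score, window))
        return alerts


# Tiny window sizes keep the demo readable; the default window is 4 s of audio.
guard = SlidingGuard(8, 0.5, 10.0, score_fn=lambda window: 0.9)
print(len(guard.push([0] * 8, now_s=0.0)))   # 1: first full window alerts
print(len(guard.push([0] * 4, now_s=1.0)))   # 0: still in cooldown
print(len(guard.push([0] * 4, now_s=11.0)))  # 1: cooldown has expired
```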

### Model

* **ID:** `MIT/ast-finetuned-audioset-10-10-0.4593`
* **Training:** AudioSet (527 classes)
* **Inference:** `torch.sigmoid(logits)` per class
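Because inference applies a sigmoid per class rather than a softmax, each class gets an independent probability and several classes can score high at once. The sketch below shows this with the standard library only; the logit values are made up and the three labels are used purely as examples.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


# Made-up logits for three classes; the real model produces 527.
logits = {"Glass": 2.0, "Siren": 1.5, "Speech": -3.0}
probs = {label: sigmoid(value) for label, value in logits.items()}

for label, p in probs.items():
    print(f"{label}: {p:.3f}")

# Unlike softmax outputs, these probabilities need not sum to 1.
print(sum(probs.values()) > 1.0)  # True
```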

### Scenario label maps (built-in)

```text
glass_break      → Glass, Shatter, Smash crash, Thump, Crash, Bang
constant_alarm   → Alarm, Car alarm, Smoke detector, Siren, …
screaming        → Screaming, Yell, Shout, Children shouting
custom           → user-provided substrings
```

Matching is case-insensitive substring: `target in class_label.lower()`.
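A minimal sketch of this rule, combined with the "best scenario match" selection, is shown below. The function name `best_match` and the probability values are hypothetical, and the sketch lowercases the targets as well, which is an assumption. Note that substring matching is broad: a target such as `alarm` also matches any label that merely contains that word.

```python
def best_match(probs_by_label: dict, targets: list):
    """Return (label, probability) for the highest-scoring matching class.

    A class matches when any target is a substring of its lowercased label.
    Returns None when nothing matches, for example when targets is empty.
    """
    lowered = [target.lower() for target in targets]
    candidates = {
        label: prob
        for label, prob in probs_by_label.items()
        if any(target in label.lower() for target in lowered)
    }
    if not candidates:
        return None
    label = max(candidates, key=candidates.get)
    return label, candidates[label]


# Illustrative per-class probabilities, not real model output.
probs = {"Glass": 0.72, "Shatter": 0.81, "Speech": 0.95, "Alarm clock": 0.10}

print(best_match(probs, ["glass", "shatter"]))  # ('Shatter', 0.81)
print(best_match(probs, ["alarm"]))             # ('Alarm clock', 0.1)
print(best_match(probs, []))                    # None
```

`Speech` has the highest probability overall but is ignored because it matches no target, which mirrors filtering by the active scenario's labels only.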

## Buffer presets (shared field)

| Preset           | Seconds                | Samples @ 16 kHz | Typical modality                                |
| ---------------- | ---------------------- | ---------------- | ----------------------------------------------- |
| `none`           | —                      | —                | VA: full utterance                              |
| `wake-word`      | 0.08                   | 1280             | VA downstream / not recommended for SSG         |
| `speech-to-text` | 4.0                    | 64000            | VA STT chunks / **SSG default analysis window** |
| `custom`         | `output_buffer_size_s` | —                | User-defined                                    |

SSG enforces a minimum analysis window of **1.0 s**.
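The preset table and the SSG minimum can be expressed as a small resolver. This is a sketch under assumptions: the function name `resolve_window_samples` is hypothetical, and it clamps short SSG windows up to 1.0 s, whereas the actual node might reject them instead. How SSG treats the `none` preset is not specified here, so the sketch simply returns `None` for it.

```python
SAMPLE_RATE_HZ = 16000
PRESET_SECONDS = {"wake-word": 0.08, "speech-to-text": 4.0}
SSG_MIN_WINDOW_S = 1.0


def resolve_window_samples(preset, custom_seconds=None, modality="voice_assistant"):
    """Map a buffer preset to a sample count at 16 kHz (illustrative)."""
    if preset == "none":
        return None  # VA: emit the full utterance without re-chunking

    if preset == "custom":
        if custom_seconds is None:
            raise ValueError("custom preset requires output_buffer_size_s")
        seconds = float(custom_seconds)
    elif preset in PRESET_SECONDS:
        seconds = PRESET_SECONDS[preset]
    else:
        raise ValueError(f"unknown preset: {preset!r}")

    if modality == "sound_security_guard":
        # Assumption: values below the minimum are clamped, not rejected.
        seconds = max(seconds, SSG_MIN_WINDOW_S)

    return int(round(seconds * SAMPLE_RATE_HZ))


print(resolve_window_samples("wake-word"))                                    # 1280
print(resolve_window_samples("speech-to-text", modality="sound_security_guard"))  # 64000
print(resolve_window_samples("wake-word", modality="sound_security_guard"))   # 16000
```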

## Alert attribution

On SSG alert, `_source_summary` includes `kind: audio_assistant`, `modality: sound_security_guard`, `sub_modality`, `event_label`, and `event_confidence` for Send Alert `source_chain` metadata.
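For orientation, the payload described above might look like the dictionary built below. Only the key names and the two fixed values come from this reference; the helper name `build_source_summary`, the example arguments, and the absence of any further keys are assumptions.

```python
def build_source_summary(sub_modality: str, event_label: str, event_confidence: float) -> dict:
    """Assemble the attribution fields listed above (illustrative shape only)."""
    return {
        "kind": "audio_assistant",
        "modality": "sound_security_guard",
        "sub_modality": sub_modality,
        "event_label": event_label,
        "event_confidence": float(event_confidence),
    }


# Example values are made up for the demo.
summary = build_source_summary("glass_break", "Shatter", 0.81)
print(summary["kind"], summary["event_label"], summary["event_confidence"])
```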

## Edge cases

| Case                           | Behavior                                      |
| ------------------------------ | --------------------------------------------- |
| Empty audio chunk              | Skipped                                       |
| SSG during cooldown            | Analysis skipped until cooldown expires       |
| Custom scenario with no labels | No matches; never alerts                      |
| Upstream non-16 kHz            | Resampled by audio ingress                    |
| VA oversized chunks            | Reframed to 512 samples with one-time warning |