Node inputs (configuration)
Stored in `node.parameters` unless wired from upstream:
| Field | VA | SSG | Description |
|---|---|---|---|
| `modality` | ✓ | ✓ | `voice_assistant` or `sound_security_guard` |
| `sub_modality` | ✓ | ✓ | VAD profile or security scenario |
| `vad_*` | ✓ | — | Silero thresholds (advanced) |
| `output_buffer_preset` | ✓ | ✓ | VA: output re-chunking; SSG: analysis window |
| `output_buffer_size_s` | ✓ | ✓ | Custom buffer/window length in seconds |
| `confidence_threshold` | — | ✓ | SSG alert threshold (0–1) |
| `event_cooldown_s` | — | ✓ | SSG seconds between alerts |
| `custom_event_labels` | — | ✓ | SSG custom scenario label list |
`audio` arrives via `_CwAudioIngress` (no manual `sample_rate_hz` input).
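As an illustration, an SSG configuration could look like the sketch below. The field names come from the table above; every value, the `"custom"` scenario key, and the `validate_ssg_parameters` helper are hypothetical and only show the expected shapes and ranges.

```python
# Hypothetical SSG parameters: field names follow the table, values are illustrative.
ssg_parameters = {
    "modality": "sound_security_guard",
    "sub_modality": "custom",              # assumed key for a user-defined scenario
    "output_buffer_preset": "custom",
    "output_buffer_size_s": 4.0,           # analysis window in seconds
    "confidence_threshold": 0.6,           # must lie in [0, 1]
    "event_cooldown_s": 10.0,              # seconds between alerts
    "custom_event_labels": ["glass", "alarm"],
}


def validate_ssg_parameters(params: dict) -> list:
    """Return a list of problems; an empty list means the config looks sane."""
    problems = []
    if params.get("modality") not in ("voice_assistant", "sound_security_guard"):
        problems.append("modality must be voice_assistant or sound_security_guard")
    threshold = params.get("confidence_threshold", 0.0)
    if not 0.0 <= threshold <= 1.0:
        problems.append("confidence_threshold must be in [0, 1]")
    if params.get("event_cooldown_s", 0.0) < 0:
        problems.append("event_cooldown_s must be non-negative")
    if (params.get("output_buffer_preset") == "custom"
            and params.get("output_buffer_size_s", 0) <= 0):
        problems.append("custom preset needs output_buffer_size_s > 0")
    return problems
```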
VA outputs
| Output | Type | When emitted |
|---|---|---|
| `audio` | AUDIO (int16 @ 16 kHz mono) | Speech segment complete |
| `speech_probability` | NUMBER | Every chunk |
| `is_speaking` | BOOLEAN | Every chunk |
| `start_timestamp_sec` | NUMBER | With audio |
| `end_timestamp_sec` | NUMBER | With audio |
| `sample_rate_hz` | NUMBER | Always 16000 |
| `channels` | NUMBER | Always 1 |
SSG outputs
| Output | Type | When emitted |
|---|---|---|
| `audio` | AUDIO | Alert fired (analysis window) |
| `event_detected` | BOOLEAN | Every chunk |
| `event_confidence` | NUMBER | Every chunk (best scenario match) |
| `event_label` | STRING | Best matching AudioSet label |
| `event_type` | STRING | Active scenario key |
| `active_scenario` | STRING | Same as `event_type` |
| `start_timestamp_sec` | NUMBER | With alert |
| `end_timestamp_sec` | NUMBER | With alert |
| `sample_rate_hz` | NUMBER | Always 16000 |
| `channels` | NUMBER | Always 1 |
Audio ingress (shared)
All modalities use `_CwAudioIngress.adapt_safe()`:
- Canonical format: numpy int16, 16000 Hz, mono, key `audio`
- Accepted inputs: int16/float32 numpy, raw PCM bytes, WAV bytes
- Passthrough: int16 @ 16 kHz mono is zero-copy
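The normalization rules above can be sketched roughly as follows. This is not the code behind `_CwAudioIngress.adapt_safe()`: `to_canonical` is a hypothetical stand-in, raw PCM and WAV input are assumed to be 16-bit little-endian, and linear interpolation via `np.interp` is swapped in for whatever resampler the real ingress uses.

```python
import io
import wave

import numpy as np

TARGET_RATE_HZ = 16000


def to_canonical(audio, sample_rate_hz: int = TARGET_RATE_HZ) -> dict:
    """Hypothetical ingress sketch: returns {"audio": int16 mono @ 16 kHz}."""
    if isinstance(audio, (bytes, bytearray)):
        data = bytes(audio)
        if data[:4] == b"RIFF" and data[8:12] == b"WAVE":
            with wave.open(io.BytesIO(data), "rb") as wav:
                if wav.getsampwidth() != 2:
                    raise ValueError("sketch only handles 16-bit PCM WAV")
                sample_rate_hz = wav.getframerate()
                frames = wav.readframes(wav.getnframes())
                samples = np.frombuffer(frames, dtype="<i2")
                samples = samples.reshape(-1, wav.getnchannels())
        else:
            # Assumption: raw PCM bytes are int16 mono at sample_rate_hz.
            samples = np.frombuffer(data, dtype="<i2")
    else:
        samples = np.asarray(audio)

    # Passthrough: already canonical, so return the same array without copying.
    if (samples.dtype == np.int16 and samples.ndim == 1
            and sample_rate_hz == TARGET_RATE_HZ):
        return {"audio": samples}

    as_float = samples.astype(np.float32)
    if np.issubdtype(samples.dtype, np.integer):
        as_float /= 32768.0                      # int16 -> [-1, 1)
    if as_float.ndim == 2:
        as_float = as_float.mean(axis=1)         # (frames, channels) -> mono
    if sample_rate_hz != TARGET_RATE_HZ and as_float.size:
        n_out = int(round(as_float.size * TARGET_RATE_HZ / sample_rate_hz))
        t_in = np.arange(as_float.size) / sample_rate_hz
        t_out = np.arange(n_out) / TARGET_RATE_HZ
        as_float = np.interp(t_out, t_in, as_float)  # crude linear resample
    pcm = np.clip(np.round(as_float * 32768.0), -32768, 32767).astype(np.int16)
    return {"audio": pcm}
```

Linear interpolation keeps the sketch dependency-free but does no anti-alias filtering, so it should not be read as a recommendation for production resampling.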
VA architecture
_CwAudioStreamProcessor + Silero VADIterator:
- Incoming chunks reframed to 512 samples (32 ms @ 16 kHz)
- Ring buffer stores float32 audio; indices from VAD mark utterance bounds
- On speech end: slice buffer → int16 segment → emit
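A minimal sketch of the reframing and segment-slicing steps, assuming a simple pending-sample buffer rather than the actual ring buffer in `_CwAudioStreamProcessor`. `Reframer` and `slice_segment` are hypothetical names, and the Silero call itself is left out.

```python
import numpy as np

FRAME_SAMPLES = 512  # 32 ms at 16 kHz


class Reframer:
    """Hypothetical helper: turns arbitrary-size chunks into fixed-size frames."""

    def __init__(self, frame_samples: int = FRAME_SAMPLES):
        self.frame_samples = frame_samples
        self._pending = np.zeros(0, dtype=np.float32)

    def push(self, chunk_int16: np.ndarray):
        """Append an int16 chunk and yield every complete float32 frame."""
        chunk = chunk_int16.astype(np.float32) / 32768.0
        self._pending = np.concatenate([self._pending, chunk])
        while self._pending.size >= self.frame_samples:
            frame = self._pending[: self.frame_samples]
            self._pending = self._pending[self.frame_samples:]
            yield frame  # each frame would be handed to the VAD iterator


def slice_segment(buffer_f32: np.ndarray, start: int, end: int) -> np.ndarray:
    """Cut [start, end) out of the float32 buffer and convert it to int16."""
    segment = buffer_f32[start:end]
    return np.clip(np.round(segment * 32768.0), -32768, 32767).astype(np.int16)
```

Leftover samples stay in the pending buffer until the next chunk arrives, so a 1300-sample chunk yields two frames and carries 276 samples forward.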
SSG architecture
_CwAaAstGuard + Hugging Face AST:
- Accumulate int16 chunks until `window_samples` (from buffer preset, default 4 s)
- Convert to float32, run `ASTFeatureExtractor` + `ASTForAudioClassification`
- Multi-label sigmoid over 527 classes; filter by active scenario labels only
- If `max(matching_probs) >= confidence_threshold` → emit alert + audio window
- Slide buffer by `hop_samples` (50% overlap); enforce `event_cooldown_s`
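The accumulate, score, and slide loop can be sketched as below. `WindowedGuard` is a hypothetical class, not `_CwAaAstGuard`; the AST model is replaced by an injected `score_fn` that returns the best matching probability for a window, and the cooldown is measured in audio time, which is an assumption (the real node may use wall-clock time).

```python
import numpy as np


class WindowedGuard:
    """Hypothetical sketch of windowed scoring with 50% overlap and a cooldown."""

    def __init__(self, score_fn, window_samples=64000, confidence_threshold=0.5,
                 event_cooldown_s=5.0, sample_rate_hz=16000):
        self.score_fn = score_fn                  # stand-in for the AST model
        self.window_samples = window_samples
        self.hop_samples = window_samples // 2    # 50% overlap
        self.confidence_threshold = confidence_threshold
        self.event_cooldown_s = event_cooldown_s
        self.sample_rate_hz = sample_rate_hz
        self._buffer = np.zeros(0, dtype=np.int16)
        self._buffer_start = 0                    # absolute index of _buffer[0]
        self._last_alert_end_s = None

    def push(self, chunk: np.ndarray) -> list:
        """Add an int16 chunk; return the alerts raised by completed windows."""
        alerts = []
        if chunk.size == 0:                       # empty chunks are skipped
            return alerts
        self._buffer = np.concatenate([self._buffer, chunk.astype(np.int16)])
        while self._buffer.size >= self.window_samples:
            window = self._buffer[: self.window_samples]
            start_s = self._buffer_start / self.sample_rate_hz
            end_s = (self._buffer_start + self.window_samples) / self.sample_rate_hz
            cooling = (self._last_alert_end_s is not None
                       and end_s - self._last_alert_end_s < self.event_cooldown_s)
            if not cooling:                       # analysis skipped during cooldown
                confidence = float(self.score_fn(window.astype(np.float32) / 32768.0))
                if confidence >= self.confidence_threshold:
                    alerts.append({
                        "audio": window.copy(),
                        "event_detected": True,
                        "event_confidence": confidence,
                        "start_timestamp_sec": start_s,
                        "end_timestamp_sec": end_s,
                    })
                    self._last_alert_end_s = end_s
            self._buffer = self._buffer[self.hop_samples:]
            self._buffer_start += self.hop_samples
        return alerts
```

Injecting `score_fn` keeps the windowing logic testable without loading the model; in the real node that role is played by the feature extractor and classifier.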
Model
- ID: `MIT/ast-finetuned-audioset-10-10-0.4593`
- Training: AudioSet (527 classes)
- Inference: `torch.sigmoid(logits)` per class
Scenario label maps (built-in)
A class matches a scenario when a target string is a substring of the lowercased class label: `target in class_label.lower()`.
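To make the matching rule concrete, the sketch below applies a per-class sigmoid and keeps only classes whose label contains one of the scenario targets. `best_scenario_match`, the four-label list, and the targets are hypothetical; the real model scores all 527 AudioSet classes, and the built-in scenario maps are not reproduced here.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def best_scenario_match(logits, class_labels, targets):
    """Return (label, probability) of the best matching class, or (None, 0.0)."""
    best_label, best_prob = None, 0.0
    for logit, class_label in zip(logits, class_labels):
        # Substring test against the lowercased label, as described above.
        if any(target in class_label.lower() for target in targets):
            prob = sigmoid(logit)
            if prob > best_prob:
                best_label, best_prob = class_label, prob
    return best_label, best_prob


# Illustrative subset of labels and made-up logits.
labels = ["Glass", "Shatter", "Speech", "Dog"]
logits = [2.0, 0.5, 4.0, -1.0]
label, prob = best_scenario_match(logits, labels, ["glass", "shatter"])
```

Because the sigmoid is applied per class rather than as a softmax, "Speech" having the highest logit does not suppress the "Glass" score; it is simply filtered out by the scenario targets. An empty target list matches nothing, which is why a custom scenario with no labels never alerts.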
Buffer presets (shared field)
| Preset | Seconds | Samples @ 16 kHz | Typical modality |
|---|---|---|---|
| `none` | — | — | VA: full utterance |
| `wake-word` | 0.08 | 1280 | VA downstream / not recommended for SSG |
| `speech-to-text` | 4.0 | 64000 | VA STT chunks / SSG default analysis window |
| `custom` | `output_buffer_size_s` | — | User-defined |
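The sample counts in the table follow from seconds × 16000. A small sketch of that conversion, with `preset_to_samples` as a hypothetical helper name:

```python
SAMPLE_RATE_HZ = 16000

# Preset durations taken from the table; None means "no re-chunking".
PRESET_SECONDS = {
    "none": None,
    "wake-word": 0.08,
    "speech-to-text": 4.0,
}


def preset_to_samples(preset: str, output_buffer_size_s=None):
    """Return the buffer/window length in samples, or None for the 'none' preset."""
    if preset == "custom":
        if output_buffer_size_s is None or output_buffer_size_s <= 0:
            raise ValueError("custom preset requires output_buffer_size_s > 0")
        seconds = output_buffer_size_s
    else:
        seconds = PRESET_SECONDS[preset]
    if seconds is None:
        return None
    return int(round(seconds * SAMPLE_RATE_HZ))
```

Rounding before the integer conversion avoids off-by-one results from floating-point products such as `0.08 * 16000`.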
Alert attribution
On an SSG alert, `_source_summary` includes `kind: audio_assistant`, `modality: sound_security_guard`, `sub_modality`, `event_label`, and `event_confidence` for Send Alert `source_chain` metadata.
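For orientation, the summary could be shaped like the dictionary built below. The keys and the two fixed values come from the description above; `build_source_summary` and the example arguments are hypothetical, and the real structure may carry additional fields.

```python
def build_source_summary(sub_modality: str, event_label: str,
                         event_confidence: float) -> dict:
    """Hypothetical builder for the alert attribution summary."""
    return {
        "kind": "audio_assistant",
        "modality": "sound_security_guard",
        "sub_modality": sub_modality,          # active scenario key
        "event_label": event_label,            # best matching AudioSet label
        "event_confidence": float(event_confidence),
    }


# Example values are made up for illustration.
summary = build_source_summary("custom", "Glass", 0.88)
```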
Edge cases
| Case | Behavior |
|---|---|
| Empty audio chunk | Skipped |
| SSG during cooldown | Analysis skipped until cooldown expires |
| Custom scenario with no labels | No matches; never alerts |
| Upstream non-16 kHz | Resampled by audio ingress |
| VA oversized chunks | Reframed to 512 samples with one-time warning |