Node inputs (configuration)

Stored in node.parameters unless wired from upstream:
Field                  Description
modality               voice_assistant or sound_security_guard
sub_modality           VAD profile or security scenario
vad_*                  Silero thresholds (advanced)
output_buffer_preset   VA: output re-chunking; SSG: analysis window
output_buffer_size_s   Custom buffer/window seconds
confidence_threshold   SSG alert threshold (0–1)
event_cooldown_s       SSG seconds between alerts
custom_event_labels    SSG custom scenario label list

Runtime audio is resolved automatically from upstream audio via _CwAudioIngress (no manual sample_rate_hz input).

VA outputs

Output                Type                          When emitted
audio                 AUDIO (int16 @ 16 kHz mono)   Speech segment complete
speech_probability    NUMBER                        Every chunk
is_speaking           BOOLEAN                       Every chunk
start_timestamp_sec   NUMBER                        With audio
end_timestamp_sec     NUMBER                        With audio
sample_rate_hz        NUMBER                        Always 16000
channels              NUMBER                        Always 1

SSG outputs

Output                Type      When emitted
audio                 AUDIO     Alert fired (analysis window)
event_detected        BOOLEAN   Every chunk
event_confidence      NUMBER    Every chunk (best scenario match)
event_label           STRING    Best matching AudioSet label
event_type            STRING    Active scenario key
active_scenario       STRING    Same as event_type
start_timestamp_sec   NUMBER    With alert
end_timestamp_sec     NUMBER    With alert
sample_rate_hz        NUMBER    Always 16000
channels              NUMBER    Always 1

Audio ingress (shared)

All modalities use _CwAudioIngress.adapt_safe():
  • Canonical format: numpy int16, 16000 Hz, mono, key audio
  • Accepted inputs: int16/float32 numpy, raw PCM bytes, WAV bytes
  • Passthrough: int16 @ 16 kHz mono is zero-copy
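
The conversion rules above can be illustrated with a minimal sketch. The function name to_canonical, its parameters, and the (n_samples, n_channels) layout for multi-channel input are assumptions made for this example; the real adapt_safe() signature is not documented here. The resampler is plain linear interpolation, chosen only to keep the sketch dependency-free, and is not claimed to be what the ingress uses.

```python
import io
import wave

import numpy as np

TARGET_RATE_HZ = 16_000


def to_canonical(audio, sample_rate_hz=TARGET_RATE_HZ, channels=1):
    """Hypothetical stand-in for the ingress step: return mono int16 at 16 kHz."""
    if isinstance(audio, (bytes, bytearray)):
        data = bytes(audio)
        if data[:4] == b"RIFF" and data[8:12] == b"WAVE":
            # WAV bytes: the header overrides the caller's rate and channel count.
            with wave.open(io.BytesIO(data), "rb") as wav:
                if wav.getsampwidth() != 2:
                    raise ValueError("this sketch only handles 16-bit WAV")
                sample_rate_hz = wav.getframerate()
                channels = wav.getnchannels()
                data = wav.readframes(wav.getnframes())
        # Raw PCM is assumed to be little-endian int16, interleaved by channel.
        audio = np.frombuffer(data, dtype="<i2")
        if channels > 1:
            audio = audio.reshape(-1, channels)
    audio = np.asarray(audio)

    # Fast path: already canonical, so hand back the same array without copying.
    if audio.dtype == np.int16 and audio.ndim == 1 and sample_rate_hz == TARGET_RATE_HZ:
        return audio

    samples = audio.astype(np.float32)
    if audio.dtype == np.int16:
        samples /= 32768.0  # bring int16 into [-1, 1)
    if samples.ndim == 2:
        samples = samples.mean(axis=1)  # downmix (n_samples, n_channels) to mono

    if sample_rate_hz != TARGET_RATE_HZ and samples.size:
        # Linear interpolation: simple, but not production-quality resampling.
        n_out = int(round(samples.size * TARGET_RATE_HZ / sample_rate_hz))
        src_t = np.arange(samples.size) / sample_rate_hz
        dst_t = np.arange(n_out) / TARGET_RATE_HZ
        samples = np.interp(dst_t, src_t, samples)

    return np.clip(samples * 32768.0, -32768, 32767).astype(np.int16)
```

Calling to_canonical(chunk) on an array that is already int16, mono, and 16 kHz returns the same object, which mirrors the zero-copy passthrough described above.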

VA architecture

_CwAudioStreamProcessor + Silero VADIterator:
  1. Incoming chunks reframed to 512 samples (32 ms @ 16 kHz)
  2. Ring buffer stores float32 audio; indices from VAD mark utterance bounds
  3. On speech end: slice buffer → int16 segment → emit
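
As a rough illustration of steps 1 and 3, the sketch below regroups arbitrary-length chunks into 512-sample frames and converts a float32 slice back to int16. The Reframer class and slice_segment function are hypothetical names for this example; the Silero VADIterator call and the ring buffer itself are left out, so this is not the node's actual implementation.

```python
import numpy as np

FRAME_SAMPLES = 512  # 32 ms at 16 kHz


class Reframer:
    """Hypothetical helper: regroup incoming chunks into fixed 512-sample frames."""

    def __init__(self, frame_samples=FRAME_SAMPLES):
        self.frame_samples = frame_samples
        self._pending = np.zeros(0, dtype=np.float32)

    def push(self, chunk_int16):
        """Accept an int16 chunk and return the complete float32 frames it yields."""
        chunk = np.asarray(chunk_int16, dtype=np.int16).astype(np.float32) / 32768.0
        self._pending = np.concatenate([self._pending, chunk])
        cut = (self._pending.size // self.frame_samples) * self.frame_samples
        frames = [
            self._pending[i:i + self.frame_samples]
            for i in range(0, cut, self.frame_samples)
        ]
        self._pending = self._pending[cut:]  # remainder waits for the next chunk
        return frames


def slice_segment(buffer_f32, start, end):
    """Cut [start, end) out of a float32 buffer and convert it to int16."""
    segment = buffer_f32[start:end] * 32768.0
    return np.clip(segment, -32768, 32767).astype(np.int16)
```

Each frame returned by push() would be passed to the VAD; the start and end indices the VAD reports would then be the arguments to slice_segment().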

SSG architecture

_CwAaAstGuard + Hugging Face AST:
  1. Accumulate int16 chunks until window_samples (from buffer preset, default 4 s)
  2. Convert to float32, run ASTFeatureExtractor + ASTForAudioClassification
  3. Multi-label sigmoid over 527 classes; filter by active scenario labels only
  4. If max(matching_probs) >= confidence_threshold → emit alert + audio window
  5. Slide buffer by hop_samples (50% overlap); enforce event_cooldown_s
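
The loop in steps 1 to 5 can be sketched as follows. WindowedGuard and score_fn are hypothetical: score_fn stands in for the AST model plus scenario filtering and returns a single confidence in [0, 1]. The default threshold and cooldown values, and the choice to measure cooldown from the end of the last alerting window, are assumptions for this example only.

```python
SAMPLE_RATE_HZ = 16_000


class WindowedGuard:
    """Hypothetical sketch of the accumulate / score / slide / cooldown loop."""

    def __init__(self, score_fn, window_s=4.0, confidence_threshold=0.5,
                 event_cooldown_s=10.0):
        self.score_fn = score_fn
        self.window_samples = int(window_s * SAMPLE_RATE_HZ)
        self.hop_samples = self.window_samples // 2  # 50% overlap
        self.confidence_threshold = confidence_threshold
        self.cooldown_samples = int(event_cooldown_s * SAMPLE_RATE_HZ)
        self._buffer = []
        self._buffer_start = 0       # absolute sample index of _buffer[0]
        self._last_alert_end = None  # absolute sample index, or None

    def push(self, chunk):
        """Append samples and return the alerts fired while consuming them."""
        alerts = []
        self._buffer.extend(chunk)
        while len(self._buffer) >= self.window_samples:
            window = self._buffer[:self.window_samples]
            start = self._buffer_start
            end = start + self.window_samples
            in_cooldown = (
                self._last_alert_end is not None
                and end - self._last_alert_end < self.cooldown_samples
            )
            if not in_cooldown:  # analysis is skipped entirely during cooldown
                confidence = self.score_fn(window)
                if confidence >= self.confidence_threshold:
                    alerts.append({
                        "event_detected": True,
                        "event_confidence": confidence,
                        "start_timestamp_sec": start / SAMPLE_RATE_HZ,
                        "end_timestamp_sec": end / SAMPLE_RATE_HZ,
                        "audio": window,
                    })
                    self._last_alert_end = end
            # Slide forward by one hop, keeping the overlapping half.
            del self._buffer[:self.hop_samples]
            self._buffer_start += self.hop_samples
        return alerts
```

With a 1 s window and a 1 s cooldown, a source that always scores above the threshold alerts on every second window, because the overlapping window in between still falls inside the cooldown.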

Model

  • ID: MIT/ast-finetuned-audioset-10-10-0.4593
  • Training: AudioSet (527 classes)
  • Inference: torch.sigmoid(logits) per class
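
To make the per-class sigmoid concrete, here is a small dependency-free sketch. It uses the standard math module rather than torch so it runs anywhere; the function names are illustrative and not part of the node. The point it shows is that each class is scored independently, so the probabilities do not sum to 1 as they would under softmax.

```python
import math


def sigmoid(logit):
    """Numerically stable logistic function for a single logit."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)


def multilabel_probs(logits):
    """Score every class independently; the results need not sum to 1."""
    return [sigmoid(z) for z in logits]
```

Because the scores are independent, several AudioSet classes can be high at once, which is why the scenario filter takes the maximum over matching classes instead of an argmax over all 527.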

Scenario label maps (built-in)

glass_break      → Glass, Shatter, Smash crash, Thump, Crash, Bang
constant_alarm   → Alarm, Car alarm, Smoke detector, Siren, …
screaming        → Screaming, Yell, Shout, Children shouting
custom           → user-provided substrings
Matching is case-insensitive substring: target in class_label.lower().
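
A minimal sketch of that matching rule, combined with the "best scenario match" selection, might look like this. The function name best_match is hypothetical, the class labels and probabilities are a tiny made-up sample rather than real model output, and skipping empty target strings is an assumption of this example.

```python
def best_match(class_labels, probs, targets):
    """Return (label, prob) for the top class whose label contains any target.

    Matching is a case-insensitive substring test. Returns (None, 0.0) when
    nothing matches, for example a custom scenario with no labels.
    """
    lowered = [t.lower() for t in targets if t]  # assumption: ignore empty targets
    best = (None, 0.0)
    for label, prob in zip(class_labels, probs):
        if any(t in label.lower() for t in lowered) and prob > best[1]:
            best = (label, prob)
    return best


# Illustrative data only: five labels instead of the full 527.
labels = ["Speech", "Glass", "Shatter", "Screaming", "Children shouting"]
probs = [0.9, 0.2, 0.7, 0.4, 0.6]

print(best_match(labels, probs, ["Glass", "Shatter"]))  # ('Shatter', 0.7)
print(best_match(labels, probs, ["shout"]))             # ('Children shouting', 0.6)
print(best_match(labels, probs, []))                    # (None, 0.0)
```

Note that "Speech" has the highest probability but is ignored in every call, since only labels matching the active scenario are considered.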

Buffer presets (shared field)

Preset           Seconds                Samples @ 16 kHz   Typical modality
none             -                      -                  VA: full utterance
wake-word        0.08                   1280               VA downstream / not recommended for SSG
speech-to-text   4.0                    64000              VA STT chunks / SSG default analysis window
custom           output_buffer_size_s   -                  User-defined

SSG enforces a minimum analysis window of 1.0 s.
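
The preset arithmetic (seconds × 16000) and the SSG minimum can be sketched as below. resolve_buffer_samples is a hypothetical helper; in particular, falling back to the 4 s default when SSG is given the none preset is an assumption of this example, not documented behavior.

```python
SAMPLE_RATE_HZ = 16_000
SSG_MIN_WINDOW_S = 1.0
SSG_DEFAULT_WINDOW_S = 4.0
PRESET_SECONDS = {"none": None, "wake-word": 0.08, "speech-to-text": 4.0}


def resolve_buffer_samples(preset, modality, custom_seconds=None):
    """Return the buffer/window length in samples, or None for 'no re-chunking'."""
    if preset == "custom":
        if custom_seconds is None:
            raise ValueError("custom preset requires output_buffer_size_s")
        seconds = float(custom_seconds)
    else:
        seconds = PRESET_SECONDS[preset]

    if modality == "sound_security_guard":
        if seconds is None:
            seconds = SSG_DEFAULT_WINDOW_S  # assumption: SSG always needs a window
        seconds = max(seconds, SSG_MIN_WINDOW_S)  # enforce the 1.0 s minimum

    if seconds is None:
        return None  # VA with 'none': emit the full utterance
    return int(round(seconds * SAMPLE_RATE_HZ))
```

For example, the wake-word preset resolves to 1280 samples for VA but is raised to 16000 samples for SSG because of the 1.0 s minimum.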

Alert attribution

On SSG alert, _source_summary includes kind: audio_assistant, modality: sound_security_guard, sub_modality, event_label, and event_confidence for Send Alert source_chain metadata.

Edge cases

Case                             Behavior
Empty audio chunk                Skipped
SSG during cooldown              Analysis skipped until cooldown expires
Custom scenario with no labels   No matches; never alerts
Upstream non-16 kHz              Resampled by audio ingress
VA oversized chunks              Reframed to 512 samples with one-time warning