Technical Reference - Cyberwave Docs

Node inputs (configuration)

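Values are stored in node.parameters unless wired from an upstream node. In the tables below, VA is the `voice_assistant` modality and SSG is the `sound_security_guard` modality: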

Field	VA	SSG	Description
`modality`	✓	✓	`voice_assistant` or `sound_security_guard`
`sub_modality`	✓	✓	VAD profile or security scenario
`vad_*`	✓	—	Silero thresholds (advanced)
`output_buffer_preset`	✓	✓	VA: output re-chunking; SSG: analysis window
`output_buffer_size_s`	✓	✓	Custom buffer/window seconds
`confidence_threshold`	—	✓	SSG alert threshold (0–1)
`event_cooldown_s`	—	✓	SSG seconds between alerts
`custom_event_labels`	—	✓	SSG custom scenario label list

The runtime audio format is resolved automatically from the upstream audio by _CwAudioIngress, so there is no manual sample_rate_hz input.
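
For illustration, an SSG parameter set might look like the sketch below. The field names come from the table above; the values, the `ssg_parameters` name, and the `validate_ssg_parameters` helper are hypothetical and not part of the node's API. Using a scenario key such as `glass_break` for `sub_modality` is an assumption based on the scenario list later on this page.

```python
# Illustrative SSG parameter set. Field names are from the table above;
# every value here is an example, not a documented default.
ssg_parameters = {
    "modality": "sound_security_guard",
    "sub_modality": "glass_break",          # assumed: a built-in scenario key
    "output_buffer_preset": "speech-to-text",
    "confidence_threshold": 0.6,
    "event_cooldown_s": 10.0,
}


def validate_ssg_parameters(params: dict) -> list[str]:
    """Hypothetical sanity check. Returns a list of problems (empty if none)."""
    problems = []
    if params.get("modality") != "sound_security_guard":
        problems.append("modality must be 'sound_security_guard'")
    threshold = params.get("confidence_threshold")
    if threshold is not None and not 0.0 <= threshold <= 1.0:
        problems.append("confidence_threshold must be within 0-1")
    cooldown = params.get("event_cooldown_s")
    if cooldown is not None and cooldown < 0:
        problems.append("event_cooldown_s must be non-negative")
    return problems
```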

VA outputs

Output	Type	When emitted
`audio`	AUDIO (int16 @ 16 kHz mono)	Speech segment complete
`speech_probability`	NUMBER	Every chunk
`is_speaking`	BOOLEAN	Every chunk
`start_timestamp_sec`	NUMBER	With `audio`
`end_timestamp_sec`	NUMBER	With `audio`
`sample_rate_hz`	NUMBER	Always `16000`
`channels`	NUMBER	Always `1`

SSG outputs

Output	Type	When emitted
`audio`	AUDIO	Alert fired (analysis window)
`event_detected`	BOOLEAN	Every chunk
`event_confidence`	NUMBER	Every chunk (best scenario match)
`event_label`	STRING	Best matching AudioSet label
`event_type`	STRING	Active scenario key
`active_scenario`	STRING	Same as `event_type`
`start_timestamp_sec`	NUMBER	With alert
`end_timestamp_sec`	NUMBER	With alert
`sample_rate_hz`	NUMBER	Always `16000`
`channels`	NUMBER	Always `1`

Audio ingress (shared)

All modalities use _CwAudioIngress.adapt_safe():

Canonical format: numpy int16, 16000 Hz, mono, key audio
Accepted inputs: int16/float32 numpy, raw PCM bytes, WAV bytes
Passthrough: int16 @ 16 kHz mono is zero-copy
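
The conversion rules above can be sketched roughly as follows. This is not the `_CwAudioIngress` implementation: `to_canonical` is a hypothetical helper, raw PCM bytes are assumed to be 16-bit little-endian mono, WAV input is assumed to be 16-bit PCM, and resampling uses plain linear interpolation for brevity (the real resampler is not documented here).

```python
import io
import wave

import numpy as np

TARGET_RATE_HZ = 16_000


def to_canonical(audio, sample_rate_hz: int = TARGET_RATE_HZ) -> np.ndarray:
    """Convert supported inputs to int16, 16 kHz, mono (illustrative only)."""
    if isinstance(audio, (bytes, bytearray)):
        data = bytes(audio)
        if data[:4] == b"RIFF" and data[8:12] == b"WAVE":
            # WAV container: the header supplies rate and channel count.
            with wave.open(io.BytesIO(data), "rb") as wav:
                sample_rate_hz = wav.getframerate()
                channels = wav.getnchannels()
                pcm = wav.readframes(wav.getnframes())
            audio = np.frombuffer(pcm, dtype="<i2").reshape(-1, channels)
        else:
            # Raw PCM: assumed int16 little-endian mono at sample_rate_hz.
            audio = np.frombuffer(data, dtype="<i2")
    audio = np.asarray(audio)
    if audio.ndim == 2 and audio.shape[1] == 1:
        audio = audio[:, 0]  # single channel stored as (frames, 1)
    # Passthrough: already canonical, so return the same array without copying.
    if audio.dtype == np.int16 and audio.ndim == 1 and sample_rate_hz == TARGET_RATE_HZ:
        return audio
    samples = audio.astype(np.float32)
    if audio.dtype == np.int16:
        samples /= 32768.0  # scale int16 to [-1, 1)
    if samples.ndim == 2:
        samples = samples.mean(axis=1)  # downmix (frames, channels) to mono
    if samples.size == 0:
        return np.zeros(0, dtype=np.int16)
    if sample_rate_hz != TARGET_RATE_HZ:
        n_out = int(round(len(samples) * TARGET_RATE_HZ / sample_rate_hz))
        t_old = np.arange(len(samples)) / sample_rate_hz
        t_new = np.arange(n_out) / TARGET_RATE_HZ
        samples = np.interp(t_new, t_old, samples)  # linear-interpolation resample
    return (np.clip(samples, -1.0, 1.0) * 32767.0).round().astype(np.int16)
```

Linear interpolation does not low-pass filter before downsampling, so a production path would normally use a proper polyphase or band-limited resampler.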

VA architecture

_CwAudioStreamProcessor + Silero VADIterator:

Incoming chunks reframed to 512 samples (32 ms @ 16 kHz)
Ring buffer stores float32 audio; indices from VAD mark utterance bounds
On speech end: slice buffer → int16 segment → emit
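
A minimal sketch of the reframing and segment-extraction steps, leaving the VAD itself out. `Reframer` and `segment_to_int16` are hypothetical names, not the internals of `_CwAudioStreamProcessor`; the start and end indices would come from the VAD in the real node.

```python
import numpy as np

FRAME_SAMPLES = 512  # 32 ms at 16 kHz


class Reframer:
    """Collects arbitrary-length int16 chunks and returns fixed-size float32 frames."""

    def __init__(self, frame_samples: int = FRAME_SAMPLES) -> None:
        self.frame_samples = frame_samples
        self._pending = np.zeros(0, dtype=np.float32)

    def push(self, chunk: np.ndarray) -> list[np.ndarray]:
        # Scale int16 to float32 in [-1, 1) and append to the leftover samples.
        as_float = chunk.astype(np.float32) / 32768.0
        self._pending = np.concatenate([self._pending, as_float])
        cut = (len(self._pending) // self.frame_samples) * self.frame_samples
        frames = [
            self._pending[i:i + self.frame_samples]
            for i in range(0, cut, self.frame_samples)
        ]
        self._pending = self._pending[cut:]  # keep the remainder for the next call
        return frames


def segment_to_int16(buffer: np.ndarray, start: int, end: int) -> np.ndarray:
    """Slice a float32 buffer at the given sample indices and convert to int16."""
    segment = np.clip(buffer[start:end], -1.0, 1.0)
    return (segment * 32767.0).round().astype(np.int16)
```

Each returned frame would be passed to the VAD; a chunk of 1300 samples yields two frames and leaves 276 samples pending.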

SSG architecture

_CwAaAstGuard + Hugging Face AST:

Accumulate int16 chunks until window_samples (from buffer preset, default 4 s)
Convert to float32, run ASTFeatureExtractor + ASTForAudioClassification
Multi-label sigmoid over 527 classes; filter by active scenario labels only
If max(matching_probs) >= confidence_threshold → emit alert + audio window
Slide buffer by hop_samples (50% overlap); enforce event_cooldown_s
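
The windowing, threshold, and cooldown logic above can be sketched as follows. `WindowedDetector` is a hypothetical class, and `score_fn` stands in for the AST model plus scenario filtering. The cooldown is measured in stream time here, which is an assumption; the real node may use wall-clock time.

```python
import numpy as np

SAMPLE_RATE_HZ = 16_000


class WindowedDetector:
    """Sliding-window alerting with 50% overlap and a cooldown (illustrative)."""

    def __init__(self, score_fn, confidence_threshold: float,
                 event_cooldown_s: float, window_s: float = 4.0) -> None:
        self.score_fn = score_fn  # maps a float32 window to a confidence in [0, 1]
        self.confidence_threshold = confidence_threshold
        self.event_cooldown_s = event_cooldown_s
        self.window_samples = int(round(window_s * SAMPLE_RATE_HZ))
        self.hop_samples = self.window_samples // 2  # 50% overlap
        self._buffer = np.zeros(0, dtype=np.int16)
        self._consumed = 0         # samples already slid out of the buffer
        self._last_alert_s = None  # stream time of the most recent alert

    def push(self, chunk: np.ndarray) -> list[dict]:
        alerts = []
        if chunk.size == 0:
            return alerts  # empty chunks are skipped
        self._buffer = np.concatenate([self._buffer, chunk])
        while len(self._buffer) >= self.window_samples:
            window = self._buffer[:self.window_samples]
            start_s = self._consumed / SAMPLE_RATE_HZ
            end_s = start_s + self.window_samples / SAMPLE_RATE_HZ
            in_cooldown = (
                self._last_alert_s is not None
                and end_s - self._last_alert_s < self.event_cooldown_s
            )
            if not in_cooldown:  # analysis is skipped entirely during cooldown
                confidence = float(self.score_fn(window.astype(np.float32) / 32768.0))
                if confidence >= self.confidence_threshold:
                    alerts.append({
                        "audio": window.copy(),
                        "event_confidence": confidence,
                        "start_timestamp_sec": start_s,
                        "end_timestamp_sec": end_s,
                    })
                    self._last_alert_s = end_s
            # Slide forward by half a window.
            self._buffer = self._buffer[self.hop_samples:]
            self._consumed += self.hop_samples
        return alerts
```

With a 4 s window and a 3 s cooldown, 8 s of audio that always scores above the threshold produces alerts for the windows ending at 4 s and 8 s; the window ending at 6 s falls inside the cooldown.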

Model

ID: MIT/ast-finetuned-audioset-10-10-0.4593
Training: AudioSet (527 classes)
Inference: torch.sigmoid(logits) per class

Scenario label maps (built-in)

glass_break      → Glass, Shatter, Smash crash, Thump, Crash, Bang
constant_alarm   → Alarm, Car alarm, Smoke detector, Siren, …
screaming        → Screaming, Yell, Shout, Children shouting
custom           → user-provided substrings

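Matching is a case-insensitive substring test: a class matches a scenario when `target in class_label.lower()` holds for any of the scenario's targets (compared in lowercase).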
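
A small sketch of how per-class sigmoid scores and substring matching combine to pick the best scenario match. `best_match` is a hypothetical helper, and the four-class label list in the usage lines is a toy stand-in for the 527 AudioSet classes.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def best_match(logits, class_labels, targets):
    """Return (label, probability) for the best class matching any target, else None."""
    best = None
    for logit, label in zip(logits, class_labels):
        lowered = label.lower()
        # Case-insensitive substring test against every scenario target.
        if any(target.lower() in lowered for target in targets):
            prob = sigmoid(logit)  # independent per-class probability (multi-label)
            if best is None or prob > best[1]:
                best = (label, prob)
    return best


# Toy example: "Speech" has the highest score but is not a scenario label.
labels = ["Speech", "Glass", "Shatter", "Dog"]
logits = [3.0, 0.0, 2.0, -1.0]
match = best_match(logits, labels, ["glass", "shatter"])
```

Here `match` is the `Shatter` class. An empty target list returns `None`, which corresponds to the custom scenario with no labels never alerting.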

Buffer presets (shared field)

Preset	Seconds	Samples @ 16 kHz	Typical modality
`none`	—	—	VA: full utterance
`wake-word`	0.08	1280	VA downstream / not recommended for SSG
`speech-to-text`	4.0	64000	VA STT chunks / SSG default analysis window
`custom`	`output_buffer_size_s`	—	User-defined

SSG enforces a minimum analysis window of 1.0 s.
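
The preset table and the SSG minimum can be expressed as a small lookup. `resolve_buffer_samples` is a hypothetical helper. Two behaviors are assumptions rather than documented facts: windows shorter than 1.0 s are clamped up instead of rejected, and SSG with the `none` preset falls back to the 4 s default.

```python
SAMPLE_RATE_HZ = 16_000
PRESET_SECONDS = {"none": None, "wake-word": 0.08, "speech-to-text": 4.0}
SSG_MIN_WINDOW_S = 1.0
SSG_DEFAULT_WINDOW_S = 4.0


def resolve_buffer_samples(preset: str, modality: str, output_buffer_size_s=None):
    """Return the buffer/window length in samples, or None for 'full utterance'."""
    if preset == "custom":
        if output_buffer_size_s is None:
            raise ValueError("custom preset requires output_buffer_size_s")
        seconds = output_buffer_size_s
    else:
        seconds = PRESET_SECONDS[preset]
    if modality == "sound_security_guard":
        if seconds is None:
            seconds = SSG_DEFAULT_WINDOW_S        # assumed fallback for 'none'
        seconds = max(seconds, SSG_MIN_WINDOW_S)  # assumed clamp to the 1.0 s minimum
    if seconds is None:
        return None
    return int(round(seconds * SAMPLE_RATE_HZ))
```

For example, `wake-word` resolves to 1280 samples for VA, while the same preset under SSG is raised to 16000 samples by the minimum-window rule.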

Alert attribution

On SSG alert, _source_summary includes kind: audio_assistant, modality: sound_security_guard, sub_modality, event_label, and event_confidence for Send Alert source_chain metadata.

Edge cases

Case	Behavior
Empty audio chunk	Skipped
SSG during cooldown	Analysis skipped until cooldown expires
Custom scenario with no labels	No matches; never alerts
Upstream non-16 kHz	Resampled by audio ingress
VA oversized chunks	Reframed to 512 samples with one-time warning
