Pipeline position
Modalities
| Modality | Purpose | Engine |
|---|---|---|
| Voice Assistant (VA) | Segment speech utterances | Silero VAD |
| Sound Security Guard (SSG) | Detect security-related acoustic events | MIT AST (AudioSet) |
Shared audio contract
| Direction | Key | Format |
|---|---|---|
| Input | audio | numpy int16 (PCM S16LE), float32, raw bytes, or WAV; adapted to int16 mono @ 16 kHz |
| Output | audio | int16 mono @ 16 kHz (when a segment or alert window is emitted) |
| Output | sample_rate_hz | Always 16000 |
| Output | channels | Always 1 |
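
As an illustration of the contract above, the following is a minimal sketch of how mixed inputs could be normalized to int16 mono at 16 kHz. It is not the node's actual adapter: the function name `to_contract`, the assumption that float32 input lies in [-1.0, 1.0], the assumption that raw bytes are PCM S16LE, and the use of linear interpolation for resampling are all hypothetical choices made for this example.

```python
import io
import wave

import numpy as np

TARGET_RATE_HZ = 16_000  # sample rate stated in the contract table


def to_contract(audio, sample_rate_hz=TARGET_RATE_HZ, channels=1):
    """Return int16 mono samples at 16 kHz (hypothetical helper)."""
    if isinstance(audio, (bytes, bytearray)):
        data = bytes(audio)
        if data[:4] == b"RIFF" and data[8:12] == b"WAVE":
            # WAV container: take rate and channel count from the header.
            with wave.open(io.BytesIO(data), "rb") as wav:
                if wav.getsampwidth() != 2:
                    raise ValueError("this sketch only handles 16-bit WAV")
                sample_rate_hz = wav.getframerate()
                channels = wav.getnchannels()
                data = wav.readframes(wav.getnframes())
        # Assumption: raw bytes are PCM S16LE, interleaved if multi-channel.
        samples = np.frombuffer(data, dtype="<i2").astype(np.float32) / 32768.0
    else:
        arr = np.asarray(audio)
        if arr.dtype == np.int16:
            samples = arr.astype(np.float32) / 32768.0
        elif arr.dtype == np.float32:
            samples = arr  # assumption: already scaled to [-1.0, 1.0]
        else:
            raise TypeError(f"unsupported dtype: {arr.dtype}")
        samples = samples.reshape(-1)  # (frames, channels) becomes interleaved

    if channels > 1:
        # Downmix by averaging the interleaved channels.
        samples = samples.reshape(-1, channels).mean(axis=1)

    if sample_rate_hz != TARGET_RATE_HZ and samples.size:
        # Linear interpolation: simple, but not a band-limited resampler.
        n_out = int(round(samples.size * TARGET_RATE_HZ / sample_rate_hz))
        src_t = np.arange(samples.size) / sample_rate_hz
        dst_t = np.arange(n_out) / TARGET_RATE_HZ
        samples = np.interp(dst_t, src_t, samples)

    scaled = np.round(samples * 32768.0)
    return np.clip(scaled, -32768, 32767).astype(np.int16)
```

A production adapter would likely use a proper resampling filter; linear interpolation is used here only to keep the sketch dependency-free beyond NumPy.
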
Audio Track buffer presets (at 16 kHz)
| Preset | Chunk size | Typical use |
|---|---|---|
| Voice Assistant (32 ms) | 512 samples | VA streaming (matches Silero frame size) |
| Wake Word (80 ms) | 1280 samples | Wake Word Engine |
| Speech-To-Text (4 s) | 64000 samples | Batch / long windows |
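
The chunk sizes in the preset table follow directly from the 16 kHz sample rate: samples = rate × duration. The sketch below reproduces that arithmetic and shows one way a stream could be cut into fixed-size chunks. The names `chunk_samples` and `iter_chunks` are hypothetical, and dropping a trailing partial chunk is an assumption of this example, not documented behavior.

```python
import numpy as np

SAMPLE_RATE_HZ = 16_000

# Preset durations in milliseconds, taken from the table above.
PRESET_MS = {"Voice Assistant": 32, "Wake Word": 80, "Speech-To-Text": 4000}


def chunk_samples(duration_ms, sample_rate_hz=SAMPLE_RATE_HZ):
    """Number of samples in a buffer of the given duration."""
    return sample_rate_hz * duration_ms // 1000


def iter_chunks(audio, chunk):
    """Yield consecutive full chunks; a trailing partial chunk is dropped."""
    for start in range(0, len(audio) - chunk + 1, chunk):
        yield audio[start:start + chunk]


sizes = {name: chunk_samples(ms) for name, ms in PRESET_MS.items()}
# sizes == {"Voice Assistant": 512, "Wake Word": 1280, "Speech-To-Text": 64000}
```
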
Edge execution
Audio Assistant nodes are edge-only. Workflows compile to an awf_*.py worker module and sync to the device. The upstream Audio Track must be configured as follows:
- VA upstream Audio Track must use buffer preset Voice Assistant (32 ms).
- SSG upstream Audio Track should use Speech-To-Text (4 s) (or custom ≥1 s).
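
The two buffer rules above could be checked before a workflow is deployed. The following is a hedged sketch of such a check, not part of the product: the function `check_upstream_buffer` and the modality strings "VA" and "SSG" are hypothetical names chosen for this example. It treats the VA rule as an error ("must") and the SSG rule as a warning ("should").

```python
SAMPLE_RATE_HZ = 16_000
VA_CHUNK_SAMPLES = 512            # Voice Assistant (32 ms) preset
SSG_MIN_SAMPLES = SAMPLE_RATE_HZ  # custom SSG buffers should cover >= 1 s


def check_upstream_buffer(modality, chunk_samples):
    """Return a list of (severity, message) tuples; empty means OK."""
    if modality == "VA":
        if chunk_samples != VA_CHUNK_SAMPLES:
            return [("error",
                     f"VA needs {VA_CHUNK_SAMPLES}-sample chunks, "
                     f"got {chunk_samples}")]
        return []
    if modality == "SSG":
        if chunk_samples < SSG_MIN_SAMPLES:
            return [("warning",
                     f"SSG buffer is shorter than 1 s ({chunk_samples} samples)")]
        return []
    raise ValueError(f"unknown modality: {modality!r}")
```
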
Edge dependencies
The edge image installs zenoh, ml-vad, ml-aed, and ml-wakeword, and pre-downloads AST and OpenWakeWord weights for air-gapped use.
SSG on bare-metal edges downloads MIT/ast-finetuned-audioset-10-10-0.4593 on first run (~340 MB) unless baked into the image.
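
To avoid the first-run download on a bare-metal edge, the weights can be fetched ahead of time while the device still has network access. The sketch below assumes the SSG engine loads the model through the Hugging Face Hub cache, which this document does not confirm; the function names are hypothetical. `snapshot_download` and the `HF_HUB_OFFLINE` and `TRANSFORMERS_OFFLINE` environment variables are existing Hugging Face features.

```python
import os

AST_REPO_ID = "MIT/ast-finetuned-audioset-10-10-0.4593"


def prefetch_ast_weights(cache_dir=None):
    """Download the AST snapshot into the local Hub cache; return its path.

    Requires network access and the huggingface_hub package.
    """
    from huggingface_hub import snapshot_download  # imported lazily

    return snapshot_download(repo_id=AST_REPO_ID, cache_dir=cache_dir)


def force_offline():
    """Tell Hugging Face libraries to use only the local cache.

    Must run before huggingface_hub or transformers is imported,
    because the flags are read at import time.
    """
    os.environ["HF_HUB_OFFLINE"] = "1"
    os.environ["TRANSFORMERS_OFFLINE"] = "1"
```

Calling `prefetch_ast_weights()` once during provisioning and `force_offline()` at worker start-up would, under the stated assumption, keep an air-gapped device from attempting the download.
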