Architecture at a glance
Who provides what. Cyberwave provides: environment, digital twin, dataset recording, training, controller-policy deployment, MQTT command path. You provide: the mic, the STT engine, and the transcript normaliser.
Goals and scope
What you’ll build. A laptop-side voice agent that listens for a short command, turns it into text, sends that text as an instruction to a deployed VLA, and watches the SO-101 execute a pick-and-place on your workspace.
In scope
- A single SO-101 follower with wrist and top-down cameras.
- A fixed tabletop workspace with 2-3 known object-and-target pairs.
- A bounded vocabulary of ~5-10 phrasings trained into the VLA.
- Client-side microphone (Pattern A).
Out of scope
- Bimanual / multi-arm coordination.
- 6-DOF pose targets or free-form grasping.
- On-robot microphone (Pattern B), always-on wake-word, and dialog.
- Outdoor / unconstrained workspaces.
Prerequisites
This tutorial starts from a working SO-101 teleoperation setup. Do not skip the base setup; everything below assumes it is in place.
- Hardware: SO-101 leader + follower, wrist camera, top-down camera, workspace rectangle marked on the table.
- Credentials: Cyberwave API key, workspace with VLA training enabled.
- Base setup: SO-101 Get Started (environment, edge install, calibration, teleop).
- SDK baseline: Python SDK.
- Conceptual grounding: Key Concepts, Voice as a Sensor.
You’re ready for this tutorial when you can teleoperate the SO-101 follower from the leader and record an episode in Live Mode.
Design the task
Spend 20 minutes here before touching any robot. Decisions you lock in now become training constraints for the rest of the tutorial. Fill in this worksheet:

| Decision | Your answer |
|---|---|
| Object and target pairs (2-3) | e.g. red block → blue cup, pen → left drawer |
| Prompt vocabulary per pair (3-5 phrasings) | e.g. “pick the red block and place it in the blue cup”, “put the red block in the blue cup” |
| Camera pose (wrist + top-down) | Lock, mark mount positions, photograph |
| Lighting | One fixed lamp, shades closed, time-of-day independent |
| Workspace extent | Taped rectangle on the table, dimensions written down |
| Episode length target | e.g. 8-12 seconds |
| Episodes per pair | e.g. 30-50 to start |
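The worksheet decisions can also be kept in a small machine-readable table so that episode labels, the normaliser, and the training vocabulary all stay in sync. A minimal sketch; the pairs and phrasings below are illustrative placeholders, not platform values:

```python
# Illustrative task worksheet as data. The pairs and phrasings here are
# placeholders -- substitute the vocabulary you lock in above.
VOCABULARY = {
    ("red block", "blue cup"): [
        "pick the red block and place it in the blue cup",
        "put the red block in the blue cup",
        "move the red block into the blue cup",
    ],
    ("pen", "left drawer"): [
        "pick the pen and place it in the left drawer",
        "put the pen in the left drawer",
    ],
}

# Flat list of every phrasing; each recorded episode gets labelled with
# exactly one of these, and the normaliser maps heard text onto them.
ALL_PHRASINGS = [p for phrasings in VOCABULARY.values() for p in phrasings]
```

Keeping one source of truth for the vocabulary makes the "same phrasing label per pair" rule in the next section mechanical rather than a matter of discipline.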
Collect the dataset (teleop only)
Voice is not in the loop yet. This phase is pure teleoperation plus recording.
- Put the follower at zero pose and confirm both camera streams are live in the environment viewer.
- Open Live Mode and start recording.
- Using the leader, perform the task once per episode. One object-and-target pair, clean trajectory, no resets mid-episode.
- Label the episode with the exact prompt phrasing you plan to use at inference time. Pick one phrasing per episode from the vocabulary you locked in Design the task.
- Stop recording; trim dead time at start and end; discard any episode where the gripper slipped, the object moved unexpectedly, or you intervened.
- Repeat across all object-and-target pairs and all phrasings. Aim for balanced counts per pair.
- Consistent episode length, consistent starting pose.
- Same phrasing label for all episodes of the same pair and intent.
- Reject failures at recording time, not at training time.
Train and deploy the VLA
- In the dashboard, open AI → Train Model and point it at the dataset you recorded in Collect the dataset.
- Pick a base model. Start with a smaller VLA (SmolVLA-class) for faster iteration; graduate to Pi0 or OpenVLA only if the small model caps out.
- Start training. Watch validation loss. A healthy curve drops for the first few epochs and then flattens; if it never flattens, you likely need more data or more consistent labels, not more training time.
- When training completes, click Deploy as controller policy. Save the returned `controller_policy_uuid`.
- Attach the policy to the SO-101 follower twin. This can be done from the twin’s settings in the dashboard, or by setting `controller_policy_uuid` on the twin via the API (see `TwinSchema`).
- Verify before voice. From the dashboard, type one of your trained phrasings into the policy’s execute panel and run it in simulation. Then run it live. If this does not work, voice won’t either; fix it here before moving on.
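Attaching via the API amounts to a single field update on the twin. A hedged sketch: the base URL, endpoint path, auth header, and HTTP method below are illustrative assumptions; only the `controller_policy_uuid` field on the twin (see `TwinSchema`) comes from the platform. Check your workspace’s API reference for the real call.

```python
import json
import urllib.request

API_BASE = "https://api.cyberwave.example"  # placeholder base URL
API_KEY = "YOUR_API_KEY"                    # your Cyberwave API key


def attach_policy_payload(controller_policy_uuid: str) -> dict:
    """Body for the twin update: set its controller_policy_uuid."""
    return {"controller_policy_uuid": controller_policy_uuid}


def attach_policy(twin_uuid: str, controller_policy_uuid: str) -> None:
    # Hypothetical PATCH on the twin resource; consult TwinSchema and
    # your API docs for the authoritative path and auth scheme.
    req = urllib.request.Request(
        f"{API_BASE}/api/v1/twins/{twin_uuid}",
        data=json.dumps(attach_policy_payload(controller_policy_uuid)).encode(),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="PATCH",
    )
    urllib.request.urlopen(req)  # raises on non-2xx responses
```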
Introduce voice as a sensor
The dashboard prompt from Train and deploy the VLA already proved the policy works. Now we replace the dashboard prompt with a transcript from a microphone. Everything downstream (policy, twin, MQTT, hardware) is unchanged. Read the framing in Voice as a Sensor for the generic pattern. This tutorial uses Pattern A (client-side microphone) because it is the fastest way to get a working operator flow with no changes on the edge device or new sensor twins.

Operator experience. Operator speaks → 1-2 second pause → arm executes. No push-to-talk required if you put a VAD in front of the STT. No cloud round-trip if your STT runs locally.

Wire the voice agent
You are assembling four components on your laptop:
- Mic capture — any audio input library that gives you a 16 kHz mono float array.
- VAD (optional but recommended) — gate the STT so it only transcribes when speech is detected. Avoids hallucinated transcripts on silence.
- STT engine — local recommended. `base.en` Whisper is fine for a headset in a quiet room; move to `small.en` for a noisy lab.
- Normaliser — a tiny function that maps arbitrary heard text to the closest phrasing from your locked vocabulary. Start with keyword matching; graduate to a one-shot LLM classifier if phrasings diverge.
- Dispatcher — POSTs the normalised text as an `instruction` to `/api/v1/controller-policies/{uuid}/execute` with your `twin_uuid`. See `ControllerPolicyExecuteSchema` for the payload shape.
Run the demo
Print this checklist. Re-read it every demo day.

Pre-flight
- Workspace cleared, objects at the starting positions you locked in Design the task.
- Both cameras streaming in the environment viewer.
- Controller policy attached to the follower; last successful dashboard-prompt run was within the hour.
- Arm at zero pose. No alerts active on the twin.
Start the voice agent
Launch the Python process you built in Wire the voice agent. Confirm the normaliser mapping table prints at startup.
Observe execution
Watch the Live View twin move first; the physical arm follows over MQTT. Do not stand inside the workspace rectangle.
- No arm motion at all → dispatcher or policy attach (see Debugging playbook).
- Arm reaches the wrong object → VLA dataset coverage (see Evaluate and iterate).
- Arm hesitates then drifts → workspace or camera pose moved since training.
Safety and operational notes
- Controller-attach behaviour. Zero-pose and collision detection are enabled on attach. Queue-up of commands during this transition is expected.
- Physical E-stop. Keep it reachable. Software-only stops are for convenience, not safety.
- Simulation-first discipline. After any change to the voice agent, dataset, or policy, run the new flow against the digital twin before the physical arm. See Simulation.
- When alerts fire on the twin, pause the voice agent, resolve or acknowledge the alert, re-verify at zero pose, then resume.
- What invalidates your setup: camera moved, table moved, new lamp position, recalibration, major lighting change. Any one of these requires a re-verification pass against the baseline you locked in Design the task.
Debugging playbook
Scan top to bottom. Most failures land in the top three rows.

| Symptom | Likely layer | What to check |
|---|---|---|
| Nothing heard / empty transcript | Voice | Correct mic selected, VAD threshold, STT model loaded |
| Wrong words in transcript | Voice | STT model size, ambient noise, mic placement |
| Ambiguous phrasing accepted | Normaliser | Vocabulary drift, missing synonym rule, confidence threshold |
| Arm reaches but misses | VLA | Dataset coverage, camera pose matches training, object appearance |
| Arm hesitates or ignores object | VLA | Lighting changed, workspace shifted, object out of distribution |
| 403 / 400 on execute | Platform | API key, policy attached to twin, payload shape |
| Controller not attached | Platform | Twin has a `controller_policy_uuid` set and policy is deployed |
| Camera not live | Platform / Hardware | Camera twin, edge driver, USB bandwidth |
| Motor hot, servo jitter, gripper slip | Hardware | Duty cycle, calibration, gripper wear |
Evaluate and iterate
Do not iterate on feel. Measure.
- Baseline run: 10 attempts × 3 starting positions × each object-and-target pair. Log success / failure per attempt.
- Diagnose: bucket failures by likely layer using the Debugging playbook table. A pattern of “arm reaches wrong object” says re-record; a pattern of “wrong words” says the STT or normaliser is the problem, not the VLA.
- Enrich vs adjust vs retrain:
- Enrich the dataset when you see consistent VLA-layer failures on a specific pair or position.
- Adjust the prompt vocabulary / normaliser when the failure is in the text layer.
- Retrain only after enrichment, never as a first response.
- Capture failures automatically. Publish the heard transcript, the normalised instruction, and the `action_id` on telemetry every run (success or failure). See MQTT API for the `cyberwave/twin/{uuid}/telemetry` topic. Failed-run telemetry becomes your next dataset expansion list.
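Capturing those three fields per run can be one small helper in the voice agent. The topic name comes from the MQTT API; the payload field names below are illustrative assumptions, and publishing is shown with paho-mqtt as one option (commented out so the sketch stays dependency-free):

```python
import json
import time


def telemetry_payload(heard: str, instruction: str, action_id: str,
                      success: bool) -> str:
    """JSON blob for the run log: heard transcript, normalised
    instruction, action_id, and outcome. Field names are illustrative."""
    return json.dumps({
        "heard_transcript": heard,
        "normalised_instruction": instruction,
        "action_id": action_id,
        "success": success,
        "ts": time.time(),
    })


# Publishing with paho-mqtt (one option among many MQTT clients):
# import paho.mqtt.client as mqtt
# client = mqtt.Client()
# client.connect("your-broker-host", 1883)
# client.publish(f"cyberwave/twin/{twin_uuid}/telemetry",
#                telemetry_payload(heard, instruction, action_id, ok))
```

Querying this log for `"success": false` rows, grouped by instruction, gives you the dataset-enrichment list directly.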
Where to go next
Each of these is a follow-up tutorial’s worth of work. Pick one after you hit the success criteria you set in Goals and scope.
- Multi-step tasks: chaining pick-and-place actions in a single utterance.
- Dialog-style voice: asking the robot to confirm, clarify, or report status.
- On-robot microphone (Pattern B): mic on the robot, STT on the edge, telemetry-recorded transcripts.
- Wake-word gating: always-on with a sub-200ms “stop” channel on MQTT.
- Human handover as a primitive: “hand me the screwdriver”.
- Dataset aggregation across operators: merging multi-operator datasets into one VLA.
- Swapping the VLA: moving SmolVLA → Pi0 → OpenVLA without changing the voice layer.
Reference
- Endpoints: `ControllerPolicyExecuteSchema`, `TwinSchema`, `WorkflowExecuteSchema`.
- MQTT topics: `cyberwave/twin/{uuid}/telemetry`, `cyberwave/twin/{uuid}/command`. See MQTT API.
- SDK: `cw.twin`, `robot.capture_frame`, `robot.joints`, `twin.alerts`. See Python SDK.
- Cross-links: SO-101 Get Started, Voice as a Sensor, Teleoperation, Simulation, Key Concepts.