By the end of this tutorial you will have an SO-101 that listens to a spoken English command, looks at its workspace through a USB webcam, and executes a short motion plan written on-the-fly by Claude. No dataset, no VLA, no training. The policy is a system prompt.
Reference implementation. Full source for every module and smoke test referenced below lives in the `nl_arm_controller` example in the `cyberwave-os/cyberwave-python` repo. Clone it to follow along, or use the snippets in each step as a guide for your own implementation.

Architecture at a glance
Who provides what. Cyberwave provides: digital twin, Python SDK, MQTT transport, edge driver, Live Mode. You provide: the laptop, the USB webcam, API keys for Anthropic Claude and Mistral Voxtral.
Goals and scope
What you’ll build. A laptop-side REPL that takes a voice command (hold SPACE to talk), grabs a webcam frame, sends both to Claude as one multimodal request, parses the JSON motion plan it returns, validates it, and runs it on the SO-101 with smooth ramped motion.

In scope:
- Single SO-101 follower in a fixed workspace.
- Push-to-talk voice + a Mac-side USB webcam looking at the workspace.
- Expressive, gesture-style motions (wave, bow, point, look-toward).
- Visual Q&A grounded in the camera frame (“what do you see?”, “is there a red cup?”).

Out of scope:
- Manipulation / grasping driven by vision (use a VLA: see SO-101 Voice Pick-and-Place).
- Camera-to-arm calibration for millimetre-accurate pointing.
- Always-on wake word, multi-turn dialogue, or memory across turns.
Prerequisites
- Hardware: SO-101 follower on a Raspberry Pi, one USB webcam plugged into your laptop, microphone (built-in is fine).
- Credentials: Cyberwave API key + twin/environment UUIDs, Anthropic API key, Mistral API key.
- Base setup: SO-101 Get Started for the environment, edge install, and `so101-remoteoperate` running on the Pi.
- SDK baseline: Python SDK.
- Conceptual grounding: Key Concepts, Voice as a Sensor.
Project layout
The reference example lives at `cyberwave-os/cyberwave-python/examples/nl_arm_controller`:
nl_arm_controller.py · motion.py · planner.py · vision.py · voice.py · requirements.txt · .env.example · smoke_tests/
Each module is independent; that’s why each one has its own smoke test. If a piece breaks, you isolate it in seconds.
Step 1: Set up the Cyberwave environment
Before you build the agent, you need a working SO-101 twin and a Cyberwave environment with the edge driver running. The full reference is in SO-101 Get Started. The short version:
- Create a new environment in the Cyberwave dashboard and add an SO101 twin from the catalog. Position it to roughly match your physical setup.
- Install Cyberwave Edge on the Raspberry Pi paired to the follower (the install steps are in SO-101 Get Started).
- Pair the SO101 twin and calibrate the follower (dashboard or CLI). See Calibrate the arms.
- Note the environment UUID and twin UUID from the dashboard. You’ll paste them into `.env` in Step 3.
You’re done with Step 1 when the twin appears in the dashboard, calibration is green, and `so101-remoteoperate` is running on the Pi (visible as an active driver under the twin).

Step 2: Set up remote operation
The agent needs to send joint commands from your laptop to the follower without a physical leader arm. That means assigning a controller to the twin in the dashboard and verifying the cloud-to-edge path with the keyboard before you wire in any LLM.
- Open your environment and switch to Live Mode.
- Select the SO101 twin and click Assign Controller.
- Pick Keyboard from the controller list. The follower arm is now driven by your browser.
- Use the on-screen key bindings to nudge each joint. If the physical arm responds, the cloud, MQTT, and edge driver are all healthy.
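Once keyboard control works, the same path can be verified from Python. A minimal sketch follows; the import path and constructor behaviour are assumptions, while the `joints.set` call matches the SDK usage listed under Reference:

```python
# Minimal cloud-to-edge smoke test. Import path and constructor arguments
# are assumptions; joints.set matches the call quoted in this step.
from cyberwave import Cyberwave

cw = Cyberwave()                          # assumed to read your API key from the environment
robot = cw.twin("YOUR-TWIN-UUID")         # twin UUID noted in Step 1
robot.joints.set("1", 10, degrees=True)   # base joint to +10 degrees; the arm should move
```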
You’re done with Step 2 when keyboard input in the dashboard moves the physical arm, and a one-liner Python script using `cw.twin(...).joints.set("1", 10, degrees=True)` also moves it. The platform automatically swaps the keyboard controller for the SDK session when your script connects, so you don’t need to manually unassign anything.

Step 3: Develop your agent
The agent is five small modules on your laptop, each owning one transform. Build and smoke-test them one at a time, then orchestrate.

| Module | Input | Output | External service |
|---|---|---|---|
| `voice.py` | Mic audio (while SPACE held) | Transcribed text | Mistral Voxtral |
| `vision.py` | Webcam | Base64 JPEG | n/a (local OpenCV) |
| `planner.py` | Text + base64 JPEG | Validated `MotionPlan` | Anthropic Claude |
| `motion.py` | `MotionPlan` | Ramped joint commands | Cyberwave SDK |
| `nl_arm_controller.py` | CLI flags | Agent loop wiring it all together | n/a |
Copy `.env.example` to `.env` and fill in your keys; set `CW_CAMERA_INDEX` if your USB webcam isn’t device 0.
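For orientation, `.env` carries values of this kind (the variable names below are assumptions; `.env.example` in the repo is the source of truth):

```bash
# Illustrative .env -- variable names are assumptions, copy .env.example
# from the reference repo for the real ones.
CYBERWAVE_API_KEY=...
CYBERWAVE_ENVIRONMENT_UUID=...
CYBERWAVE_TWIN_UUID=...
ANTHROPIC_API_KEY=...
MISTRAL_API_KEY=...
CW_CAMERA_INDEX=0   # only if your webcam isn't device 0
```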
3.1 Voice as an input modality
voice.py is a push-to-talk mic recorder plus a Mistral Voxtral STT call. Hold SPACE to record, release to transcribe. Output is plain text.
- Recording: `sounddevice` captures 16-bit PCM at 16 kHz; `pynput` listens for the spacebar press/release. The recorder writes a temporary WAV to disk.
- Transcription: the WAV is POSTed to the Voxtral endpoint. Voxtral is OpenAI-Whisper-compatible at the API level, so swapping providers later is one URL change (see the sketch after this list).
- Failure mode: silent recording or permission errors return an empty string. The orchestrator treats empty input as “skip this turn”.
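The transcription half, sketched below; the endpoint URL and model name are assumptions about Mistral’s Whisper-compatible API, so check the reference `voice.py` for the real values:

```python
# Sketch of the Voxtral STT call. URL and model name are assumptions
# based on the Whisper-compatible API mentioned above.
import os
import requests

def transcribe(wav_path: str) -> str:
    """Return the transcript, or "" so the orchestrator skips the turn."""
    try:
        with open(wav_path, "rb") as f:
            resp = requests.post(
                "https://api.mistral.ai/v1/audio/transcriptions",  # assumed endpoint
                headers={"Authorization": f"Bearer {os.environ['MISTRAL_API_KEY']}"},
                files={"file": f},
                data={"model": "voxtral-mini-latest"},             # assumed model name
                timeout=30,
            )
        resp.raise_for_status()
        return resp.json().get("text", "").strip()
    except Exception:
        return ""  # empty string = "skip this turn"
```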
3.2 Vision as an input modality
vision.py wraps cv2.VideoCapture and exposes grab_frame_b64(), which returns a base64-encoded JPEG ready to drop into a Claude image content block. No model runs locally; OpenCV is just for capture and JPEG encoding.
- Resolution: defaults to whatever the webcam reports; downscaled to 1024 px on the long edge before encoding to keep token usage reasonable.
- Lifecycle: the orchestrator opens the camera once at startup and reuses the handle. A failed grab returns `None`, and the orchestrator falls back to text-only Claude calls for that turn. A minimal capture sketch follows this list.
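A sketch of `grab_frame_b64` under the behaviour described above; the JPEG-encoding details beyond the prose are assumptions:

```python
# Sketch of grab_frame_b64: capture, downscale to 1024 px on the long
# edge, JPEG-encode, base64-encode. Details beyond the prose are assumptions.
import base64
from typing import Optional

import cv2

def grab_frame_b64(cap: cv2.VideoCapture, max_edge: int = 1024) -> Optional[str]:
    ok, frame = cap.read()
    if not ok:
        return None                       # failed grab: caller goes text-only this turn
    h, w = frame.shape[:2]
    scale = max_edge / max(h, w)
    if scale < 1.0:                       # downscale only, never upscale
        frame = cv2.resize(frame, (int(w * scale), int(h * scale)))
    ok, buf = cv2.imencode(".jpg", frame)
    return base64.b64encode(buf.tobytes()).decode("ascii") if ok else None
```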
3.3 VLM as the planner
planner.py calls Anthropic Claude Sonnet 4.5 with a system prompt, the user transcript, and (optionally) a base64 JPEG. It returns a validated MotionPlan or a structured error.
The whole tutorial hinges on Claude returning a strict, machine-parseable motion plan, not prose. We do this in three layers, before any code parses anything.
1. Pin the schema in the system prompt. planner.SYSTEM_PROMPT documents the 6 joints with directional semantics (joint "1" positive = turn RIGHT), four allowed action types (set_joint, set_pose, wait, home), and three few-shot examples covering single-joint, multi-joint, and “stop” semantics. It also forbids markdown, code fences, and any prose outside the JSON object.
2. Use a low temperature. We call Claude with temperature=0.2: deterministic enough for consistent JSON, not so cold that the model loops.
3. Cap max_tokens. A typical plan is ~120 output tokens. We allow 400 (text-only) or 500 (vision). This caps cost, latency, and worst-case output length all at once.
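Together, the three layers look roughly like this — a condensed sketch assuming the official Anthropic Python SDK, where the model id string is an assumption based on “Claude Sonnet 4.5”:

```python
# Sketch of the planner call (assumes the Anthropic Python SDK; the
# model id below is an assumption based on "Claude Sonnet 4.5").
import anthropic

SYSTEM_PROMPT = "..."  # the schema-pinning prompt described above (not reproduced here)

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def call_planner(transcript: str, frame_b64: str | None) -> str:
    """Return Claude's raw JSON plan text for one turn."""
    content = []
    if frame_b64 is not None:  # optional webcam frame as an image block
        content.append({
            "type": "image",
            "source": {"type": "base64", "media_type": "image/jpeg", "data": frame_b64},
        })
    content.append({"type": "text", "text": transcript})
    resp = client.messages.create(
        model="claude-sonnet-4-5",             # assumed model id
        system=SYSTEM_PROMPT,                  # layer 1: pin the schema
        temperature=0.2,                       # layer 2: consistent JSON
        max_tokens=500 if frame_b64 else 400,  # layer 3: cap output length
        messages=[{"role": "user", "content": content}],
    )
    return resp.content[0].text
```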
For “wave at the audience”, Claude returns a plan like the sketch below.
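(Illustrative reconstruction: the action types and joint ids follow the schema pinned in the system prompt, but the exact field names are assumptions.)

```json
{
  "actions": [
    { "type": "set_joint", "joint": "5", "angle": 45, "duration": 0.5 },
    { "type": "set_joint", "joint": "5", "angle": -45, "duration": 0.5 },
    { "type": "set_joint", "joint": "5", "angle": 45, "duration": 0.5 },
    { "type": "home", "duration": 1.0 }
  ]
}
```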
The vision variant of the prompt (`VISION_SYSTEM_PROMPT`) adds decision rules: “describe only: empty actions”, “motion only: narrate briefly”, “visually-grounded motion: describe + aim with joint 1”, plus a rule for honest “no, I don’t see that” answers with empty actions.
Treat the LLM as untrusted input. The LLM is the brain; safety is your code. Five layers, defense-in-depth:
- Prompt constraints: shape the output before it’s generated (above).
- Defensive parser: `parse_plan_json` strips markdown fences, recovers from leading/trailing prose, and returns `(None, error)` on any failure. Never raises.
- Schema validation: `validate_plan` rejects unknown action types, unknown joints, negative or excessive durations, and plans with more than 8 actions. All-or-nothing: any single error means the entire plan is rejected and the arm doesn’t move.
- Per-joint clamping: every commanded angle is squeezed into `DEFAULT_JOINT_LIMITS` right before the SDK call. If Claude emits `angle: 9999`, the arm receives 90. The arm is physically incapable of executing an out-of-range request.
- Try/except containment: any executor exception triggers a `home(duration=1.0)` and the loop continues. Ctrl+C always reaches the `finally` block that homes the arm and disconnects cleanly.
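The clamping layer is small enough to sketch in full; the limit values below are illustrative assumptions, not the reference example’s real table:

```python
# Sketch of per-joint clamping (layer 4). The limit values here are
# illustrative assumptions, not the reference example's real table.
DEFAULT_JOINT_LIMITS = {
    "1": (-90.0, 90.0), "2": (-90.0, 90.0), "3": (-90.0, 90.0),
    "4": (-90.0, 90.0), "5": (-90.0, 90.0), "6": (-90.0, 90.0),
}

def clamp_angle(joint: str, angle: float) -> float:
    """Squeeze a commanded angle into the joint's limits just before the SDK call."""
    lo, hi = DEFAULT_JOINT_LIMITS[joint]
    return max(lo, min(hi, angle))  # e.g. angle=9999 on joint "1" becomes 90.0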
3.4 Orchestrate it together
motion.py and nl_arm_controller.py close the loop. The executor turns plans into smooth motion; the orchestrator owns the agent loop.
Smooth motion with ramping. The motion executor never commands a target pose in one step. Each action is split into duration × 20 linear-interpolation steps at 20 Hz, with a 50 ms `time.sleep` between them. Multi-joint moves share the same t ∈ [0, 1] parameter so all joints arrive at the target simultaneously (see the sketch after the list below).
Why this matters:
- Visual quality: the arm sweeps through every intermediate angle instead of snapping to the target at servo top speed.
- Hardware kindness: gradual joint changes are easier on the gearbox.
- Predictable timing: `duration=1.5` means the move takes 1.5 seconds and ends at the target, not earlier.
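A sketch of a ramped move under those constraints; `robot.joints.set` matches the SDK call listed under Reference, while the function name and signature are illustrative:

```python
# Sketch of a ramped multi-joint move: duration x 20 linear-interpolation
# steps at 20 Hz, one shared t so all joints arrive together.
import time

def ramp_to(robot, start: dict[str, float], target: dict[str, float], duration: float) -> None:
    steps = max(1, round(duration * 20))   # 20 interpolation steps per second
    for i in range(1, steps + 1):
        t = i / steps                      # shared t in (0, 1]; ends exactly at the target
        for joint, goal in target.items():
            angle = start[joint] + t * (goal - start[joint])
            robot.joints.set(joint, angle, degrees=True)
        time.sleep(0.05)                   # 50 ms between steps
```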
Under the hood, each SDK joint command is published to the MQTT topic `cyberwave/twin/{uuid}/joint_states`. The Pi’s `so101-remoteoperate` is subscribed to that topic and drives the servos via `/dev/ttyACM0`.
The agent loop (in nl_arm_controller.py):
- Wait for input. With `--voice`, hold SPACE to record. Without, type at the REPL.
- If `--vision`, grab a webcam frame.
- Send transcript + (optional) frame to Claude.
- Parse, validate, and clamp the returned plan.
- Hand it to the executor; the executor sends ramped commands through the SDK.
- Print latency for each step. Loop.
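Condensed, the loop body looks like the sketch below; every helper name is illustrative shorthand for the modules above, not the reference example’s exact API:

```python
# Illustrative agent loop: record_and_transcribe, plan_motion, and
# execute_plan stand in for voice.py / planner.py / motion.py.
while True:
    text = record_and_transcribe() if args.voice else input("you> ")
    if not text:
        continue                              # empty transcript: skip this turn
    frame_b64 = grab_frame_b64(cap) if args.vision else None
    plan, err = plan_motion(text, frame_b64)  # Claude call + parse + validate + clamp
    if plan is None:
        print(f"plan rejected: {err}")        # arm does not move on any error
        continue
    try:
        execute_plan(robot, plan)             # ramped commands through the SDK
    except Exception:
        home(robot, duration=1.0)             # contain executor failures, keep looping
```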
Step 4: Run it live
Aim the webcam
Point the USB webcam at the SO-101 workspace. Use the JPEG saved by smoke test 07 to confirm framing and lighting.
Speak a command
Hold SPACE, say “wave at the audience”, release. Watch the terminal print the transcript, the planner latency, and the JSON plan, then watch the arm execute it.
Try vision-grounded prompts
“what do you see in front of you?”: expect a description, no motion. “is there a red cup?”: expect a yes/no grounded in the actual frame. “look at the red cup”: expect a small base rotation toward whatever Claude identifies.
Commands to try: “home the arm”, “wave at the audience”, “do a small bow”, “what do you see?”, “is there a red cup?”, “look at the [object]”.
Next Steps
Each is a follow-up tutorial’s worth of work. Pick one after you hit the success criteria from Goals and scope.
- Pick-and-place with a VLA: when you need learned manipulation, see SO-101 Voice Pick-and-Place.
- Camera-to-arm calibration: replace approximate “point at” with metric pointing using a checkerboard or AprilTag.
- Multi-turn memory: remember “do that again” by keeping plan history in a small in-process store.
- On-robot microphone (Pattern B): mic and STT on the edge, transcript recorded as telemetry. See Voice as a Sensor.
- Swap the planner: GPT-4o, Gemini 2.5, or a fine-tuned local model. The planner module is one HTTP call, easy to replace.
- Workflow integration: trigger Cyberwave Workflows from natural language. See Workflows.
Reference
- Example folder: `cyberwave-os/cyberwave-python/examples/nl_arm_controller` (smoke tests under `smoke_tests/`)
- SDK calls used: `Cyberwave()`, `cw.affect()`, `cw.twin()`, `robot.joints.set()`. See Python SDK.
- MQTT topics published by the SDK: `cyberwave/twin/{uuid}/joint_states` (`source_type` `tele`). See MQTT API.
- External services: Anthropic Claude Sonnet 4.5, Mistral Voxtral.
- Cross-links: SO-101 Get Started, SO-101 Voice Pick-and-Place, Voice as a Sensor, Key Concepts, Teleoperation.