By the end of this tutorial you will have a Waveshare UGV Beast that listens to a spoken English command, looks at its workspace through its pan-tilt camera, and executes a short driving plan written on-the-fly by Claude. No dataset, no VLA training, no per-task scripting. The policy is a system prompt; the safety is your validator.Documentation Index
Fetch the complete documentation index at: https://docs.cyberwave.com/llms.txt
Use this file to discover all available pages before exploring further.
Reference implementation. The full source will live in
nl_ugv_controller,
side by side with the existing
nl_arm_controller
example. The structure mirrors the arm controller deliberately: same five
modules, same smoke-test layout, same --check / --voice / --vision / --dry-run
CLI surface.Architecture at a glance
Who provides what. Cyberwave provides: digital twin, Python SDK, MQTT
transport, edge ROS2 driver, Live Mode, the unified
actuation command
surface. You provide: the laptop, a microphone, API keys for Anthropic Claude
and Mistral Voxtral, and a UGV Beast already paired to your environment.Goals and scope
What you’ll build. A laptop-side REPL that takes a voice command (hold SPACE to talk), grabs a frame from the UGV’s pan-tilt camera, sends both to Claude as one multimodal request, parses the JSON action plan it returns, validates and clamps it, and runs it on the UGV through the same MQTT control plane the UGV Beast Controller already uses today. In scope- A single Waveshare UGV Beast in a known indoor space (open floor or a marked lane).
- Push-to-talk voice from your laptop and the UGV’s onboard pan-tilt camera as the vision source.
- A bounded vocabulary of discrete driving actions: short forward/backward translations, in-place turns, camera pan and tilt, lights, and stop.
- Visual Q&A grounded in the live frame (“what do you see?”, “is the door open?”), with optional visually-grounded short drives (“drive toward the box”).
- Metric-accurate navigation, SLAM, or waypoint following. This agent reasons in seconds-of-motion, not metres-on-a-map.
- Continuous control via a learned VLA. That’s a separate tutorial that records UGV teleoperation demonstrations and finetunes a Vision-Language-Action model on them; see Next steps.
- Always-on wake word, multi-turn dialogue, dialog memory, or on-robot STT (the Pi-side variant is a follow-up).
- Outdoor / unstructured / GPS-denied terrain.
Prerequisites
This tutorial starts from a UGV Beast that you can already drive with the UGV Beast Controller (the bespoke keyboard policy bundled with the rover, not the generic Keyboard controller). Do not skip that base setup; everything below assumes it is in place.- Hardware: Waveshare UGV Beast (Jetson Orin or Pi variant), microphone on the laptop (built-in is fine), and the UGV’s stock pan-tilt USB camera.
- Credentials: Cyberwave API key, twin and environment UUIDs, Anthropic API key, Mistral API key.
- Base setup: UGV Beast hardware get-started (pair the twin, install Cyberwave Edge, verify control with the UGV Beast Controller).
- SDK baseline: Python SDK.
- Conceptual grounding: Key Concepts, Voice as a Sensor. The arm-side sibling tutorial SO-101 NL Voice + Vision Agent covers the same five-module pattern on a manipulator if you want a working reference to read alongside this one.
You’re ready for this tutorial when the UGV Beast Controller drives your
physical UGV forward / backward / turn in place from W / A / S / D, the
pan-tilt camera responds to I / J / K / L, and the camera’s video stream is
live in the environment viewer.
Project layout
The reference example will live atcyberwave-os/cyberwave-python/examples/nl_ugv_controller, alongside the existing arm controller. Same five-module shape, same smoke-test discipline:
Step 1: Set up the Cyberwave environment
Before you build the agent, you need a working UGV Beast twin and a Cyberwave environment with the edge driver running. The full reference is in the UGV Beast hardware get-started. The short version:- Create a new environment in the Cyberwave dashboard and Add from Catalog → UGV Beast to create the twin with the right capabilities pre-configured.
- SSH into the rover’s compute board, disable Waveshare’s stock orchestration, install Cyberwave Edge, and pair the twin.
- Confirm the pan-tilt camera streams in the environment viewer and the UGV Beast Controller drives the wheels.
- Note the environment UUID and twin UUID from the dashboard. You’ll paste them into
.envin Step 3.
The twin appears in the dashboard, the pan-tilt camera is streaming, the
battery telemetry is updating, and the UGV Beast Controller in Live Mode
moves the physical rover.
Step 2: Verify the UGV Beast Controller path
The agent will speak exactly the same vocabulary the UGV Beast Controller speaks today. Proving that controller end-to-end before you wire in any LLM rules out four moving parts (calibration, pairing, MQTT, edge driver) in two minutes, and it validates the precise set ofactuation strings the planner will emit.
- Open your environment and switch to Live Mode.
- Select the UGV Beast twin and click Assign Controller.
- Pick the UGV Beast Controller from the controller list (not the generic “Keyboard” controller; the UGV one is the rover-specific policy with the right key bindings and widgets pre-wired). The rover is now driven from your browser.
- Drive each binding once: W / A / S / D for forward / left / back / right, I / J / K / L for camera tilt up / pan left / tilt down / pan right, U to recentre the camera, F and R to toggle the chassis and camera lights, E to take a photo, B to refresh the battery widget. If the physical rover responds on every key, the entire cloud-to-edge path is healthy and the full action vocabulary works.
You’re done with Step 2 when every binding on the UGV Beast Controller moves
the physical rover or updates a widget, and a one-liner Python script using
cw.twin("waveshare/ugv-beast", ...).move_forward(distance=0.3) also moves
it. The platform automatically swaps the UGV Beast Controller for the SDK
session when your script connects, so you don’t need to manually unassign
anything.Step 3: Develop your agent
The agent is five small modules on your laptop, each owning one transform. Build and smoke-test them one at a time, then orchestrate.| Module | Input | Output | External service |
|---|---|---|---|
voice.py | Mic audio (while SPACE held) | Transcribed text | Mistral Voxtral |
vision.py | UGV pan-tilt camera | Base64 JPEG | n/a (local OpenCV) |
planner.py | Text + base64 JPEG | Validated ActionPlan | Anthropic Claude |
drive.py | ActionPlan | Clamped SDK calls / MQTT actions | Cyberwave SDK |
nl_ugv_controller.py | CLI flags | Agent loop wiring it all together | n/a |
3.1 Voice as an input modality
voice.py is a push-to-talk mic recorder plus a Mistral Voxtral STT call. Hold SPACE to record, release to transcribe. Output is plain text.
- Recording:
sounddevicecaptures 16-bit PCM at 16 kHz;pynputlistens for the spacebar press and release. The recorder writes a temporary WAV to disk and shows a live RMS meter so you can see whether the mic is hot. - Transcription: the WAV is POSTed to the Voxtral endpoint. Voxtral is OpenAI-Whisper-compatible at the API level, so swapping providers later is one URL change.
- Failure modes: too-short recordings, silent recordings, and STT errors all return an empty string. The orchestrator treats empty input as “skip this turn” rather than crashing.
3.2 Vision as an input modality
vision.py grabs a frame from the UGV’s pan-tilt camera and returns a base64-encoded JPEG ready to drop into a Claude image content block. No model runs locally; OpenCV is just for capture and JPEG encoding.
There are two viable capture paths and the example will pick one based on whether you’re running the agent on your laptop or on the rover’s compute board:
- Laptop-side agent: open the UGV camera over the network from the existing video stream that Cyberwave’s edge driver already exposes (the same stream the dashboard renders).
- On-rover agent: open
/dev/video0directly withcv2.VideoCapture, mirroring the arm tutorial.
vision.py is the same single call: grab_frame_b64() returns a base64 JPEG or None. A failed grab falls back to text-only Claude calls for that turn so the agent stays usable when the camera is down.
3.3 VLM as the planner
planner.py calls Anthropic Claude Sonnet 4.5 with a system prompt, the user transcript, and (optionally) a base64 JPEG. It returns a validated ActionPlan or a structured error.
The action vocabulary is intentionally small and one-to-one with the verbs the edge GenericActuationHandler already accepts. Every action the planner can emit is a command the rover already knows how to execute:
| Category | Allowed actions | Argument |
|---|---|---|
| Locomotion | move_forward, move_backward | distance (metres, capped) |
| Locomotion | turn_left, turn_right | angle (radians, capped) |
| Locomotion | stop, wait | duration (seconds, capped); stop takes none |
| Camera servo | camera_up, camera_down, camera_left, camera_right, camera_default | none (one step per call) |
| Lights | chassis_light_toggle, camera_light_toggle | none |
| Utilities | take_photo, battery_check | none |
- Pin the schema in the system prompt. The system prompt documents every action verb, its argument shape, and its allowed range; lists three to five few-shot examples covering pure description, pure motion, and visually-grounded motion; and forbids markdown, code fences, or any prose outside the JSON object.
- Use a low temperature. Call Claude with
temperature=0.2: deterministic enough for consistent JSON, not so cold that the model loops on a phrasing it doesn’t like. - Cap
max_tokens. A typical plan is well under 200 output tokens; capping at 400 (text-only) or 500 (vision) bounds cost, latency, and worst-case output length all at once.
actions array; explicit motion commands get motion; visually-grounded motion (“drive toward the red box”) gets a short, conservative drive plus a stop, with the rationale spoken in the plan’s say field; references to things the camera doesn’t see get an honest “no, I don’t see that” and an empty actions array.
Treat the LLM as untrusted input. The LLM is the brain; safety is your code. Five layers, defense-in-depth:
- Prompt constraints shape the output before it’s generated (above).
- Defensive parser strips markdown fences, recovers from leading or trailing prose, and returns
(None, error)on any failure. Never raises. - Schema validation rejects unknown action verbs, missing arguments, negative or excessive distances and angles, and plans with more than the configured maximum number of actions. All-or-nothing: any single error means the entire plan is rejected and the rover doesn’t move.
- Per-action clamping squeezes every distance and angle into a conservative envelope right before the SDK call. If Claude emits
distance: 99, the rover receives the configured per-action ceiling (for example, 1.0 m). The rover is physically incapable of executing an out-of-range request. - Try/except containment: any executor exception triggers an immediate
stopand the loop continues.Ctrl+Calways reaches afinallyblock that issues astopand disconnects cleanly.
3.4 Orchestrate it together
drive.py and nl_ugv_controller.py close the loop. The executor turns validated plans into SDK calls; the orchestrator owns the agent loop.
One verb, one SDK call (mostly). The locomotion verbs map one-to-one onto the existing Twin SDK helpers:
move_forward→robot.move_forward(distance=...)move_backward→robot.move_backward(distance=...)turn_left→robot.turn_left(angle=...)turn_right→robot.turn_right(angle=...)stop→ a zero-velocitymove_forward(distance=0.0)
cyberwave/twin/{uuid}/command with source_type set correctly for simulation or real-world via cw.affect(). The camera-servo, lights, and utility verbs don’t have one-liner SDK helpers; the executor publishes the same MQTT payload shape the UGV Beast Controller emits today, which the edge GenericActuationHandler already consumes. Either way, the agent never invents a new control surface. It speaks the protocol the rover already speaks.
The agent loop (in nl_ugv_controller.py):
- Wait for input. With
--voice, hold SPACE to record. Without, type at the REPL. - If
--vision, grab a UGV camera frame. - Send transcript and (optional) frame to Claude.
- Parse, validate, and clamp the returned plan.
- Hand it to the executor; the executor dispatches the clamped actions through the SDK and MQTT.
- Print latency for each step. Loop.
"stop", "halt", "freeze") detected before the planner call, dispatching a direct stop over MQTT without waiting for Claude. Not wired in yet, but the orchestrator’s input hook is the right place for it.
Step 4: Run it live
Pre-flight
Run the environment self-check, then the smoke tests in order:
SDK + UGV (drive forward 0.3 m and stop), Claude planner, Voxtral STT,
spacebar capture, the executor against a hand-written plan, the camera
frame grab (open the saved JPEG to confirm framing and lighting), and
finally the vision-grounded planner against your workspace. Each smoke
test isolates one layer. Fix any failures here, not during a demo.
Clear the lane
Move the rover to the start of a marked lane on the floor. No people,
pets, or cables within the rover’s reachable workspace for the next
planning turn (about 1 m and one in-place turn at default ceilings).
Launch the agent
Start the orchestrator with
--voice --vision. Wait for the banners
confirming the SDK is connected, the camera is ready, and the planner
has been initialised.Speak a command
Hold SPACE, say “drive forward a little, then turn right”, release.
Watch the terminal print the transcript, the planner latency, and the
JSON plan, then watch the rover execute it and stop.
Try vision-grounded prompts
“What do you see in front of you?”: expect a description, no
motion. “Is there a box ahead?”: expect a yes/no grounded in the
actual frame. “Drive toward the box.”: expect a short forward plus a
final
stop, with the rationale spoken in the plan’s say field.stop, drive forward a little, turn right, look up, what do you see?, drive toward the box.
Safety and operational notes
- Default to dry-run while iterating. The
--dry-runflag runs the entire pipeline (voice, vision, planning, validation) and prints the clamped plan without ever publishing to MQTT. Use it whenever you’re changing the prompt or the executor. - Physical E-stop. Keep one reachable. Software-only stops are for convenience, not safety.
- Simulation first after any change. After any change to the prompt, the planner, or the executor, run the new flow against the digital twin in
cw.affect("simulation")before the physical rover. The same code drives both. - Lane discipline. Per-action distance and angle ceilings only protect against worst-case single actions. A plan with eight maximal forward steps will still travel up to 8 m. Run in a marked, cleared lane until you trust your prompt.
- What invalidates your setup: the pan-tilt camera moved, lighting changed dramatically since you last took a reference frame, or the rover was repaired or recalibrated. Re-run smoke tests 01 and 07 before resuming.
Next steps
Each of these is a follow-up tutorial’s worth of work. Pick one after you hit the success criteria from Goals and scope.- Train a VLA on UGV teleop demos. When you need learned, continuous-control behaviour instead of discrete verbs (smoother trajectories, better grounded navigation), record teleoperated episodes with Cyberwave’s data-recording infrastructure and finetune a Vision-Language-Action model on them. The arm side of this lives at SO-101 Voice Pick-and-Place; the UGV equivalent will land as its own page.
- Autonomous navigation primitives. Replace per-action
distanceandanglewith goal-oriented verbs likenavigate_to(waypoint)once you have a map. Pairs naturally with Rover AI Inspection. - On-rover microphone (Pattern B). Mic on the rover, STT on the edge, transcripts recorded as telemetry. See Voice as a Sensor.
- Local STT and local VLM. Swap Voxtral for the SDK’s bundled Whisper runtimes and Claude for a local VLM via Ollama for a fully on-device agent.
- Wake-word gating with a sub-200 ms
"stop"channel published directly on MQTT, bypassing the main planner. - Workflow integration. Trigger Cyberwave Workflows from natural language. See Workflows.
Reference
- Example folder (forthcoming):
cyberwave-os/cyberwave-python/examples/nl_ugv_controller. - Edge action surface:
GenericActuationHandler: the canonical list of UGV verbs the agent is allowed to emit. - SDK calls used:
Cyberwave(),cw.affect(),cw.twin("waveshare/ugv-beast", ...),robot.move_forward(),robot.move_backward(),robot.turn_left(),robot.turn_right(). See Python SDK. - MQTT topics published by the SDK:
cyberwave/twin/{uuid}/commandwithsource_typeset totele(real-world) orsim_tele(simulation). See MQTT API. - External services: Anthropic Claude Sonnet 4.5, Mistral Voxtral.
- Sibling tutorial: SO-101 NL Voice + Vision Agent: same five-module pattern on a manipulator. The first place to look while you wait for the reference code under this tutorial to land.
- Cross-links: UGV Beast hardware get-started, Voice as a Sensor, Key Concepts, Teleoperation, Simulation.