Overview

SmolVLA is a lightweight Vision-Language-Action (VLA) model optimized for edge deployment. Cyberwave supports fine-tuning SmolVLA models on your custom datasets.
SmolVLA models train on the LeRobot v3 dataset format, which differs from the TFDS format used by OpenVLA models. The platform handles format conversion automatically.

Model Selection

When starting a new training run:
  1. Navigate to AI → Training in your environment
  2. Select SmolVLA from the available model architectures
  3. The platform will automatically use the LeRobot training pipeline
Only models with is_trainable: true appear in the training model selection. SmolVLA is pre-configured as trainable.
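The filtering rule above can be illustrated with a minimal sketch. Only the is_trainable field name comes from this page; the catalog structure, the other model names, and the trainable_models helper are hypothetical and do not describe the actual Cyberwave API.

```python
# Hypothetical model catalog; only the `is_trainable` field name is taken
# from the documentation. Entry names other than SmolVLA are made up.
models = [
    {"name": "SmolVLA", "is_trainable": True},
    {"name": "ExampleInferenceOnlyModel", "is_trainable": False},
    {"name": "ExampleModelWithoutFlag"},
]


def trainable_models(catalog):
    """Return the entries whose is_trainable flag is exactly True."""
    return [m for m in catalog if m.get("is_trainable") is True]


selectable = [m["name"] for m in trainable_models(models)]
print(selectable)
```

In this sketch a model with a missing flag is treated as not trainable, which is an assumption about how the selection list behaves.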

Dataset Conversion

When you start training with a SmolVLA model, the platform:
  1. Joins your episode parquet files into a single dataset
  2. Converts the OpenVLA-format parquet to LeRobot v3 format
  3. Handles camera role mapping (primary, wrist, secondary)
  4. Encodes video frames using AV1 codec (configurable)
The conversion is cached: subsequent training runs with the same dataset and configuration skip the conversion step.
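One common way to implement this kind of caching is to derive a key from the dataset identifier and the conversion configuration, and to run the conversion only when that key has not been seen. The sketch below illustrates the idea under that assumption; conversion_cache_key, convert_once, and the in-memory cache are hypothetical names and not the platform's actual implementation.

```python
import hashlib
import json


def conversion_cache_key(dataset_id, config):
    """Build a stable key from the dataset ID and the conversion config."""
    # sort_keys makes the key independent of dict insertion order.
    payload = json.dumps({"dataset": dataset_id, "config": config}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


_cache = {}  # hypothetical in-memory cache: key -> converted dataset path


def convert_once(dataset_id, config, convert):
    """Run `convert` only if this (dataset, config) pair was not converted yet."""
    key = conversion_cache_key(dataset_id, config)
    if key not in _cache:
        _cache[key] = convert(dataset_id, config)
    return _cache[key]


# Usage with a stand-in converter that records how often it runs.
calls = []


def fake_convert(dataset_id, config):
    calls.append((dataset_id, dict(config)))
    return "converted/" + dataset_id


convert_once("demo-dataset", {"fps": 30, "vcodec": "libsvtav1"}, fake_convert)
convert_once("demo-dataset", {"vcodec": "libsvtav1", "fps": 30}, fake_convert)  # cache hit
convert_once("demo-dataset", {"fps": 15, "vcodec": "libsvtav1"}, fake_convert)  # new config
print(len(calls))
```

Changing any configuration value produces a different key, so a run with a different fps or codec triggers a fresh conversion in this sketch.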

Training Parameters

SmolVLA training supports:
Parameter | Description | Default
fps | Target frames per second | 30
use_videos | Store frames as MP4 videos | true
vcodec | Video codec | libsvtav1
num_cameras | Number of camera streams (1-3) | 1
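The parameters and defaults above can be captured in a small configuration object. This is a sketch only: the parameter names, defaults, and the 1-3 camera range come from the table, while the SmolVLATrainingParams class and the positive-fps check are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SmolVLATrainingParams:
    """Hypothetical container mirroring the documented parameters and defaults."""

    fps: int = 30
    use_videos: bool = True
    vcodec: str = "libsvtav1"
    num_cameras: int = 1

    def __post_init__(self):
        # Assumption: a non-positive frame rate is never meaningful.
        if self.fps <= 0:
            raise ValueError("fps must be positive")
        # The documented range for num_cameras is 1-3.
        if not 1 <= self.num_cameras <= 3:
            raise ValueError("num_cameras must be between 1 and 3")


defaults = SmolVLATrainingParams()
two_cameras = SmolVLATrainingParams(num_cameras=2)
print(defaults)
print(two_cameras.num_cameras)
```

Validating at construction time surfaces an out-of-range camera count before a training run is submitted.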

Camera Configuration

Camera roles are mapped to LeRobot conventions:
  • primary → observation.images.primary
  • wrist → observation.images.wrist
  • secondary → observation.images.secondary
Configure camera roles in the training wizard or via the API.
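The role mapping listed above amounts to a lookup table. The sketch below shows it as a plain dictionary; the role names and feature keys come from this page, while CAMERA_ROLE_TO_FEATURE and feature_keys are hypothetical names and not part of the Cyberwave API.

```python
# Camera role -> LeRobot feature key, as listed in the documentation.
CAMERA_ROLE_TO_FEATURE = {
    "primary": "observation.images.primary",
    "wrist": "observation.images.wrist",
    "secondary": "observation.images.secondary",
}


def feature_keys(roles):
    """Translate camera roles to feature keys, rejecting unknown roles."""
    unknown = [role for role in roles if role not in CAMERA_ROLE_TO_FEATURE]
    if unknown:
        raise ValueError("unknown camera roles: " + ", ".join(unknown))
    return [CAMERA_ROLE_TO_FEATURE[role] for role in roles]


keys = feature_keys(["primary", "wrist"])
print(keys)
```

Rejecting unknown roles is a design choice of this sketch; how the platform treats an unmapped role is not stated on this page.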

Training Workflow


Deployment

After training completes:
  1. Deploy the trained model as a controller policy
  2. Assign the VLA controller to your robot twin
  3. Use natural language prompts to control the robot
SmolVLA models are optimized for edge inference, making them suitable for real-time robot control with lower latency than larger VLA models.