# Import Datasets

> Import robot datasets from HuggingFace Hub or upload zip files.

<Info>
  **Robot datasets first.** Cyberwave currently focuses on robot manipulation and navigation datasets (LeRobot, RLDS, Cyberwave Parquet, and similar time-series formats). Image classification, video, audio, and multimodal dataset formats are detected on import, but full support for them (playback, conversion, and training) is planned for a future release.
</Info>

## Overview

Cyberwave supports importing datasets from two sources:

* **HuggingFace Hub**: import directly by repository ID. The import is lazy, so only metadata is read up front and the multi-GB payload is not copied.
* **Zip Upload**: upload a pre-packaged dataset archive (max 10 GB).

Format detection happens automatically on the server side and sets the `source_format` field on the dataset record.

## Supported formats

### Robot datasets (full support)

These formats are fully supported for import, playback, conversion, and ML training.

| Format            | `source_format` value | Description                                                    |
| ----------------- | --------------------- | -------------------------------------------------------------- |
| LeRobot v3        | `lerobot3`            | Latest LeRobot format — parquet episodes + `meta/info.json`    |
| LeRobot v2.1      | `lerobot21`           | Legacy LeRobot format with splits (normalised to v3 on export) |
| RLDS / TFDS       | `rlds`                | TensorFlow Datasets format (Open-X-Embodiment style)           |
| Cyberwave Parquet | `cyberwave_parquet`   | Native format for datasets generated directly on Cyberwave     |
| HDF5              | `hdf5`                | Robomimic / ACT / ALOHA style datasets                         |
| Zarr              | `zarr`                | Diffusion Policy / UMI style datasets                          |
| GR00T             | `gr00t`               | NVIDIA Isaac GR00T (LeRobot v2 + embodiment metadata)          |
| RoboDM            | `robodm`              | Berkeley .vla format                                           |
| MCAP              | `mcap`                | ROS2 CDR + Foxglove Protobuf                                   |
| ROS bag           | `rosbag`              | ROS1 .bag / ROS2 SQLite3                                       |

### Other dataset types (detection only — full support coming later)

Cyberwave detects these formats and stores the `source_format` value, but playback, conversion, and training pipelines for them are not yet available.

| `source_format` value                                     | Description                            |
| --------------------------------------------------------- | -------------------------------------- |
| `coco_detection`                                          | COCO object detection JSON             |
| `yolov4` / `yolov5`                                       | YOLO format datasets                   |
| `voc_detection`                                           | Pascal VOC                             |
| `kitti_detection`                                         | KITTI                                  |
| `image_classification_directory_tree`                     | ImageNet-style class folders           |
| `image_directory` / `video_directory` / `media_directory` | Plain media folders                    |
| `image_segmentation_directory`                            | Per-pixel masks alongside images       |
| `cvat_image` / `cvat_video`                               | CVAT exports                           |
| `openlabel_image` / `openlabel_video`                     | OpenLABEL exports                      |
| `bdd`, `csv`, `dicom`, `geojson`, `geotiff`               | Specialty formats                      |
| `unknown`                                                 | Detector could not classify the layout |
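
The two tables above can be turned into a small lookup for client code that needs to know whether playback, conversion, and training are available for a given dataset. This is a minimal sketch: the `source_format` values are copied from the tables, while the function name and the returned labels (`full`, `detection_only`, `unknown`) are illustrative and not part of Cyberwave.

```python
# source_format values copied from the two tables on this page.
FULL_SUPPORT = frozenset({
    "lerobot3", "lerobot21", "rlds", "cyberwave_parquet", "hdf5",
    "zarr", "gr00t", "robodm", "mcap", "rosbag",
})
DETECTION_ONLY = frozenset({
    "coco_detection", "yolov4", "yolov5", "voc_detection", "kitti_detection",
    "image_classification_directory_tree", "image_directory", "video_directory",
    "media_directory", "image_segmentation_directory", "cvat_image", "cvat_video",
    "openlabel_image", "openlabel_video", "bdd", "csv", "dicom", "geojson", "geotiff",
})


def support_level(source_format: str) -> str:
    """Return 'full', 'detection_only', or 'unknown' (hypothetical labels)."""
    if source_format in FULL_SUPPORT:
        return "full"
    if source_format in DETECTION_ONLY:
        return "detection_only"
    # Covers the literal value "unknown" and anything not listed on this page.
    return "unknown"
```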

## Import from HuggingFace Hub

1. Navigate to your workspace and click **Import Dataset**
2. Select **HuggingFace Hub** as the source
3. Enter the repository ID (e.g. `lerobot/pusht`)
4. Click **Import**

HuggingFace imports are **lazy by default**: Cyberwave queries the HuggingFace API for tags, card data, and the file list, and for LeRobot datasets reads only the small `meta/*.json` files. The dataset card appears immediately with episode counts, FPS, robot type, and cameras, without copying the multi-GB payload. Frames are fetched on demand when a training run or visualisation needs them. The `metadata.import.materialized` flag stays `false` until a follow-up materialisation step runs.
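
To illustrate why the metadata files are enough to fill in a dataset card, the sketch below summarises a parsed LeRobot `meta/info.json`. It is not Cyberwave's implementation. It assumes the usual LeRobot keys (`total_episodes`, `fps`, `robot_type`, and a `features` map whose camera entries have `dtype` set to `video` or `image`), and the sample values are made up.

```python
def summarize_lerobot_info(info: dict) -> dict:
    """Pull dataset-card fields out of a parsed LeRobot meta/info.json."""
    features = info.get("features", {})
    # Assumption: camera streams are the features stored as video or image.
    cameras = sorted(
        name for name, spec in features.items()
        if isinstance(spec, dict) and spec.get("dtype") in ("video", "image")
    )
    return {
        "episodes": info.get("total_episodes"),
        "fps": info.get("fps"),
        "robot_type": info.get("robot_type"),
        "cameras": cameras,
    }


# Made-up sample for illustration; not a real dataset's metadata.
sample = {
    "robot_type": "so100",
    "fps": 30,
    "total_episodes": 50,
    "features": {
        "observation.images.top": {"dtype": "video", "shape": [480, 640, 3]},
        "observation.state": {"dtype": "float32", "shape": [6]},
        "action": {"dtype": "float32", "shape": [6]},
    },
}
card = summarize_lerobot_info(sample)
```

To try this against a real repository, the `huggingface_hub` package can fetch just that one file with `hf_hub_download(repo_id="lerobot/pusht", filename="meta/info.json", repo_type="dataset")`, assuming the repository stores its metadata at that path.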

<Info>
  For private HuggingFace repositories, ensure your organisation has configured the HuggingFace token in the deployment settings.
</Info>

## Upload a zip file

1. Navigate to your workspace and click **Import Dataset**
2. Select **Upload Zip** as the source
3. Select your zip file (max 10 GB)
4. Click **Upload**

The system automatically detects the format after the upload completes.
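
Before uploading, a dataset directory has to be packaged as a zip that fits the size limit. The sketch below uses only the Python standard library. It makes two assumptions that this page does not confirm: that the 10 GB limit means decimal gigabytes, and that the dataset files should sit at the root of the archive.

```python
import zipfile
from pathlib import Path

MAX_ZIP_BYTES = 10 * 1000**3  # assumption: the 10 GB limit is decimal gigabytes


def zip_dataset(src_dir: str, zip_path: str) -> int:
    """Zip src_dir into zip_path and return the archive size in bytes."""
    src = Path(src_dir)
    out = Path(zip_path).resolve()
    with zipfile.ZipFile(out, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(src.rglob("*")):
            # Skip directories and the archive itself if it lives inside src_dir.
            if path.is_file() and path.resolve() != out:
                # Store paths relative to the dataset root so the layout is preserved.
                zf.write(path, path.relative_to(src).as_posix())
    size = out.stat().st_size
    if size > MAX_ZIP_BYTES:
        raise ValueError(f"{out.name} is {size} bytes, over the {MAX_ZIP_BYTES}-byte limit")
    return size
```

For example, `zip_dataset("my_dataset", "my_dataset.zip")` would produce an archive whose entries look like `meta/info.json` rather than `my_dataset/meta/info.json`.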

## Monitor progress

After starting an import, you can track it via:

* The dataset list view (status indicator)
* The dataset detail page (`metadata.upload_progress`)
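
When tracking an import from a script, the fields named on this page can be read off a dataset record. This is a minimal sketch that assumes the record arrives as a JSON object in which the dotted paths `metadata.upload_progress` and `metadata.import.materialized` map to nested keys. The helper name and the sample record are illustrative only.

```python
def import_state(record: dict) -> dict:
    """Extract the import-related fields named on this page from a dataset record."""
    metadata = record.get("metadata") or {}
    import_meta = metadata.get("import") or {}
    return {
        "source_format": record.get("source_format"),
        "upload_progress": metadata.get("upload_progress"),
        # Treated as false when absent, matching the lazy-import default.
        "materialized": bool(import_meta.get("materialized", False)),
    }


# Made-up record for illustration; real records may carry more fields.
example = {
    "source_format": "lerobot3",
    "metadata": {"upload_progress": 40, "import": {"materialized": False}},
}
state = import_state(example)
```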

## API reference

<Card title="POST /datasets/import/init" icon="code">
  Initialise a dataset import. Returns a signed URL for zip uploads, or immediately starts a HuggingFace import.
</Card>

<Card title="POST /datasets/import/complete" icon="code">
  Complete a zip upload import after the file has been uploaded to the signed URL.
</Card>

See the [API Reference](/api-reference/rest/DefaultApi) for full details.
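
The zip flow described by the two endpoints (initialise, upload to the signed URL, complete) can be sketched as below. Only the endpoint paths and the order of the calls come from this page. The request fields (`source`, `filename`, `dataset_id`) and response fields (`upload_url`, `dataset_id`) are hypothetical placeholders, so check the API reference for the real schema. HTTP transport is passed in as two callables to keep the sketch independent of any particular client library.

```python
import os
from typing import Callable

# Hypothetical names throughout: the body and response fields used below
# are NOT taken from the API reference and only illustrate the call order.
PostJson = Callable[[str, dict], dict]  # (path, body) -> parsed JSON response
PutFile = Callable[[str, str], None]    # (signed_url, local_path) -> None


def import_zip(zip_path: str, post_json: PostJson, put_file: PutFile) -> str:
    """Run init -> upload -> complete and return the (assumed) dataset id."""
    init = post_json(
        "/datasets/import/init",
        {"source": "zip", "filename": os.path.basename(zip_path)},
    )
    # Upload the archive to the signed URL returned by the init call.
    put_file(init["upload_url"], zip_path)
    # Tell the server the upload finished so format detection can start.
    post_json("/datasets/import/complete", {"dataset_id": init["dataset_id"]})
    return init["dataset_id"]
```

A HuggingFace import would need only the first call, since the init endpoint starts that import immediately.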
