Video data collection and processing pipeline for training and fine-tuning vision-language-action models. VLAs learn manipulation and navigation policies from paired video-language-action data, and the quality of that data directly determines model performance. This project builds the tooling to gather, process, and curate video datasets suitable for VLA training.
Motivation
Training a VLA from scratch or fine-tuning an existing one on new tasks requires large volumes of video paired with language instructions and action labels. Off-the-shelf datasets cover a limited set of environments and embodiments. To push the vla-world-model-control project further, a dedicated data pipeline is needed—one that can ingest video from simulation recordings, real-world demonstrations, and public robotics datasets, then normalize everything into a consistent format for training.
Video Collection
The pipeline collects video from multiple sources: task recordings from NVIDIA Isaac Sim, publicly available robotics demonstration datasets (such as DROID, Open X-Embodiment, and Bridge V2), and manually recorded demonstrations. Each video is paired with natural-language task descriptions and, where available, action trajectories (end-effector poses, joint angles, or discrete actions).
Processing and Annotation
Raw video goes through a processing pipeline that handles frame extraction, resolution normalization, and temporal subsampling to match the input requirements of target VLA architectures. Action labels are aligned to video frames, and language annotations are validated for consistency. The pipeline outputs training-ready data in formats compatible with standard VLA training codebases.
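The two timing-sensitive steps above, temporal subsampling and frame-action alignment, can be sketched as follows. This is a simplified illustration (nearest-timestamp matching, no interpolation); the function names are ours, not from any library:

```python
def subsample_indices(n_frames: int, src_fps: float, target_fps: float) -> list[int]:
    """Pick frame indices that downsample a src_fps video to target_fps."""
    step = src_fps / target_fps  # assumes downsampling, so step >= 1
    out, idx = [], 0.0
    while round(idx) < n_frames:
        out.append(round(idx))
        idx += step
    return out

def align_actions(frame_times: list[float], action_times: list[float]) -> list[int]:
    """For each kept frame timestamp, return the index of the
    nearest-in-time action label."""
    return [
        min(range(len(action_times)), key=lambda j: abs(action_times[j] - t))
        for t in frame_times
    ]
```

For example, a 30 fps clip subsampled to 10 fps keeps every third frame, and each kept frame is then paired with whichever action label was recorded closest to it in time. A real pipeline would also handle clock offsets between the camera and the robot controller.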
Dataset Curation
Not all data is equally useful. The curation stage filters for task diversity, demonstration quality, and action coverage. Duplicate or near-duplicate episodes are removed, failed demonstrations are flagged, and the dataset is balanced across task types and difficulty levels to avoid training biases.
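Two of these curation steps, near-duplicate removal and per-task balancing, can be sketched with simple heuristics. Both helpers below are hypothetical illustrations under stated assumptions (frames as flat lists of pixel values, episodes as task/id pairs), not the project's actual implementation:

```python
import hashlib
from collections import defaultdict

def episode_signature(frames: list[list[int]]) -> str:
    """Coarse content hash: mean pixel value per frame, quantized to a byte.
    Exact re-recordings of the same demo collapse to the same signature,
    so duplicates can be dropped by keeping one episode per signature."""
    means = bytes(int(sum(f) / len(f)) & 0xFF for f in frames)
    return hashlib.sha1(means).hexdigest()

def balance_by_task(episodes: list[tuple[str, str]], cap_ratio: float = 2.0) -> dict:
    """Cap each task's episode count at cap_ratio times the rarest task's
    count, so no single task dominates the training mix."""
    by_task = defaultdict(list)
    for task, ep_id in episodes:
        by_task[task].append(ep_id)
    floor = min(len(eps) for eps in by_task.values())
    cap = int(cap_ratio * floor)
    return {task: eps[:cap] for task, eps in by_task.items()}
```

In practice a perceptual hash or embedding distance would replace the per-frame mean, which only catches exact or near-exact duplicates; the capping logic generalizes directly.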
What's Next
- Build the ingestion pipeline for Isaac Sim task recordings
- Integrate public robotics datasets (DROID, Open X-Embodiment) into the unified format
- Implement frame-action alignment and temporal subsampling
- Run first fine-tuning experiments on a baseline VLA with the curated dataset