vla-world-model-control

Python · PyTorch · NVIDIA Isaac Sim · Isaac Lab · Isaac Lab Arena · ROS 2

Independent study at The Cooper Union under Prof. Carl Sable investigating how vision-language-action (VLA) models can be combined with learned world models for robust long-horizon robotic control. The project uses NVIDIA Isaac Sim and Isaac Lab as the simulation backbone, with the goal of benchmarking existing VLA architectures and evaluating whether world-model augmentation improves planning over extended task sequences.

Environment Setup

The first stage of the project focuses on standing up the full simulation and training stack. The local workstation runs an NVIDIA GeForce RTX 5060 Ti with 16 GB of VRAM. On top of that, Isaac Sim provides the physics-based rendering environment, Isaac Lab supplies the reinforcement-learning and robot-learning framework, and Isaac Lab Arena adds a library of pre-built task scenes and benchmark configurations. Getting these layers to work together cleanly is a prerequisite for everything that follows.
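Since the stack has three software layers sitting on one GPU, a quick import smoke test is a cheap way to confirm each layer is at least installed before launching a full simulation. The sketch below is illustrative and not part of the project; `torch` is PyTorch's real import name, while `isaacsim` and `isaaclab` are assumed to be the import names of the pip-installed Isaac Sim and Isaac Lab packages (they vary by install method and version).

```python
"""Smoke test: is each layer of the simulation stack importable?

Assumption: `isaacsim` and `isaaclab` are the top-level import names
for this particular install; adjust to match your environment.
"""

import importlib.util


def check_stack(modules=("torch", "isaacsim", "isaaclab")):
    """Map each layer's import name to True (found) or False (missing)."""
    return {name: importlib.util.find_spec(name) is not None
            for name in modules}


if __name__ == "__main__":
    for name, found in check_stack().items():
        print(f"{name:10s} {'OK' if found else 'MISSING'}")
```

This only proves the packages resolve; rendering and CUDA access still need a real Isaac Sim launch to verify.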

VLA Benchmarking

Once the environment is stable, the next phase is to run multiple vision-language-action models across a variety of manipulation and navigation benchmarks inside Isaac Sim. The aim is to establish baseline performance for each VLA on standardized tasks, measuring success rate, sample efficiency, and generalization to unseen object configurations, so that later comparisons against world-model-augmented variants are grounded in reproducible numbers.
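To make those baselines reproducible, per-episode results have to be aggregated the same way for every model. A minimal sketch of that aggregation, assuming each episode is logged as a `(task, success, steps)` tuple (an illustrative schema, not the project's actual logging format):

```python
"""Aggregate per-episode benchmark logs into per-task summary metrics.
The (task, success, steps) tuple schema is a hypothetical example."""

from collections import defaultdict


def summarize(episodes):
    """Compute success rate and mean steps-to-success for each task."""
    by_task = defaultdict(list)
    for task, success, steps in episodes:
        by_task[task].append((success, steps))

    summary = {}
    for task, results in by_task.items():
        wins = [steps for success, steps in results if success]
        summary[task] = {
            "episodes": len(results),
            "success_rate": len(wins) / len(results),
            # None when the policy never succeeded on this task
            "mean_steps_to_success": sum(wins) / len(wins) if wins else None,
        }
    return summary


episodes = [
    ("pick_place", True, 120), ("pick_place", False, 300),
    ("pick_place", True, 90), ("drawer_open", True, 45),
]
print(summarize(episodes))
```

Generalization to unseen object configurations can reuse the same function by tagging held-out configurations as separate tasks.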

World-Model Integration

The longer-term goal is to evaluate whether coupling a learned world model with a VLA improves performance on long-horizon tasks where pure reactive policies tend to fail. A world model that can predict future states gives the agent a planning signal: it can roll out candidate action sequences internally before committing to one in the real environment. The study will explore different integration strategies and measure the impact on multi-step task completion.
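The "roll out candidate action sequences internally before committing" idea can be made concrete with a toy example. Everything below is schematic: the world model is a hand-coded one-step 1-D dynamics function standing in for a learned predictor, and the planner is plain random shooting, one of several integration strategies the study might compare rather than a chosen method.

```python
"""Toy rollout-based planning: score imagined trajectories under a
world model, commit only to the best action sequence. The 1-D dynamics
and random-shooting planner are illustrative stand-ins."""

import random


def world_model(state, action):
    # Stand-in for a learned one-step predictor: next state from action.
    return state + action


def plan(state, goal, horizon=5, n_candidates=200, seed=0):
    """Return the candidate action sequence whose imagined final state
    lands closest to the goal, plus that distance."""
    rng = random.Random(seed)
    best_seq, best_cost = None, float("inf")
    for _ in range(n_candidates):
        seq = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        s = state
        for a in seq:          # imagined rollout: no environment steps
            s = world_model(s, a)
        cost = abs(s - goal)   # distance of imagined final state to goal
        if cost < best_cost:
            best_seq, best_cost = seq, cost
    return best_seq, best_cost


seq, cost = plan(state=0.0, goal=3.0)
print(f"imagined final-state error: {cost:.3f}")
```

A reactive policy picks one action at a time from the current observation; here the cost of a whole multi-step sequence is evaluated before any action is executed, which is the planning signal the world model is meant to provide.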

What's Next