VLA Overview¶
VLA (Vision-Language-Action) models combine vision, language, and action control, enabling a robot to understand a scene, interpret an instruction, and output actions for manipulation tasks.
🧩 Hardware List¶
- 1 OpenArmX bimanual robot
- 2 RealSense D405 (left/right hand)
- 1 RealSense D435 (head)
- 1 teleoperation device: Pico4 Ultra (VR) / exoskeleton / OpenArmX leader version
One-line Summary¶
🖼️ See the scene + 💬 understand the instruction → 🤖 generate an action
Core Capabilities¶
- 👀 Vision understanding: objects, positions, scene state
- 🧠 Language understanding: task goals from natural language
- ✋ Action decision: next action for arm/robot
Typical Pipeline¶
Camera images + text instruction
↓
VLA model
↓
Robot action sequence
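The pipeline above can be sketched in a few lines of Python. This is a minimal, self-contained illustration with hypothetical class and method names (`Observation`, `DummyVLAPolicy`, `predict`) standing in for a real trained model; it is not the actual LeRobot or OpenArmX API.

```python
# Minimal sketch of one VLA inference step: (camera images + text) -> action chunk.
# All names here are illustrative placeholders, not a real API.
from dataclasses import dataclass
from typing import Dict, List, Optional, Any


@dataclass
class Observation:
    images: Dict[str, Optional[Any]]  # camera name -> image array (head, wrists)
    instruction: str                  # natural-language task description


class DummyVLAPolicy:
    """Stand-in for a trained VLA model: maps an observation to an action chunk."""

    def predict(self, obs: Observation, horizon: int = 4) -> List[List[float]]:
        # A real model would run a vision-language backbone here and decode
        # actions; this stub just returns a zero action chunk of the
        # requested horizon.
        dof = 14  # e.g. 7 joints per arm for a bimanual robot (assumption)
        return [[0.0] * dof for _ in range(horizon)]


policy = DummyVLAPolicy()
obs = Observation(
    images={"head": None, "left_wrist": None, "right_wrist": None},
    instruction="pick up the red cube",
)
actions = policy.predict(obs)
print(len(actions), len(actions[0]))  # 4 14
```

In a real deployment the returned action chunk would be streamed to the robot controller step by step, with fresh observations fed back into the model.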
Common Applications¶
- 🏠 Home service robotics
- 🏭 Industrial sorting and assembly
- 🧪 Laboratory automation
✅ Key value: VLA enables robots to naturally understand and execute human instructions.
OpenArmX VLA Workflow¶
Data collection (VR / homologous / exoskeleton)
↓
LeRobot dataset
↓
Model training
↓
Real-robot inference
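The "data collection → LeRobot dataset" step above amounts to serializing teleoperation episodes into a structured on-disk layout. The following sketch shows the general idea with a deliberately simplified JSON-lines layout; the directory names and frame schema are illustrative assumptions, not the real LeRobot dataset format.

```python
# Hypothetical sketch: packaging teleop episodes into an episode-per-directory
# layout. The layout and field names are illustrative, not the LeRobot schema.
import json
import os
import tempfile
from typing import List


def save_episode(root: str, idx: int, frames: List[dict]) -> str:
    """Write one episode's frames (observation/action records) as JSON lines."""
    ep_dir = os.path.join(root, f"episode_{idx:06d}")
    os.makedirs(ep_dir, exist_ok=True)
    path = os.path.join(ep_dir, "frames.jsonl")
    with open(path, "w") as f:
        for frame in frames:
            f.write(json.dumps(frame) + "\n")
    return path


root = tempfile.mkdtemp()
# Three dummy frames with a 14-DoF zero action each (placeholder data).
frames = [{"t": t, "action": [0.0] * 14} for t in range(3)]
path = save_episode(root, 0, frames)
print(os.path.exists(path))  # True
```

The actual conversion to the LeRobot format is handled by the tooling covered in the data-collection and training pages listed below.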
Recommended Reading Order¶
1. VLA/Environment_Setup.md
2. VLA/Data_Collection/VR.md
3. VLA/Training/Model_Download.md
4. VLA/Training/Single_GPU_Training.md or VLA/Training/Multi_GPU_Training.md
5. VLA/Inference.md
⚠️ Important Notes¶
- Pi0/Pi0.5 and SmolVLA/XVLA have dependency conflicts; use separate environments for each.
- For first experiments, complete a small closed loop: small data collection + short training + short inference.