VLA Overview

VLA (Vision-Language-Action) models combine visual perception, language understanding, and action control: they understand the scene, interpret an instruction, and output actions for robotic manipulation.

🧩 Hardware List

  • 1 OpenArmX bimanual robot
  • 2 RealSense D405 (left/right hand)
  • 1 RealSense D435 (head)
  • 1 teleoperation device (choose one): Pico4 Ultra (VR) / exoskeleton / OpenArmX leader version
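The rig above can be written down as a small configuration structure, which later scripts can validate against. A minimal sketch — the key names and serial-number placeholders are illustrative, not from this document:

```python
# Hypothetical config for the OpenArmX VLA rig listed above.
# Key names and serial numbers are placeholders -- fill in your own.
HARDWARE = {
    "robot": "OpenArmX bimanual",
    "cameras": {
        "left_wrist": {"model": "RealSense D405", "serial": "<left-serial>"},
        "right_wrist": {"model": "RealSense D405", "serial": "<right-serial>"},
        "head": {"model": "RealSense D435", "serial": "<head-serial>"},
    },
    # Exactly one teleoperation device is used per session:
    "teleop": "pico4_ultra",  # or "exoskeleton" / "openarmx_leader"
}

# Sanity check: three cameras, as in the hardware list.
assert len(HARDWARE["cameras"]) == 3
```

Keeping camera names in one place makes it easier to match them to dataset keys during data collection.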

One-line Summary

🖼️ See the scene + 💬 understand instruction -> 🤖 generate action

Core Capabilities

  • 👀 Vision understanding: objects, positions, scene state
  • 🧠 Language understanding: task goals from natural language
  • ✋ Action decision: next action for arm/robot

Typical Pipeline

Camera images + text instruction
        ↓
      VLA model
        ↓
   Robot action sequence
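One inference step of the pipeline above can be sketched in a few lines. The model here is a stand-in that returns zeros; a real VLA (e.g. Pi0 or SmolVLA) would encode the images and instruction and decode the action sequence. Shapes and camera names are assumptions for illustration:

```python
def vla_step(images: dict, instruction: str) -> list:
    """Mock VLA forward pass: camera images + text in, action chunk out.

    Stand-in for a real model: returns a zero action chunk with an
    assumed shape of (chunk_len=50 steps, dof=14 joints for a bimanual arm).
    """
    chunk_len, dof = 50, 14
    return [[0.0] * dof for _ in range(chunk_len)]

# Camera images + text instruction -> VLA model -> robot action sequence
obs = {
    "head": "<D435 frame>",         # placeholders for real image arrays
    "left_wrist": "<D405 frame>",
    "right_wrist": "<D405 frame>",
}
actions = vla_step(obs, "pick up the red cube")
print(len(actions), len(actions[0]))  # 50 14
```

The robot then executes the chunk (or part of it) before the next observation is fed back in, closing the control loop.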

Common Applications

  • 🏠 Home service robotics
  • 🏭 Industrial sorting and assembly
  • 🧪 Laboratory automation

✅ Key value: VLA enables robots to naturally understand and execute human instructions.

OpenArmX VLA Workflow

Data collection (VR / leader arm / exoskeleton)
          ↓
      LeRobot dataset
          ↓
       Model training
          ↓
      Real-robot inference

Recommended reading order:

  1. VLA/Environment_Setup.md
  2. VLA/Data_Collection/VR.md
  3. VLA/Training/Model_Download.md
  4. VLA/Training/Single_GPU_Training.md or VLA/Training/Multi_GPU_Training.md
  5. VLA/Inference.md
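Step 2 of the workflow produces a LeRobot dataset. As a rough sketch, one recorded frame looks like the dictionary below — the field names follow the common LeRobot layout but are illustrative here; check the LeRobot docs for the exact schema of your version:

```python
# One frame of a LeRobot-style episode (illustrative field names).
frame = {
    "observation.images.head": "<480x640x3 image>",
    "observation.images.left_wrist": "<480x640x3 image>",
    "observation.images.right_wrist": "<480x640x3 image>",
    "observation.state": [0.0] * 14,  # joint positions (bimanual, assumed 14-DoF)
    "action": [0.0] * 14,             # teleop command the model learns to imitate
    "timestamp": 0.033,               # seconds since episode start
    "episode_index": 0,
    "frame_index": 1,
}
```

Training (step 3) consumes many such frames; inference (step 5) produces the `action` field from the observation fields.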

⚠️ Important Notes

  • Pi0/Pi0.5 and SmolVLA/XVLA have dependency conflicts; use separate environments.
  • For the first experiment, close a small loop end to end: collect a little data, run a short training, and verify inference on the robot.
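One way to keep the conflicting dependency sets apart is one virtual environment per model family. A minimal sketch using the standard-library `venv` module — the environment names are arbitrary:

```python
import venv
from pathlib import Path

# Hypothetical env names: one environment per model family, to avoid
# the Pi0/Pi0.5 vs SmolVLA/XVLA dependency conflicts noted above.
for name in ("env-pi0", "env-smolvla"):
    venv.create(Path(name), with_pip=False)  # with_pip=False: fast, offline

# Then activate the one matching the model you train or run, e.g.:
#   source env-pi0/bin/activate   (Linux/macOS)
# and install that family's requirements inside it.
```

conda environments work equally well; the point is only that the two dependency sets never share one environment.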