VLA Overview

VLA (Vision-Language-Action) models combine visual perception, language understanding, and action control: they understand the scene, interpret an instruction, and output actions for robotic manipulation.

🧩 Hardware List

  • 1 OpenArmX bimanual robot
  • 2 RealSense D405 (left/right hand)
  • 1 RealSense D435 (head)
  • 1 teleoperation device (choose one): Pico4 Ultra (VR) / exoskeleton / OpenArmX leader version
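The rig above can be written down as a small configuration structure, which later scripts can validate against. A minimal sketch — the key names and serial-number placeholders are illustrative, not from this document:

```python
# Hypothetical config for the OpenArmX VLA rig listed above.
# Key names and serial numbers are placeholders -- fill in your own.
HARDWARE = {
    "robot": "OpenArmX bimanual",
    "cameras": {
        "left_wrist": {"model": "RealSense D405", "serial": "<left-serial>"},
        "right_wrist": {"model": "RealSense D405", "serial": "<right-serial>"},
        "head": {"model": "RealSense D435", "serial": "<head-serial>"},
    },
    # Exactly one teleoperation device is used per session:
    "teleop": "pico4_ultra",  # or "exoskeleton" / "openarmx_leader"
}

# Sanity check: three cameras, as in the hardware list.
assert len(HARDWARE["cameras"]) == 3
```

Keeping camera names in one place makes it easier to match them to dataset keys during data collection.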

One-line Summary

🖼️ See the scene + 💬 understand instruction -> 🤖 generate action

Core Capabilities

  • 👀 Vision understanding: objects, positions, scene state
  • 🧠 Language understanding: task goals from natural language
  • ✋ Action decision: next action for arm/robot

Typical Pipeline

Camera images + text instruction
        ↓
      VLA model
        ↓
   Robot action sequence
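One inference step of the pipeline above can be sketched in a few lines. The model here is a stand-in that returns zeros; a real VLA (e.g. Pi0 or SmolVLA) would encode the images and instruction and decode the action sequence. Shapes and camera names are assumptions for illustration:

```python
def vla_step(images: dict, instruction: str) -> list:
    """Mock VLA forward pass: camera images + text in, action chunk out.

    Stand-in for a real model: returns a zero action chunk with an
    assumed shape of (chunk_len=50 steps, dof=14 joints for a bimanual arm).
    """
    chunk_len, dof = 50, 14
    return [[0.0] * dof for _ in range(chunk_len)]

# Camera images + text instruction -> VLA model -> robot action sequence
obs = {
    "head": "<D435 frame>",         # placeholders for real image arrays
    "left_wrist": "<D405 frame>",
    "right_wrist": "<D405 frame>",
}
actions = vla_step(obs, "pick up the red cube")
print(len(actions), len(actions[0]))  # 50 14
```

The robot then executes the chunk (or part of it) before the next observation is fed back in, closing the control loop.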

Common Applications

  • 🏠 Home service robotics
  • 🏭 Industrial sorting and assembly
  • 🧪 Laboratory automation

✅ Key value: VLA enables robots to naturally understand and execute human instructions.

OpenArmX VLA Workflow

Data collection (VR / leader arm / exoskeleton)
          ↓
      LeRobot dataset
          ↓
       Model training
          ↓
      Real-robot inference

Recommended reading order:

  1. VLA/Environment_Setup.md
  2. VLA/Data_Collection/VR.md
  3. VLA/Training/Model_Download.md
  4. VLA/Training/Single_GPU_Training.md or VLA/Training/Multi_GPU_Training.md
  5. VLA/Inference.md
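Step 2 of the workflow produces a LeRobot dataset. As a rough sketch, one recorded frame looks like the dictionary below — the field names follow the common LeRobot layout but are illustrative here; check the LeRobot docs for the exact schema of your version:

```python
# One frame of a LeRobot-style episode (illustrative field names).
frame = {
    "observation.images.head": "<480x640x3 image>",
    "observation.images.left_wrist": "<480x640x3 image>",
    "observation.images.right_wrist": "<480x640x3 image>",
    "observation.state": [0.0] * 14,  # joint positions (bimanual, assumed 14-DoF)
    "action": [0.0] * 14,             # teleop command the model learns to imitate
    "timestamp": 0.033,               # seconds since episode start
    "episode_index": 0,
    "frame_index": 1,
}
```

Training (step 3) consumes many such frames; inference (step 5) produces the `action` field from the observation fields.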

⚠️ Important Notes

  • Pi0/Pi0.5 and SmolVLA/XVLA have dependency conflicts; use separate environments.
  • For the first experiment, close a small loop end to end: collect a little data, run a short training, and verify inference on the robot.
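One way to keep the conflicting dependency sets apart is one virtual environment per model family. A minimal sketch using the standard-library `venv` module — the environment names are arbitrary:

```python
import venv
from pathlib import Path

# Hypothetical env names: one environment per model family, to avoid
# the Pi0/Pi0.5 vs SmolVLA/XVLA dependency conflicts noted above.
for name in ("env-pi0", "env-smolvla"):
    venv.create(Path(name), with_pip=False)  # with_pip=False: fast, offline

# Then activate the one matching the model you train or run, e.g.:
#   source env-pi0/bin/activate   (Linux/macOS)
# and install that family's requirements inside it.
```

conda environments work equally well; the point is only that the two dependency sets never share one environment.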