1. Background
1.1. OpenVLA
1.1.1. Model Architecture: VLM = ViT + Llama2
- OpenVLA is built on a vision-language model (VLM): a ViT image encoder feeding a Llama2 7B language backbone.
- OpenVLA takes an input image and a natural-language command and generates robot actions; the model itself serves as the control policy (see the sketch below).
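A minimal sketch of this ViT + LLM composition, in the usual "project visual tokens into the LLM embedding space" style. Module names, dimensions, and the projector design are illustrative assumptions, not OpenVLA's actual implementation:

```python
import torch
import torch.nn as nn

class ToyVLAPolicy(nn.Module):
    """Illustrative ViT-encoder + LLM-backbone composition (not OpenVLA's real code)."""

    def __init__(self, vit_encoder, llm_backbone, vit_dim=1024, llm_dim=4096):
        super().__init__()
        self.vit = vit_encoder                         # produces patch-level image features
        self.projector = nn.Linear(vit_dim, llm_dim)   # map vision features into the LLM embedding space
        self.llm = llm_backbone                        # e.g. a Llama2-7B causal LM

    def forward(self, image, text_embeds):
        patch_feats = self.vit(image)                      # (B, num_patches, vit_dim)
        vis_tokens = self.projector(patch_feats)           # (B, num_patches, llm_dim)
        inputs = torch.cat([vis_tokens, text_embeds], 1)   # prepend image tokens to the language prompt
        return self.llm(inputs_embeds=inputs)              # decode action tokens autoregressively
```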
1.1.2. Closed-Loop Robot Control Policy
- After the robot executes each action, OpenVLA processes the new camera image together with the original command to produce the next action, repeating until the task is complete (see the rollout sketch below).
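Conceptually, the closed-loop rollout looks like the following sketch; the `env` and `policy` interfaces here are hypothetical placeholders, not a real simulator or OpenVLA API:

```python
def run_episode(env, policy, instruction, max_steps=200):
    """Closed-loop control: re-query the policy after every executed action."""
    obs = env.reset()
    for _ in range(max_steps):
        image = obs["image"]                          # latest camera frame
        action = policy.predict(image, instruction)   # VLA maps (image, command) -> action
        obs, done = env.step(action)                  # robot executes the action, observation updates
        if done:                                      # stop once the task is complete
            break
```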
2. Problems
2.1. Problem 1: Large Model Footprint
- With roughly 7B parameters, OpenVLA has a large memory footprint, and both training and inference are slow.
2.2. Problem 2: Limited Input and Output Representations
- OpenVLA's input (a single current image per step) and output (one binned action per step) are both limited and can be improved.
3. Hypothesis
4. Solutions
4.1. Solution 1 (for Problem 1): Use a Smaller Backbone Model
- Replace the Llama2 7B backbone in OpenVLA with a Qwen2.5 0.5B backbone, keep the ViT encoder, and train the resulting VLM on the Llava-1.5-Instruct Visual Question Answering (VQA) dataset (see the sketch below).
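A minimal sketch of the backbone swap, assuming Hugging Face `transformers` checkpoints; the ViT checkpoint chosen here is illustrative and not OpenVLA's actual vision encoder:

```python
import torch.nn as nn
from transformers import AutoModel, AutoModelForCausalLM

# Illustrative checkpoints: a generic ViT encoder plus the smaller Qwen2.5 backbone.
vision = AutoModel.from_pretrained("google/vit-base-patch16-224-in21k")
backbone = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")

# Re-project ViT patch features into the (much smaller) Qwen2.5 embedding space.
projector = nn.Linear(vision.config.hidden_size, backbone.config.hidden_size)

# Training then follows the usual VLM recipe: prepend projected patch tokens to the
# question embeddings from the VQA data (e.g. the Llava-1.5-Instruct mixture) and
# minimize next-token cross-entropy on the answers.
```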
4.2. Solution 2 (for Problem 2): Add Multi-Image Support to Inputs
- Add history frames and a wrist-camera view to the inputs, alongside the current image (see the sketch below).
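One simple way to support multiple images is to encode each view or frame separately and concatenate the resulting visual tokens into a longer prefix for the LLM. The function below is a hypothetical sketch; `encoder` and `projector` are assumed to behave like the components above:

```python
import torch

def build_visual_tokens(encoder, projector, current, history, wrist):
    """Encode each view/frame separately, then concatenate along the token axis.

    `current`, `history`, and `wrist` are image tensors; the names are illustrative.
    """
    frames = [current, wrist] + list(history)          # 1 current + 1 wrist + k past frames
    tokens = [projector(encoder(f)) for f in frames]   # each: (B, num_patches, llm_dim)
    return torch.cat(tokens, dim=1)                    # longer visual prefix for the LLM
```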
4.3. Solution 3 (for Problem 2): Use Vector Quantized Action Chunking for Outputs
4.3.1. OpenVLA's Binning Scheme