1. Background
1.1. OpenVLA
1.1.1. Model Architecture: VLM = ViT + Llama2
- OpenVLA is built on a vision-language model (VLM): a ViT image encoder feeding a Llama2 7B language backbone.
- OpenVLA takes an input image and a natural-language command and generates robot actions; the model itself serves as the control policy (see the sketch below).
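A minimal sketch of this ViT + LLM composition, in the usual "project visual tokens into the LLM embedding space" style. Module names, dimensions, and the projector design are illustrative assumptions, not OpenVLA's actual implementation:

```python
import torch
import torch.nn as nn

class ToyVLAPolicy(nn.Module):
    """Illustrative ViT-encoder + LLM-backbone composition (not OpenVLA's real code)."""

    def __init__(self, vit_encoder, llm_backbone, vit_dim=1024, llm_dim=4096):
        super().__init__()
        self.vit = vit_encoder                         # produces patch-level image features
        self.projector = nn.Linear(vit_dim, llm_dim)   # map vision features into the LLM embedding space
        self.llm = llm_backbone                        # e.g. a Llama2-7B causal LM

    def forward(self, image, text_embeds):
        patch_feats = self.vit(image)                      # (B, num_patches, vit_dim)
        vis_tokens = self.projector(patch_feats)           # (B, num_patches, llm_dim)
        inputs = torch.cat([vis_tokens, text_embeds], 1)   # prepend image tokens to the language prompt
        return self.llm(inputs_embeds=inputs)              # decode action tokens autoregressively
```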
1.1.2. Closed-Loop Robot Control Policy
- After the robot executes each action, OpenVLA processes the new camera image together with the original command to produce the next action, repeating until the task is complete (see the rollout sketch below).
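Conceptually, the closed-loop rollout looks like the following sketch; the `env` and `policy` interfaces here are hypothetical placeholders, not a real simulator or OpenVLA API:

```python
def run_episode(env, policy, instruction, max_steps=200):
    """Closed-loop control: re-query the policy after every executed action."""
    obs = env.reset()
    for _ in range(max_steps):
        image = obs["image"]                          # latest camera frame
        action = policy.predict(image, instruction)   # VLA maps (image, command) -> action
        obs, done = env.step(action)                  # robot executes the action, observation updates
        if done:                                      # stop once the task is complete
            break
```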
2. Problems
2.1. Problem 1: Large Model Footprint
- With roughly 7B parameters, OpenVLA has a large memory footprint, and both training and inference are slow.
2.2. Problem 2: Limited Input and Output Representations
- OpenVLA's input (a single current image per step) and output (one binned action per step) are both limited and can be improved.
3. Hypothesis
4. Solutions
4.1. Solution 1 (for Problem 1): Use a Smaller Backbone Model
- Replace the Llama2 7B backbone in OpenVLA with a Qwen2.5 0.5B backbone, keep the ViT encoder, and train the resulting VLM on the Llava-1.5-Instruct Visual Question Answering (VQA) dataset (see the sketch below).
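A minimal sketch of the backbone swap, assuming Hugging Face `transformers` checkpoints; the ViT checkpoint chosen here is illustrative and not OpenVLA's actual vision encoder:

```python
import torch.nn as nn
from transformers import AutoModel, AutoModelForCausalLM

# Illustrative checkpoints: a generic ViT encoder plus the smaller Qwen2.5 backbone.
vision = AutoModel.from_pretrained("google/vit-base-patch16-224-in21k")
backbone = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")

# Re-project ViT patch features into the (much smaller) Qwen2.5 embedding space.
projector = nn.Linear(vision.config.hidden_size, backbone.config.hidden_size)

# Training then follows the usual VLM recipe: prepend projected patch tokens to the
# question embeddings from the VQA data (e.g. the Llava-1.5-Instruct mixture) and
# minimize next-token cross-entropy on the answers.
```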
4.2. Solution 2 (for Problem 2): Add Multi-Image Support to Inputs
- Add history frames and a wrist-camera view to the inputs, alongside the current image (see the sketch below).
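One simple way to support multiple images is to encode each view or frame separately and concatenate the resulting visual tokens into a longer prefix for the LLM. The function below is a hypothetical sketch; `encoder` and `projector` are assumed to behave like the components above:

```python
import torch

def build_visual_tokens(encoder, projector, current, history, wrist):
    """Encode each view/frame separately, then concatenate along the token axis.

    `current`, `history`, and `wrist` are image tensors; the names are illustrative.
    """
    frames = [current, wrist] + list(history)          # 1 current + 1 wrist + k past frames
    tokens = [projector(encoder(f)) for f in frames]   # each: (B, num_patches, llm_dim)
    return torch.cat(tokens, dim=1)                    # longer visual prefix for the LLM
```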
4.3. Solution 3 (for Problem 2): Use Vector Quantized Action Chunking for Outputs
4.3.1. OpenVLA's Binning Scheme