1. Background

1.1. OpenVLA

1.1.1. Model Architecture: VLM = ViT + Llama2

1.1.2. Closed-Loop Robot Control Policy

2. Problems

2.1. Problem 1: Large Model Footprint

2.2. Problem 2

3. Hypothesis

4. Solutions

4.1. Solution 1 (for Problem 1): Use a Smaller Backbone Model

4.2. Solution 2 (for Problem 2): Add Multi-Image Support to Inputs

4.3. Solution 3 (for Problem 2): Use Vector-Quantized Action Chunking for Outputs

4.3.1. OpenVLA’s Binning Scheme