1. Background

1.1. Universal Transformers

Compared with standard Transformers, Universal Transformers have the following characteristics, advantages, and disadvantages.

1.1.1. Characteristics

1.1.2. Advantages

1.1.3. Disadvantages

1.2. MoE: Mixture-of-Experts

1.2.1. MoE Feedforward Blocks: $\sigma$-MoE

$$ y_t = \sum_{e \in \varepsilon(x_t)} s_t[e] \text{ReLU}(x_t W_1^e) W_2^e $$
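The formula above can be read as "score every expert, keep a few, and sum their gated feedforward outputs". The following is a minimal sketch of that computation for a single token, not a reference implementation. The equation does not define $s_t$ or $\varepsilon(x_t)$, so two assumptions are made here: $s_t$ is taken to be an element-wise sigmoid of $x_t W_s$, and $\varepsilon(x_t)$ is taken to be the indices of the $k$ highest-scoring experts. The names `sigma_moe`, `W_s`, and `k` are hypothetical and chosen for illustration only. Plain Python lists are used so the sketch runs without third-party libraries.

```python
import math


def sigmoid(z):
    """Logistic function, used here as the (assumed) expert gate."""
    return 1.0 / (1.0 + math.exp(-z))


def vec_mat(x, W):
    """Multiply a row vector x (length n) by a matrix W (n x m)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def sigma_moe(x_t, W_s, W1, W2, k):
    """Compute y_t for one token.

    x_t : input vector of length d
    W_s : d x n_experts scoring matrix (assumption: s_t = sigmoid(x_t W_s))
    W1  : list of per-expert d x d_hidden matrices (W_1^e)
    W2  : list of per-expert d_hidden x d_out matrices (W_2^e)
    k   : number of experts kept (assumption: epsilon(x_t) is top-k by score)
    """
    # s_t[e]: one gate value per expert.
    scores = [sigmoid(v) for v in vec_mat(x_t, W_s)]

    # epsilon(x_t): indices of the k highest-scoring experts.
    selected = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]

    # y_t = sum over selected experts of s_t[e] * ReLU(x_t W_1^e) W_2^e
    y_t = [0.0] * len(W2[0][0])
    for e in selected:
        hidden = [max(0.0, h) for h in vec_mat(x_t, W1[e])]  # ReLU(x_t W_1^e)
        out = vec_mat(hidden, W2[e])                         # ... W_2^e
        y_t = [y + scores[e] * o for y, o in zip(y_t, out)]
    return y_t, selected


if __name__ == "__main__":
    # Toy example: 2-dimensional token, 3 experts, keep the best 2.
    identity = [[1.0, 0.0], [0.0, 1.0]]
    x_t = [1.0, 2.0]
    W_s = [[0.0, 1.0, -1.0], [0.0, 0.0, 0.0]]
    W1 = [identity, identity, identity]
    W2 = [[[float(e + 1) * v for v in row] for row in identity] for e in range(3)]
    y_t, selected = sigma_moe(x_t, W_s, W1, W2, k=2)
    print("selected experts:", selected)
    print("y_t:", y_t)
```

Only the selected experts are evaluated inside the loop, which is where the compute saving of a sparse MoE layer comes from: the cost grows with `k`, not with the total number of experts.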
