Compared to the dense feedforward blocks of standard Transformers, the $\sigma$-MoE layers used in Universal Transformers have the following features.
Expert Division: $\sigma$-MoE divides the up-projection and down-projection matrices of the feedforward block into $N_E$ experts $W_1^e \in \R^{d_{model} \times d_{expert}}$ and $W_2^e \in \R^{d_{expert} \times d_{model}}$, where $e \in \{1, \ldots, N_E\}$.
Expert Selection: for each token $x_t$, $\sigma$-MoE computes a sigmoid gate score for every expert and selects the $K$ experts with the highest scores.
$$ s_t = \sigma(x_tW_S), \quad \text{where} \quad x_t \in \R^{d_{model}} \quad \text{and}\quad W_S \in \R^{d_{model} \times N_E} $$
$$ \varepsilon(x_t) = \text{arg topk}(s_t, K), \quad \text{where} \quad s_t \in \R^{N_E} $$
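For concreteness, a minimal PyTorch sketch of the scoring and selection step for a single token; the sizes (`d_model=8`, `N_E=4`, `K=2`) and the random initialization of `W_S` are purely illustrative assumptions, not values from any particular implementation.

```python
import torch

torch.manual_seed(0)
d_model, N_E, K = 8, 4, 2               # illustrative sizes (assumption)

x_t = torch.randn(d_model)              # one token's hidden state
W_S = torch.randn(d_model, N_E)         # expert-scoring (gate) matrix, random for illustration

s_t = torch.sigmoid(x_t @ W_S)          # s_t = sigma(x_t W_S), shape [N_E]
scores, experts = torch.topk(s_t, K)    # E(x_t): indices (and scores) of the K best experts
print(experts.tolist(), scores.tolist())
```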
Expert Mixture: the final output $y_t$ for each token is the score-weighted sum of the selected experts' feedforward outputs.
$$ y_t = \sum_{e \in \varepsilon(x_t)} s_t[e] \, \text{ReLU}(x_t W_1^e) W_2^e $$
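Putting the three steps together, below is a sketch of a full $\sigma$-MoE layer as a PyTorch module, assuming only the equations above. The class name `SigmaMoE`, the weight initialization, and the per-token gather of expert matrices are illustrative choices of mine, not the reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SigmaMoE(nn.Module):
    """Sketch of a sigma-MoE feedforward layer: sigmoid gating, top-K expert
    selection, and a score-weighted mixture of the selected experts' outputs."""

    def __init__(self, d_model: int, d_expert: int, n_experts: int, k: int):
        super().__init__()
        self.k = k
        # Expert division: N_E up-projections W1^e and down-projections W2^e.
        self.W1 = nn.Parameter(torch.randn(n_experts, d_model, d_expert) * d_model ** -0.5)
        self.W2 = nn.Parameter(torch.randn(n_experts, d_expert, d_model) * d_expert ** -0.5)
        # Expert-scoring matrix W_S.
        self.W_S = nn.Parameter(torch.randn(d_model, n_experts) * d_model ** -0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [num_tokens, d_model] (batch and sequence dims flattened beforehand).
        s = torch.sigmoid(x @ self.W_S)               # scores s_t, [num_tokens, N_E]
        top_s, top_e = torch.topk(s, self.k, dim=-1)  # selected experts E(x_t) and their scores

        y = torch.zeros_like(x)
        for i in range(self.k):
            e = top_e[:, i]                           # expert chosen at rank i, per token
            W1_e = self.W1[e]                         # [num_tokens, d_model, d_expert]
            W2_e = self.W2[e]                         # [num_tokens, d_expert, d_model]
            h = F.relu(torch.einsum("td,tde->te", x, W1_e))
            y = y + top_s[:, i, None] * torch.einsum("te,ted->td", h, W2_e)
        return y


# Usage with illustrative sizes only: 10 tokens in, [10, d_model] out.
moe = SigmaMoE(d_model=8, d_expert=16, n_experts=4, k=2)
out = moe(torch.randn(10, 8))
```

The loop over the $K$ selected ranks keeps the code close to the summation over $\varepsilon(x_t)$; an efficient implementation would instead route tokens to experts rather than gathering full expert matrices per token.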