Compared to the dense feedforward blocks of standard Transformers, the $\sigma$-MoE layers used in Universal Transformers have the following features.
Expert Division: $\sigma$-MoE divides the up-projection and down-projection matrices of the feedforward block into $N_E$ experts $W_1^e \in \R^{d_{model} \times d_{expert}}$ and $W_2^e \in \R^{d_{expert} \times d_{model}}$, where $e \in \{1, \ldots, N_E\}$.
Expert Selection: for each token $x_t$, $\sigma$-MoE computes a sigmoid gate score for every expert and selects the $K$ experts with the highest scores.
$$ s_t = \sigma(x_tW_S), \quad \text{where} \quad x_t \in \R^{d_{model}} \quad \text{and}\quad W_S \in \R^{d_{model} \times N_E} $$
$$ \varepsilon(x_t) = \text{arg topk}(s_t, K), \quad \text{where} \quad s_t \in \R^{N_E} $$
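For concreteness, a minimal PyTorch sketch of the scoring and selection step for a single token; the sizes (`d_model=8`, `N_E=4`, `K=2`) and the random initialization of `W_S` are purely illustrative assumptions, not values from any particular implementation.

```python
import torch

torch.manual_seed(0)
d_model, N_E, K = 8, 4, 2               # illustrative sizes (assumption)

x_t = torch.randn(d_model)              # one token's hidden state
W_S = torch.randn(d_model, N_E)         # expert-scoring (gate) matrix, random for illustration

s_t = torch.sigmoid(x_t @ W_S)          # s_t = sigma(x_t W_S), shape [N_E]
scores, experts = torch.topk(s_t, K)    # E(x_t): indices (and scores) of the K best experts
print(experts.tolist(), scores.tolist())
```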
Expert Mixture: the final output $y_t$ for each token is the score-weighted sum of the selected experts' feedforward outputs.
$$ y_t = \sum_{e \in \varepsilon(x_t)} s_t[e] \, \text{ReLU}(x_t W_1^e) W_2^e $$
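Putting the three steps together, below is a sketch of a full $\sigma$-MoE layer as a PyTorch module, assuming only the equations above. The class name `SigmaMoE`, the weight initialization, and the per-token gather of expert matrices are illustrative choices of mine, not the reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SigmaMoE(nn.Module):
    """Sketch of a sigma-MoE feedforward layer: sigmoid gating, top-K expert
    selection, and a score-weighted mixture of the selected experts' outputs."""

    def __init__(self, d_model: int, d_expert: int, n_experts: int, k: int):
        super().__init__()
        self.k = k
        # Expert division: N_E up-projections W1^e and down-projections W2^e.
        self.W1 = nn.Parameter(torch.randn(n_experts, d_model, d_expert) * d_model ** -0.5)
        self.W2 = nn.Parameter(torch.randn(n_experts, d_expert, d_model) * d_expert ** -0.5)
        # Expert-scoring matrix W_S.
        self.W_S = nn.Parameter(torch.randn(d_model, n_experts) * d_model ** -0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [num_tokens, d_model] (batch and sequence dims flattened beforehand).
        s = torch.sigmoid(x @ self.W_S)               # scores s_t, [num_tokens, N_E]
        top_s, top_e = torch.topk(s, self.k, dim=-1)  # selected experts E(x_t) and their scores

        y = torch.zeros_like(x)
        for i in range(self.k):
            e = top_e[:, i]                           # expert chosen at rank i, per token
            W1_e = self.W1[e]                         # [num_tokens, d_model, d_expert]
            W2_e = self.W2[e]                         # [num_tokens, d_expert, d_model]
            h = F.relu(torch.einsum("td,tde->te", x, W1_e))
            y = y + top_s[:, i, None] * torch.einsum("te,ted->td", h, W2_e)
        return y


# Usage with illustrative sizes only: 10 tokens in, [10, d_model] out.
moe = SigmaMoE(d_model=8, d_expert=16, n_experts=4, k=2)
out = moe(torch.randn(10, 8))
```

The loop over the $K$ selected ranks keeps the code close to the summation over $\varepsilon(x_t)$; an efficient implementation would instead route tokens to experts rather than gathering full expert matrices per token.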