Mixture of Experts (DeepSpeed MoE)

DeepSpeed provides two forms of MoE support: DeepSpeed MoE and AutoEP (Automatic Expert Parallelism). DeepSpeed MoE is the explicit deepspeed.moe.layer.MoE API for constructing MoE layers in model code. This page introduces the DeepSpeed MoE API.

See also the Mixture of Experts (DeepSpeed MoE) tutorial for training examples and configuration details.

class deepspeed.moe.layer.MoE(hidden_size: int, expert: Module, num_experts: int = 1, ep_size: int = 1, k: int = 1, capacity_factor: float = 1.0, eval_capacity_factor: float = 1.0, min_capacity: int = 4, use_residual: bool = False, noisy_gate_policy: Optional[str] = None, drop_tokens: bool = True, use_rts: bool = True, use_tutel: bool = False, enable_expert_tensor_parallelism: bool = False, top2_2nd_expert_sampling: bool = True)[source]

Initialize an MoE layer.

Parameters

hidden_size (int) – the hidden dimension of the model, importantly this is also the input and output dimension.
expert (nn.Module) – the torch module that defines the expert (e.g., MLP, torch.linear).
num_experts (int, optional) – default=1, the total number of experts per layer.
ep_size (int, optional) – default=1, number of ranks in the expert parallel world or group.
k (int, optional) – default=1, top-k gating value, only supports k=1 or k=2.
capacity_factor (float, optional) – default=1.0, the capacity of the expert at training time.
eval_capacity_factor (float, optional) – default=1.0, the capacity of the expert at eval time.
min_capacity (int, optional) – default=4, the minimum capacity per expert regardless of the capacity_factor.
use_residual (bool, optional) – default=False, make this MoE layer a Residual MoE (https://arxiv.org/abs/2201.05596) layer.
noisy_gate_policy (str, optional) – default=None, noisy gate policy, valid options are ‘Jitter’, ‘RSample’ or ‘None’.
drop_tokens (bool, optional) – default=True, whether to drop tokens - (setting to False is equivalent to infinite capacity).
use_rts (bool, optional) – default=True, whether to use Random Token Selection.
use_tutel (bool, optional) – default=False, whether to use Tutel optimizations (if installed).
enable_expert_tensor_parallelism (bool, optional) – default=False, whether to use tensor parallelism for experts
top2_2nd_expert_sampling (bool, optional) – default=True, whether to perform sampling for 2nd expert

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(hidden_states: Tensor, used_token: Optional[Tensor] = None) → Tuple[Tensor, Tensor, Tensor][source]

MoE forward

Parameters

hidden_states (Tensor) – input to the layer
used_token (Tensor, optional) – default: None, mask only used tokens

Returns

A tuple including output, gate loss, and expert count.

output (Tensor): output of the model
l_aux (Tensor): gate loss value
exp_counts (Tensor): expert count