Mixture of Experts (MoE)

Layer specification

class deepspeed.moe.layer.MoE(hidden_size, expert, num_experts=1, ep_size=1, k=1, capacity_factor=1.0, eval_capacity_factor=1.0, min_capacity=4, use_residual=False, noisy_gate_policy: Optional[str] = None, drop_tokens: bool = True, use_rts=True, use_tutel: bool = False, enable_expert_tensor_parallelism: bool = False)
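
The signature lists capacity_factor, eval_capacity_factor, min_capacity, and drop_tokens without saying how they interact. The sketch below shows one common way a per-expert token capacity is derived from such settings. It is an illustration under stated assumptions, not a copy of the library's code: the helper name expert_capacity and the exact rounding rule (ceiling of the scaled even share, floored at min_capacity) are assumptions made for this example.

```python
import math


def expert_capacity(num_tokens: int, num_experts: int,
                    capacity_factor: float = 1.0, min_capacity: int = 4) -> int:
    """Hypothetical helper: number of tokens each expert may accept.

    Assumed rule: scale the even share of tokens per expert by
    capacity_factor, round up, and never go below min_capacity.
    """
    capacity = math.ceil(num_tokens / num_experts * capacity_factor)
    return max(capacity, min_capacity)


# An even split of 1024 tokens over 8 experts.
print(expert_capacity(1024, 8))                        # 128
# A larger factor leaves headroom for imbalanced routing.
print(expert_capacity(1024, 8, capacity_factor=1.25))  # 160
# With very few tokens the min_capacity floor takes over.
print(expert_capacity(16, 8))                          # 4
```

Under this reading, drop_tokens=True would mean that tokens routed to an expert beyond its capacity are dropped, and eval_capacity_factor would play the role of capacity_factor at evaluation time.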
forward(hidden_states, used_token=None)

Forward pass of the MoE layer.

Parameters:

  • hidden_states (Tensor) – input to the layer

  • used_token (Tensor, optional) – mask that selects only the used tokens; default: None


Returns:

A tuple of the output, the gate loss, and the expert count.

  • output (Tensor): output of the MoE layer

  • l_aux (Tensor): gate loss value

  • exp_counts (int): expert count
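
To make the returned tuple concrete, here is a minimal sketch of how a caller might consume (output, l_aux, exp_counts) during training. It does not use deepspeed.moe.layer.MoE itself, which needs a distributed setup. ToyTop1MoE is a hypothetical single-process stand-in written for this example, and its top-1 routing, its load-balancing formula, and the 0.01 loss weight are all assumptions rather than the library's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyTop1MoE(nn.Module):
    """Hypothetical stand-in that mimics the (output, l_aux, exp_counts) tuple."""

    def __init__(self, hidden_size: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        self.experts = nn.ModuleList(
            [nn.Linear(hidden_size, hidden_size) for _ in range(num_experts)]
        )

    def forward(self, hidden_states: torch.Tensor):
        # Flatten (batch, seq, hidden) into (tokens, hidden) for routing.
        tokens = hidden_states.reshape(-1, hidden_states.shape[-1])
        probs = F.softmax(self.gate(tokens), dim=-1)   # (tokens, experts)
        top1 = probs.argmax(dim=-1)                    # chosen expert per token
        num_experts = probs.shape[-1]
        mask = F.one_hot(top1, num_experts).float()    # (tokens, experts)

        # Assumed load-balancing loss: mean gate probability times the
        # fraction of tokens routed to each expert, summed and rescaled.
        l_aux = (probs.mean(dim=0) * mask.mean(dim=0)).sum() * num_experts

        # Number of tokens routed to each expert.
        exp_counts = mask.sum(dim=0).long()

        # Run each expert on its own tokens, weighted by the gate probability.
        out = torch.zeros_like(tokens)
        for idx, expert in enumerate(self.experts):
            selected = top1 == idx
            if selected.any():
                weight = probs[selected, idx].unsqueeze(-1)
                out[selected] = weight * expert(tokens[selected])
        return out.reshape(hidden_states.shape), l_aux, exp_counts


torch.manual_seed(0)
layer = ToyTop1MoE(hidden_size=8, num_experts=4)
x = torch.randn(2, 5, 8)

output, l_aux, exp_counts = layer(x)

# Placeholder task loss; the auxiliary gate loss is added with a small weight.
task_loss = output.pow(2).mean()
loss = task_loss + 0.01 * l_aux
loss.backward()
```

Adding l_aux to the task loss is what lets the gate learn a balanced routing; exp_counts is not part of the loss and is typically only logged to monitor how evenly tokens are spread across experts.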