Mixture of Experts (MoE)

Layer specification

class deepspeed.moe.layer.MoE(hidden_size, expert, num_experts=1, k=1, capacity_factor=1.0, eval_capacity_factor=1.0, min_capacity=4, noisy_gate_policy: Optional[str] = None, drop_tokens: bool = True, use_rts=True, use_tutel: bool = False)[source]

Initialize an MoE layer.

  • hidden_size (int) – the hidden dimension of the model, importantly this is also the input and output dimension.
  • expert (torch.nn.Module) – the torch module that defines the expert (e.g., an MLP or torch.nn.Linear).
  • num_experts (int, optional) – default=1, the total number of experts per layer.
  • k (int, optional) – default=1, top-k gating value, only supports k=1 or k=2.
  • capacity_factor (float, optional) – default=1.0, the capacity of the expert at training time.
  • eval_capacity_factor (float, optional) – default=1.0, the capacity of the expert at eval time.
  • min_capacity (int, optional) – default=4, the minimum capacity per expert regardless of the capacity_factor.
  • noisy_gate_policy (str, optional) – default=None, noisy gate policy, valid options are ‘Jitter’, ‘RSample’ or ‘None’.
  • drop_tokens (bool, optional) – default=True, whether to drop tokens that exceed expert capacity (setting this to False is equivalent to infinite capacity).
  • use_rts (bool, optional) – default=True, whether to use Random Token Selection.
  • use_tutel (bool, optional) – default=False, whether to use Tutel optimizations (if installed).
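The interplay of capacity_factor, eval_capacity_factor, and min_capacity can be sketched in plain Python. The helper below is illustrative only, not DeepSpeed's internal code; it mirrors how per-expert token capacity is typically derived in top-k gating:

```python
import math

def expert_capacity(num_tokens, num_experts, capacity_factor, min_capacity=4):
    """Tokens each expert may receive; overflow is dropped when drop_tokens=True.
    Illustrative sketch, not DeepSpeed's implementation."""
    capacity = math.ceil(num_tokens / num_experts * capacity_factor)
    return max(capacity, min_capacity)

# 1024 tokens routed across 8 experts at the default capacity_factor=1.0:
print(expert_capacity(1024, 8, 1.0))  # 128 tokens per expert
# A larger eval_capacity_factor loosens the limit at evaluation time:
print(expert_capacity(1024, 8, 2.0))  # 256
# min_capacity puts a floor under small batches:
print(expert_capacity(16, 8, 1.0))    # 4, not 2
```

Raising capacity_factor trades memory and compute for fewer dropped tokens.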
forward(hidden_states, used_token=None)[source]

MoE forward

  • hidden_states (Tensor) – input to the layer
  • used_token (Tensor, optional) – default: None, mask only used tokens

Returns a tuple of (output, l_aux, exp_counts):

  • output (Tensor): output of the layer
  • l_aux (Tensor): gate (load-balancing) loss value
  • exp_counts: number of tokens dispatched to each expert
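As a rough sketch of what the forward pass computes, here is a toy top-1 gate in plain Python. All names are hypothetical and the logic is simplified (no capacity limit, no tensors); DeepSpeed's actual gate runs in torch:

```python
import math
from collections import Counter

def top1_route(gate_logits, num_experts):
    """Send each token to its highest-scoring expert and compute the
    load-balancing auxiliary loss used by top-1 (Switch-style) gating.
    Illustrative sketch only, not DeepSpeed's implementation."""
    num_tokens = len(gate_logits)
    assignments = []
    mean_probs = [0.0] * num_experts  # mean gate probability per expert
    for logits in gate_logits:
        z = [math.exp(l - max(logits)) for l in logits]
        probs = [v / sum(z) for v in z]
        for e in range(num_experts):
            mean_probs[e] += probs[e] / num_tokens
        assignments.append(max(range(num_experts), key=lambda e: probs[e]))
    counts = Counter(assignments)
    exp_counts = [counts[e] for e in range(num_experts)]
    # fraction of tokens sent to each expert
    frac = [c / num_tokens for c in exp_counts]
    # l_aux is minimized (value 1.0) when load is perfectly uniform
    l_aux = num_experts * sum(f * p for f, p in zip(frac, mean_probs))
    return assignments, l_aux, exp_counts

# Two tokens, two experts; each token strongly prefers a different expert:
assignments, l_aux, exp_counts = top1_route([[2.0, 0.0], [0.0, 2.0]], 2)
print(assignments, exp_counts)  # [0, 1] [1, 1]
```

The l_aux term is what the MoE layer returns as its gate loss; it is added (scaled) to the task loss so the gate learns to spread tokens across experts.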

Groups initialization

deepspeed.utils.groups.initialize(ep_size=1, mpu=None, num_ep_list=None)[source]

Process groups initialization supporting expert (E), data (D), and model (M) parallelism. DeepSpeed considers the following scenarios w.r.t. process group creation.

  • S1: There is no expert parallelism or model parallelism, only data (D):

    model = my_model(args)
    engine = deepspeed.initialize(model) # initialize groups without mpu
  • S2: There is expert parallelism but no model parallelism (E+D):

    deepspeed.utils.groups.initialize(ep_size) # groups will be initialized here
    model = my_model(args)
    engine = deepspeed.initialize(model)
  • S3: There is model parallelism but no expert parallelism (M):

    mpu.init() # client initializes its model parallel unit
    model = my_model(args)
    engine = deepspeed.initialize(model, mpu=mpu) # init w. mpu but ep_size = dp_world_size
  • S4: There is model, data, and expert parallelism (E+D+M):

    mpu.init() # client initializes its model parallel unit
    deepspeed.utils.groups.initialize(ep_size, mpu) # initialize expert groups wrt mpu
    model = my_model(args)
    engine = deepspeed.initialize(model, mpu=mpu) # passing mpu is optional in this case
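The group layout in S2 can be visualized with a plain-Python sketch: with world_size ranks and a given ep_size, consecutive ranks form expert-parallel groups, and ranks at the same offset across those groups form expert-data-parallel groups. This is a simplified illustration of the E+D case, not DeepSpeed's code:

```python
def moe_groups(world_size, ep_size):
    """Partition ranks into expert-parallel (EP) and expert-data-parallel
    (EDP) groups for the E+D scenario. Illustrative sketch only."""
    assert world_size % ep_size == 0, "world size must be divisible by ep_size"
    # consecutive ranks share the experts of one layer replica
    ep_groups = [list(range(g * ep_size, (g + 1) * ep_size))
                 for g in range(world_size // ep_size)]
    # ranks holding the same expert slice form a data-parallel group
    edp_groups = [list(range(offset, world_size, ep_size))
                  for offset in range(ep_size)]
    return ep_groups, edp_groups

# 8 GPUs with ep_size=4: two EP groups of 4 ranks each.
ep, edp = moe_groups(8, 4)
print(ep)   # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(edp)  # [[0, 4], [1, 5], [2, 6], [3, 7]]
```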
  • ep_size (int, optional) – default=1, maximum expert parallel size; the world size should be divisible by ep_size, as should each element in num_ep_list.
  • mpu (module, optional) – default=None, model parallel unit (e.g., from Megatron) that describes model/data parallel ranks.
  • num_ep_list (list, optional) – default=None, list giving the number of experts in each MoE layer.
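The divisibility constraints on ep_size can be sketched as a small validation helper. The function name and exact checks are hypothetical, for illustration only; DeepSpeed performs its own validation internally:

```python
def check_ep_config(world_size, ep_size, num_ep_list=None):
    """Validate the constraints described above: the world size must be
    divisible by ep_size, and each per-layer expert count by ep_size.
    Hypothetical helper, not part of the DeepSpeed API."""
    if world_size % ep_size != 0:
        raise ValueError(
            f"world size {world_size} is not divisible by ep_size {ep_size}")
    for n in (num_ep_list or []):
        if n % ep_size != 0:
            raise ValueError(
                f"{n} experts cannot be split across ep_size={ep_size} ranks")
    return True

print(check_ep_config(8, 4, num_ep_list=[8, 4]))  # True
```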