AutoEP (Automatic Expert Parallelism)

AutoEP automatically detects MoE layers in Hugging Face models and replaces them with EP-enabled versions, requiring zero model code changes. It follows the pattern of AutoTP (Automatic Tensor Parallelism).

This API is separate from the explicit deepspeed.moe.layer.MoE layer API. For the explicit DeepSpeed MoE layer API, see Mixture of Experts (DeepSpeed MoE).

Built-in AutoEP presets: mixtral (Mixtral), qwen3_moe (Qwen3-MoE), qwen3_5_moe (Qwen3.5-MoE), deepseek_v2 (DeepSeek-V2), and deepseek_v3 (DeepSeek-V3).

The preset name means AutoEP knows the router, expert, and weight naming patterns for that model family. Running a Hugging Face model also requires a Transformers build that exposes the matching config/model classes, model.config.model_type value, and fused expert layout.

AutoEP preset compatibility by Transformers version

Preset

Minimum Transformers version

Notes

mixtral

5.0.0

qwen3_moe

5.0.0

Also covers Qwen2-MoE when the installed Transformers build uses the validated fused expert layout. Qwen3-MoE classes appear in 4.51.3, but the tested 4.x builds do not match the validated AutoEP layout.

qwen3_5_moe

5.2.0

Requires the Qwen3.5 text-backbone qwen3_5_moe_text model type; for performance on Qwen3.5’s Gated DeltaNet layers, install optimized kernels. See the Hugging Face Transformers kernel loading docs and the Qwen FlashQLA blog.

deepseek_v2

5.0.0

load_balance_coeff / expert-bias auxiliary-loss-free load balancing is not currently supported; non-null values are rejected.

deepseek_v3

5.0.0

load_balance_coeff / expert-bias auxiliary-loss-free load balancing is not currently supported; non-null values are rejected.

ZeRO compatibility: Stages 0, 1, and 2. Stage 3 is not supported.

Usage:

{
    "expert_parallel": {
        "enabled": true,
        "autoep_size": 4,
        "preset_model": "mixtral"
    }
}

How it works:

  1. During deepspeed.initialize(), AutoEP scans the model for MoE layers using preset-defined patterns (router name, expert name, weight shapes).

  2. Detected MoE blocks are replaced with AutoEPMoELayer, which uses TorchTitan’s grouped GEMM kernels and AllToAll token dispatch.

  3. EP/EDP process groups are created automatically based on autoep_size.

  4. Expert parameters are marked for expert-data-parallel gradient reduction; router and shared-expert parameters use standard data-parallel reduction.

Constraints:

  • autoep_size must divide num_experts for all detected MoE layers.

  • autoep_size=1 is valid: all experts remain local (no AllToAll), useful for functional testing on a single GPU.

  • AutoEP currently cannot be combined with AutoTP (tensor_parallel.autotp_size > 1) or tensor model parallelism from mpu; support is planned as follow-up work.

  • AutoEP currently supports ZeRO stages 0, 1, and 2 only. ZeRO stage 3 and its partitioned-parameter get/set APIs are outside the scope of the current AutoEP support.

  • Checkpoint save/load requires matching autoep_size. To change autoep_size across runs for the same AutoEP-detected model topology, convert the checkpoint to Universal Checkpoint format and load it with checkpoint.load_universal; see the Universal Checkpointing tutorial for the detailed flow and constraints.

  • DeepSeek-V2 and DeepSeek-V3 AutoEP do not support load-balance expert bias yet. The built-in DeepSeek presets disable it by default; explicit non-null values fail.