Training Setup

Argument Parsing

DeepSpeed uses the argparse library to supply command-line configuration to the DeepSpeed runtime. Use deepspeed.add_config_arguments() to add DeepSpeed’s built-in arguments to your application’s parser.

import argparse

import deepspeed

parser = argparse.ArgumentParser(description='My training script.')
parser.add_argument('--local_rank', type=int, default=-1,
                    help='local rank passed from distributed launcher')
# Include DeepSpeed configuration arguments
parser = deepspeed.add_config_arguments(parser)
cmd_args = parser.parse_args()

deepspeed.add_config_arguments(parser)

Update the argument parser to enable parsing of DeepSpeed command line arguments.

The set of DeepSpeed arguments includes the following:

  • --deepspeed: boolean flag to enable DeepSpeed

  • --deepspeed_config <json file path>: path of a JSON configuration file to configure the DeepSpeed runtime.


Parameters: parser – argument parser

Returns: Updated Parser
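
As a rough illustration of how the two documented flags look after parsing, the sketch below registers them on a plain argparse parser. The helper add_stand_in_deepspeed_arguments is hypothetical and mimics only the two flags listed above so that the snippet runs without DeepSpeed installed; the real deepspeed.add_config_arguments() may register further arguments, and the file name ds_config.json is a placeholder.

```python
import argparse


def add_stand_in_deepspeed_arguments(parser):
    """Hypothetical stand-in that mimics only the two documented flags."""
    group = parser.add_argument_group("DeepSpeed (stand-in)")
    # Boolean switch: present on the command line means True.
    group.add_argument("--deepspeed", default=False, action="store_true",
                       help="enable DeepSpeed")
    # Path to the JSON file that configures the DeepSpeed runtime.
    group.add_argument("--deepspeed_config", default=None, type=str,
                       help="path of a JSON configuration file")
    return parser


parser = argparse.ArgumentParser(description="My training script.")
parser.add_argument("--local_rank", type=int, default=-1,
                    help="local rank passed from distributed launcher")
parser = add_stand_in_deepspeed_arguments(parser)

# Parse an explicit argument list instead of sys.argv for demonstration.
cmd_args = parser.parse_args(
    ["--deepspeed", "--deepspeed_config", "ds_config.json", "--local_rank", "0"]
)
print(cmd_args.deepspeed, cmd_args.deepspeed_config, cmd_args.local_rank)
```

The parsed namespace is the kind of object that the args parameter of deepspeed.initialize() expects, since it carries local_rank and deepspeed_config fields.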


Training Initialization

The entry point for all training with DeepSpeed is deepspeed.initialize(). It also initializes the distributed backend if it has not been initialized already.

Example usage:

model_engine, optimizer, _, _ = deepspeed.initialize(args=cmd_args,

deepspeed.initialize(args=None, model: Optional[Module] = None, optimizer: Optional[Union[Optimizer, Callable[[Union[Iterable[Parameter], Dict[str, Iterable]]], Optimizer]]] = None, model_parameters: Optional[Module] = None, training_data: Optional[Dataset] = None, lr_scheduler: Optional[Union[_LRScheduler, Callable[[Optimizer], _LRScheduler]]] = None, distributed_port: int = 29500, mpu=None, dist_init_required: Optional[bool] = None, collate_fn=None, config=None, config_params=None)

Initialize the DeepSpeed Engine.

Parameters:

  • args – an object containing local_rank and deepspeed_config fields. This is optional if config is passed.

  • model – Required: the nn.Module to train, before any wrappers are applied

  • optimizer – Optional: a user defined Optimizer or Callable that returns an Optimizer object. This overrides any optimizer definition in the DeepSpeed json config.

  • model_parameters – Optional: An iterable of torch.Tensors or dicts. Specifies what Tensors should be optimized.

  • training_data – Optional: Dataset of type torch.utils.data.Dataset

  • lr_scheduler – Optional: Learning Rate Scheduler Object or a Callable that takes an Optimizer and returns a Scheduler object. The scheduler object should define get_lr(), step(), state_dict(), and load_state_dict() methods.

  • distributed_port – Optional: Master node (rank 0)’s free port that needs to be used for communication during distributed training

  • mpu – Optional: A model parallelism unit object that implements get_{model,data}_parallel_{rank,group,world_size}()

  • dist_init_required – Optional: None will auto-initialize torch distributed if needed, otherwise the user can force it to be initialized or not via boolean.

  • collate_fn – Optional: Merges a list of samples to form a mini-batch of Tensor(s). Used when using batched loading from a map-style dataset.

  • config – Optional: Instead of requiring args.deepspeed_config, you can pass your DeepSpeed config as an argument, either as a path or as a dictionary.

  • config_params – Optional: Same as config, kept for backwards compatibility.
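
The lr_scheduler parameter above is described by the methods it must define rather than by a base class. A minimal sketch of an object with that shape follows; the class name ConstantWarmupScheduler, its warmup rule, and the FakeOptimizer stand-in are all illustrative assumptions and not DeepSpeed code.

```python
class FakeOptimizer:
    """Hypothetical stand-in exposing param_groups like a torch optimizer."""

    def __init__(self, lr):
        self.param_groups = [{"lr": lr}]


class ConstantWarmupScheduler:
    """Illustrative scheduler: linear warmup to base_lr, then constant."""

    def __init__(self, optimizer, base_lr, warmup_steps):
        self.optimizer = optimizer
        self.base_lr = base_lr
        self.warmup_steps = warmup_steps
        self.last_step = 0

    def get_lr(self):
        # One learning rate per parameter group.
        scale = min(1.0, self.last_step / self.warmup_steps)
        return [self.base_lr * scale for _ in self.optimizer.param_groups]

    def step(self):
        # Advance one step and write the new rate into the optimizer.
        self.last_step += 1
        for group, lr in zip(self.optimizer.param_groups, self.get_lr()):
            group["lr"] = lr

    def state_dict(self):
        # Only the step counter is needed to resume the schedule.
        return {"last_step": self.last_step}

    def load_state_dict(self, state):
        self.last_step = state["last_step"]


opt = FakeOptimizer(lr=0.0)
sched = ConstantWarmupScheduler(opt, base_lr=0.1, warmup_steps=4)
for _ in range(2):
    sched.step()
print(opt.param_groups[0]["lr"])
```

A callable such as lambda optimizer: ConstantWarmupScheduler(optimizer, 0.1, 4) would correspond to the Callable form that the signature also accepts.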


Returns: A tuple of engine, optimizer, training_dataloader, lr_scheduler

  • engine: DeepSpeed runtime engine which wraps the client model for distributed training.

  • optimizer: Wrapped optimizer if a user-defined optimizer is supplied or if an optimizer is specified in the JSON config; otherwise None.

  • training_dataloader: DeepSpeed dataloader if training_data was supplied, otherwise None.

  • lr_scheduler: Wrapped lr scheduler if a user lr_scheduler is passed or if an lr_scheduler is specified in the JSON configuration; otherwise None.
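
Because config may be passed either as a path or as a dictionary, and the returned optimizer and lr_scheduler depend on whether the configuration declares them, a small sketch of both forms may help. The keys used here (train_batch_size, optimizer, scheduler) follow DeepSpeed's commonly documented JSON schema rather than anything stated in this section, and the values and the file name are placeholders.

```python
import json
import os
import tempfile

# Dictionary form: could be passed directly as config=ds_config.
ds_config = {
    "train_batch_size": 16,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
    "scheduler": {
        "type": "WarmupLR",
        "params": {"warmup_min_lr": 0, "warmup_max_lr": 1e-3,
                   "warmup_num_steps": 100},
    },
}

# Path form: the same settings written to a JSON file, whose path could be
# passed as config=config_path or on the command line via --deepspeed_config.
config_path = os.path.join(tempfile.mkdtemp(), "ds_config.json")
with open(config_path, "w") as f:
    json.dump(ds_config, f, indent=2)

with open(config_path) as f:
    loaded = json.load(f)

# Both forms carry identical settings.
print(loaded == ds_config)
```

With a configuration like this one, the optimizer and lr_scheduler entries of the returned tuple would be expected to be populated even when neither is passed to deepspeed.initialize() explicitly.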

Distributed Initialization

Optional distributed backend initialization separate from deepspeed.initialize(). Useful in scenarios where the user wants to use torch distributed calls before calling deepspeed.initialize(), such as when using model parallelism, pipeline parallelism, or certain data loader scenarios.

deepspeed.init_distributed(dist_backend=None, auto_mpi_discovery=True, distributed_port=29500, verbose=True, timeout=datetime.timedelta(seconds=1800), init_method=None, dist_init_required=None, config=None, rank=-1, world_size=-1)

Initialize the distributed backend, performing MPI discovery if needed.

Parameters:

  • dist_backend – Optional (str). torch distributed backend, e.g., nccl, mpi, gloo, hccl

  • auto_mpi_discovery – Optional (bool). Whether to attempt MPI discovery when needed (True by default).

  • distributed_port – Optional (int). torch distributed backend port

  • verbose – Optional (bool). verbose logging

  • timeout – Optional (timedelta). Timeout for operations executed against the process group. Default value equals 30 minutes.

  • init_method – Optional (string). Torch distributed URL specifying how to initialize the process group. Default is “env://” if no init_method or store is specified.

  • config – Optional (dict). DeepSpeed configuration for setting up comms options (e.g. Comms profiling)

  • rank – Optional (int). The current manually specified rank. Some init_method like “tcp://” need the rank and world_size as well (see:

  • world_size – Optional (int). Desired world_size for the TCP or Shared file-system initialization.
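
To make the init_method, rank, and world_size parameters more concrete, the sketch below prepares the inputs for the two initialization styles mentioned above without calling into torch or DeepSpeed. The environment variable names (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE) are the ones conventionally read by torch.distributed for env:// initialization; the helper names and the 127.0.0.1 address are illustrative assumptions.

```python
import datetime


def env_init_variables(rank, world_size, addr="127.0.0.1", port=29500):
    """Environment variables conventionally read for env:// initialization."""
    return {
        "MASTER_ADDR": addr,
        "MASTER_PORT": str(port),  # environment values must be strings
        "RANK": str(rank),
        "WORLD_SIZE": str(world_size),
    }


def tcp_init_method(addr, port):
    """Build a tcp:// URL; rank and world_size are then passed explicitly."""
    return f"tcp://{addr}:{port}"


env = env_init_variables(rank=0, world_size=1)
url = tcp_init_method("127.0.0.1", 29500)

# The documented default timeout of 1800 seconds is 30 minutes.
default_timeout = datetime.timedelta(seconds=1800)
print(env["MASTER_PORT"], url, default_timeout)
```

In a real script these values would either be exported with os.environ.update(env) before calling deepspeed.init_distributed(), or passed as init_method=url together with explicit rank and world_size arguments.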