Model Checkpointing

DeepSpeed provides routines for checkpointing model state during training.

Loading Training Checkpoints

deepspeed.DeepSpeedEngine.load_checkpoint(self, load_dir, tag=None, load_module_strict=True, load_optimizer_states=True, load_lr_scheduler_states=True)

Load a training checkpoint.

  • load_dir – Required. Directory to load the checkpoint from
  • tag – Optional. Checkpoint tag used as a unique identifier for the checkpoint. If not provided, the tag named in the ‘latest’ file is loaded.
  • load_module_strict – Optional. Boolean to strictly enforce that the keys in state_dict of module and checkpoint match.
  • load_optimizer_states – Optional. Boolean to load the training optimizer states from the checkpoint, e.g. Adam’s momentum and variance.
  • load_lr_scheduler_states – Optional. Boolean to load the learning rate scheduler states from the checkpoint.

Returns a tuple of load_path and client_state:

  • load_path – Path of the loaded checkpoint. None if loading the checkpoint failed.
  • client_state – State dictionary used for loading required training states in the client code.
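
A minimal sketch of how the returned tuple might be consumed when resuming training. The helper name resume_step and the "step" key in client_state are assumptions made for illustration and are not part of DeepSpeed; engine stands for an already initialized DeepSpeedEngine.

```python
def resume_step(engine, load_dir, tag=None):
    """Load a checkpoint and return the step to resume from (0 if none was loaded)."""
    # load_checkpoint returns (load_path, client_state);
    # load_path is None when loading the checkpoint failed.
    load_path, client_state = engine.load_checkpoint(load_dir, tag=tag)
    if load_path is None:
        return 0
    # "step" is a hypothetical key that the client code stored at save time.
    return (client_state or {}).get("step", 0)
```

Checking load_path before touching client_state keeps a failed load from being mistaken for a successful one.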

Saving Training Checkpoints

deepspeed.DeepSpeedEngine.save_checkpoint(self, save_dir, tag=None, client_state={}, save_latest=True)

Save a training checkpoint.

  • save_dir – Required. Directory for saving the checkpoint
  • tag – Optional. Checkpoint tag used as a unique identifier for the checkpoint, global step is used if not provided. Tag name must be the same across all ranks.
  • client_state – Optional. State dictionary used for saving required training states in the client code.
  • save_latest – Optional. Save a file ‘latest’ pointing to the latest saved checkpoint.

Important: all processes must call this method, not just the process with rank 0. This is because each process needs to save its master weights and its scheduler and optimizer states. The method hangs waiting to synchronize with the other processes if it is called only on the process with rank 0.
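
The calling pattern this implies can be sketched as follows. The helper name save_if_due, its interval parameter, the "step{N}" tag format, and the "step" entry in client_state are all hypothetical choices for illustration; engine stands for an already initialized DeepSpeedEngine.

```python
def save_if_due(engine, save_dir, step, interval=1000):
    """Save a checkpoint every `interval` steps; call this on every rank."""
    if step % interval != 0:
        return False
    # Deliberately no `if rank == 0:` guard: every process must reach
    # save_checkpoint, otherwise the call waits on the missing ranks.
    # The tag is derived from the step so it is identical on all ranks.
    engine.save_checkpoint(save_dir, tag=f"step{step}", client_state={"step": step})
    return True
```

Deriving the tag from a value that every rank already agrees on, such as the step counter, is one simple way to satisfy the requirement that the tag match across ranks.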