ZeRO

The Zero Redundancy Optimizer (ZeRO) removes the memory redundancies across data-parallel processes by partitioning the three model states (optimizer states, gradients, and parameters) across data-parallel processes instead of replicating them. By doing this, it boosts memory efficiency compared to classic data-parallelism while retaining its computational granularity and communication efficiency.

  1. ZeRO Stage 1: The optimizer states (e.g., for the Adam optimizer, the 32-bit weights and the first and second moment estimates) are partitioned across the processes, so that each process updates only its partition.

  2. ZeRO Stage 2: The reduced 32-bit gradients for updating the model weights are also partitioned such that each process retains only the gradients corresponding to its portion of the optimizer states.

  3. ZeRO Stage 3: The 16-bit model parameters are partitioned across the processes. ZeRO-3 will automatically collect and partition them during the forward and backward passes.
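The effect of the three stages on per-GPU memory can be sketched with simple arithmetic. The byte counts below are an assumption taken from the mixed-precision Adam analysis in the ZeRO paper (2 bytes per parameter for 16-bit weights, 2 bytes for 16-bit gradients, and 12 bytes for the 32-bit weights plus the two moment estimates); the function name `zero_model_state_bytes` is hypothetical and is not part of DeepSpeed. Activations, buffers, and fragmentation are ignored.

```python
def zero_model_state_bytes(num_params: float, dp_degree: int, stage: int) -> float:
    """Approximate per-GPU bytes of model states for mixed-precision Adam.

    Assumed costs per parameter (from the ZeRO paper, not measured):
    2 bytes of 16-bit weights, 2 bytes of 16-bit gradients, and
    12 bytes of optimizer state (fp32 weights, first and second moments).
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    if dp_degree < 1:
        raise ValueError("dp_degree must be at least 1")

    params = 2.0 * num_params
    grads = 2.0 * num_params
    optim = 12.0 * num_params

    if stage >= 1:   # stage 1 partitions the optimizer states
        optim /= dp_degree
    if stage >= 2:   # stage 2 also partitions the gradients
        grads /= dp_degree
    if stage >= 3:   # stage 3 also partitions the parameters
        params /= dp_degree
    return params + grads + optim


# Illustrative setting: 7.5 billion parameters on 64 data-parallel GPUs.
for stage in range(4):
    gb = zero_model_state_bytes(7.5e9, 64, stage) / 1e9
    print(f"stage {stage}: {gb:.1f} GB of model states per GPU")
```

Under these assumptions the estimate drops from 120 GB with plain data parallelism to under 2 GB with stage 3, which is why stage 3 is the basis for the offloading features described next.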

In addition, ZeRO-3 includes the infinity offload engine to form ZeRO-Infinity ([paper](https://arxiv.org/abs/2104.07857)), which can offload all model states to both CPU and NVMe memory for huge memory savings.

For a deep dive into our algorithms, please see our papers on ZeRO, ZeRO-Offload, and ZeRO-Infinity.

Note

DeepSpeed first included offloading capabilities with ZeRO-Offload, a system for offloading optimizer and gradient states to CPU memory within ZeRO-2. ZeRO-Infinity is the next generation of offloading capabilities, accessible to ZeRO-3. ZeRO-Infinity has all of the savings of ZeRO-Offload, and can additionally offload the model weights while making more effective use of bandwidth and overlapping computation with communication.

Getting Started

If you are new to DeepSpeed, check out our Getting Started page.

Once you are training with DeepSpeed, enabling ZeRO-3 offload is as simple as enabling it in your DeepSpeed configuration! Below are a few examples of ZeRO-3 configurations. Please see our config guide for a complete list of options for configuration and performance tuning.

Note

ZeRO-Infinity and ZeRO-Offload work best with our heavily optimized deepspeed.ops.adam.DeepSpeedCPUAdam optimizer. We recommend using our optimizer config to instruct deepspeed.initialize() to build the optimizer for you.
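A minimal sketch of that recommendation: the optimizer is described in the configuration and `deepspeed.initialize()` is left to construct it. The helper name `build_engine` and the batch size value are illustrative assumptions; `deepspeed` is imported lazily so the configuration can be inspected on a machine without DeepSpeed installed.

```python
import json

# The "optimizer" section lets DeepSpeed construct the optimizer itself,
# instead of the caller passing a client optimizer to initialize().
ds_config = {
    "train_micro_batch_size_per_gpu": 1,  # illustrative value
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu"},
    },
    "fp16": {"enabled": True},
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 1e-3,
            "betas": [0.8, 0.999],
            "eps": 1e-8,
            "weight_decay": 3e-7,
        },
    },
}


def build_engine(model, config=ds_config):
    """Hypothetical helper: wrap `model` in a DeepSpeed engine.

    No `optimizer=` argument is passed, so the optimizer is built from
    the "optimizer" section of the configuration.
    """
    import deepspeed  # lazy import; requires a DeepSpeed installation

    engine, optimizer, _, _ = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config=config,
    )
    return engine, optimizer


print(json.dumps(ds_config, indent=2))
```

Calling `build_engine(model)` only makes sense inside a job started with a distributed launcher; the dictionary itself can be written to a JSON file and reused unchanged.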

ZeRO Configurations

All the settings for DeepSpeed ZeRO are set with the DeepSpeedZeroConfig. The dictionary provided under the zero_optimization entry of the main DeepSpeed configuration dict will be parsed and validated with this class. Sub-configurations for parameter offload and optimizer offload settings are parsed by DeepSpeedZeroOffloadParamConfig and DeepSpeedZeroOffloadOptimizerConfig.

class deepspeed.runtime.zero.config.DeepSpeedZeroConfig[source]

Sets parameters for ZeRO optimizations.

stage: ZeroStageEnum = 0

Chooses the stage of the ZeRO optimizer. Stages 0, 1, 2, and 3 refer to disabled, optimizer state partitioning, optimizer+gradient state partitioning, and optimizer+gradient+parameter partitioning, respectively.

contiguous_gradients: bool = True

Copies the gradients to a contiguous buffer as they are produced. Avoids memory fragmentation during backward pass.

reduce_scatter: bool = True

Uses reduce or reduce scatter instead of allreduce to average gradients.

reduce_bucket_size: int = 500,000,000

Number of elements reduced/allreduced at a time. Limits the memory required for the reduction for large model sizes.

Constraints
  • minimum = 0

allgather_partitions: bool = True

Chooses between an allgather collective or a series of broadcast collectives to gather updated parameters from all the GPUs at the end of each step.

allgather_bucket_size: int = 500,000,000

Number of elements allgathered at a time. Limits the memory required for the allgather for large model sizes.

Constraints
  • minimum = 0

overlap_comm: bool = None

Attempts to overlap the reduction of the gradients with backward computation.

load_from_fp32_weights: bool = True

Boolean indicating whether to initialize fp32 master weights from fp32 copies in checkpoint (no precision loss) or from model’s fp16 copies (with precision loss). This can be used to initialize optimizer state even when checkpoint is missing optimizer state.

elastic_checkpoint: bool = False

Enable loading checkpoint that was saved by job with different GPU count. No longer supported.

offload_param: Optional[DeepSpeedZeroOffloadParamConfig] = None

Enable offloading of model parameters to CPU or NVMe. This frees up GPU memory for larger models or batch sizes. Valid only with stage 3. Expects a dictionary containing values for DeepSpeedZeroOffloadParamConfig.

offload_optimizer: Optional[DeepSpeedZeroOffloadOptimizerConfig] = None

Enable offloading of optimizer state to CPU or NVMe, and optimizer computation to CPU. This frees up GPU memory for larger models or batch sizes. Valid for ZeRO stage 1, 2, 3. Expects a dictionary containing values for DeepSpeedZeroOffloadOptimizerConfig.

sub_group_size: int = 1,000,000,000

Tile size for parameter processing to fit massive models (with trillions of parameters). Used by ZeRO3-Offload and ZeRO-Infinity.

Constraints
  • minimum = 0

cpu_offload_param: bool = None

Deprecated, please use offload_param

cpu_offload_use_pin_memory: bool = None

Deprecated, please use offload_param or offload_optimizer

cpu_offload: bool = None

Deprecated, please use offload_optimizer

prefetch_bucket_size: int = 50,000,000 (alias 'stage3_prefetch_bucket_size')

Maximum number of parameter elements to fetch ahead of use. Used by ZeRO3, ZeRO3-Offload, ZeRO-Infinity, and ZeRO-Inference.

Constraints
  • minimum = 0

param_persistence_threshold: int = 100,000 (alias 'stage3_param_persistence_threshold')

Do not partition parameters smaller than this threshold. Smaller values use less memory, but can greatly increase communication (especially latency-bound messages).

Constraints
  • minimum = 0

model_persistence_threshold: int = sys.maxsize (alias 'stage3_model_persistence_threshold')

Maximum number of parameter elements that can be persisted in GPU and not partitioned. This imposes an upper bound on the number of unpartitioned parameters resulting from param_persistence_threshold setting. Used by ZeRO3-Offload, ZeRO-Infinity and ZeRO-Inference.

Constraints
  • minimum = 0

max_live_parameters: int = 1,000,000,000 (alias 'stage3_max_live_parameters')

The maximum number of parameters resident per GPU before releasing. Smaller values use less memory, but perform more communication.

Constraints
  • minimum = 0

max_reuse_distance: int = 1,000,000,000 (alias 'stage3_max_reuse_distance')

Do not release a parameter if it will be reused within this threshold of parameters. Smaller values use less memory, but perform more communication.

Constraints
  • minimum = 0

gather_16bit_weights_on_model_save: bool = False (alias 'stage3_gather_16bit_weights_on_model_save')

Consolidate the weights before saving the model by save_16bit_model(). Since the weights are partitioned across GPUs, they aren’t part of state_dict, so this function automatically gathers the weights when this option is enabled and then saves the fp16 model weights.

stage3_gather_fp16_weights_on_model_save: bool = False

Deprecated, please use gather_16bit_weights_on_model_save

ignore_unused_parameters: bool = True

Unused parameters in modules may be unexpected in static networks, but could be normal in dynamic networks. This controls whether or not training should terminate with an error message when unused parameters are detected. This is set to True by default, which means unused parameters are ignored and training continues. Currently this is only used in stage 2.

legacy_stage1: bool = False

For backward compatibility, enables the old ZeRO stage 1 implementation. Use at your own risk; it will be deprecated soon.

round_robin_gradients: bool = False

Stage 1 and 2 optimization for CPU offloading that parallelizes gradient copying to CPU memory among ranks by fine-grained gradient partitioning. Performance benefit grows with gradient accumulation steps (more copying between optimizer steps) or GPU count (increased parallelism).
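As a sketch of how the stage-3 fields above fit together, the dictionary below sets several of them through their documented `stage3_` aliases. The numeric values are the documented defaults; `overlap_comm` and `stage3_gather_16bit_weights_on_model_save` are switched on purely for illustration. The helper `scale_memory_knobs` is hypothetical and only demonstrates the documented trade-off that smaller values use less memory but perform more communication.

```python
import copy

zero3_tuning = {
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,               # illustrative choice
        "contiguous_gradients": True,
        "reduce_bucket_size": 500_000_000,
        "stage3_prefetch_bucket_size": 50_000_000,
        "stage3_param_persistence_threshold": 100_000,
        "stage3_max_live_parameters": 1_000_000_000,
        "stage3_max_reuse_distance": 1_000_000_000,
        "stage3_gather_16bit_weights_on_model_save": True,  # illustrative choice
    }
}

# Fields whose smaller values trade memory for extra communication.
MEMORY_KNOBS = (
    "stage3_prefetch_bucket_size",
    "stage3_max_live_parameters",
    "stage3_max_reuse_distance",
)


def scale_memory_knobs(config, factor):
    """Return a copy of `config` with the memory-related knobs scaled.

    Hypothetical helper: it clamps at 0 because each of these fields
    is documented with the constraint minimum = 0.
    """
    scaled = copy.deepcopy(config)
    zero = scaled["zero_optimization"]
    for key in MEMORY_KNOBS:
        zero[key] = max(0, int(zero[key] * factor))
    return scaled


low_memory = scale_memory_knobs(zero3_tuning, 0.5)
print(low_memory["zero_optimization"]["stage3_max_live_parameters"])
```

Whether halving these values helps depends on the model and interconnect, so treat the factor as a starting point for experiments rather than a recommendation.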

class deepspeed.runtime.zero.config.DeepSpeedZeroOffloadParamConfig[source]

Set options for parameter offload. Valid only with stage 3.

device: OffloadDeviceEnum = 'none'

Device memory to offload model parameters. Supported options are cpu and nvme.

nvme_path: Path = None

Filesystem path for NVMe device for parameter offloading.

buffer_count: int = 5

Number of buffers in buffer pool for parameter offloading to NVMe.

Constraints
  • minimum = 0

buffer_size: int = 100,000,000

Size of buffers in buffer pool for parameter offloading to NVMe.

Constraints
  • minimum = 0

max_in_cpu: int = 1,000,000,000

Number of parameter elements to maintain in CPU memory when offloading to NVMe is enabled.

Constraints
  • minimum = 0

pin_memory: bool = False

Offload to page-locked CPU memory. This could boost throughput at the cost of extra memory overhead.

class deepspeed.runtime.zero.config.DeepSpeedZeroOffloadOptimizerConfig[source]

Set options for optimizer offload. Valid with stage 1, 2, and 3.

device: OffloadDeviceEnum = 'none'

Device memory to offload optimizer state. Supported options are cpu and nvme. Optimizer computation is offloaded to CPU regardless of device option.

nvme_path: Path = None

Filesystem path for NVMe device for optimizer state offloading.

buffer_count: int = 4

Number of buffers in buffer pool for optimizer state offloading to NVMe. This should be at least the number of states maintained per parameter by the optimizer. For example, Adam optimizer has 4 states (parameter, gradient, momentum, and variance).

Constraints
  • minimum = 0

pin_memory: bool = False

Offload to page-locked CPU memory. This could boost throughput at the cost of extra memory overhead.

pipeline_read: bool = False

For tile-based optimizer step processing, overlap read of next tile with computation of current tile. Used in ZeRO-Infinity.

pipeline_write: bool = False

For tile-based optimizer step processing, overlap write of previous tile with computation of current tile.

fast_init: bool = False

Enable fast optimizer initialization when offloading to NVMe.
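The two offload sub-configurations share a few simple rules that can be checked before launching a job. The sketch below is a hypothetical pre-flight check, not DeepSpeed's own validation: it accepts the documented device values, enforces the documented minimum = 0 constraints, and additionally assumes that a device of `nvme` is only useful together with an `nvme_path`.

```python
VALID_DEVICES = {"none", "cpu", "nvme"}
NON_NEGATIVE_FIELDS = ("buffer_count", "buffer_size", "max_in_cpu")


def check_offload_section(section):
    """Return a list of problems found in an offload sub-configuration.

    Hypothetical helper. The nvme_path rule is an assumption made for
    this sketch; the other rules mirror the documented constraints.
    """
    problems = []
    device = section.get("device", "none")
    if device not in VALID_DEVICES:
        problems.append(f"unknown device {device!r}")
    if device == "nvme" and not section.get("nvme_path"):
        problems.append("device 'nvme' needs an nvme_path")
    for field in NON_NEGATIVE_FIELDS:
        value = section.get(field)
        if value is not None and value < 0:
            problems.append(f"{field} must be >= 0")
    return problems


offload_optimizer = {
    "device": "nvme",
    "nvme_path": "/nvme_data",
    "buffer_count": 4,        # Adam keeps 4 states per parameter
    "pin_memory": True,
    "pipeline_read": True,
    "pipeline_write": True,
}

print(check_offload_section(offload_optimizer))
print(check_offload_section({"device": "nvme", "buffer_count": -1}))
```

An empty list means the section passed these checks; DeepSpeed still performs its own parsing and validation when the configuration is loaded.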

Example ZeRO-3 Configurations

  1. Use ZeRO to partition the optimizer states (stage 1), gradients (stage 2), and parameters (stage 3).

    {
        "zero_optimization": {
            "stage": 3
        },
        "fp16": {
            "enabled": true
        },
        "optimizer": {
            "type": "AdamW",
            "params": {
                "lr": 0.001,
                "betas": [
                    0.8,
                    0.999
                ],
                "eps": 1e-8,
                "weight_decay": 3e-7
            }
        },
        ...
    }
    
  2. Additionally offload the optimizer states and computations to the CPU with ZeRO-Infinity.

    {
        "zero_optimization": {
            "stage": 3,
            "offload_optimizer": {
                "device": "cpu"
            }
        },
        ...
    }
    
  3. Save even more memory by offloading parameters to the CPU memory.

    {
        "zero_optimization": {
            "stage": 3,
            "offload_optimizer": {
                "device": "cpu"
            },
            "offload_param": {
                "device": "cpu"
            }
        },
        ...
    }
    
  4. Save even MORE memory by offloading to NVMe (if available on your system):

    {
        "zero_optimization": {
            "stage": 3,
            "offload_optimizer": {
                "device": "nvme",
                "nvme_path": "/nvme_data"
            },
            "offload_param": {
                "device": "nvme",
                "nvme_path": "/nvme_data"
            }
        },
        ...
    }
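The four configurations differ only in which offload sections are present, so they can also be generated programmatically. The following sketch builds the `zero_optimization` part of each one as a Python dictionary and serializes it with `json.dumps`, which always produces syntactically valid JSON. The function name `make_zero3_config` is hypothetical, and the elided `...` entries of the examples are left out.

```python
import json


def make_zero3_config(optimizer_device=None, param_device=None, nvme_path=None):
    """Build the zero_optimization section for a ZeRO-3 configuration.

    Hypothetical helper: a device of None omits the offload section,
    and nvme_path is attached only to sections that offload to NVMe.
    """
    zero = {"stage": 3}
    sections = (
        ("offload_optimizer", optimizer_device),
        ("offload_param", param_device),
    )
    for key, device in sections:
        if device is None:
            continue
        section = {"device": device}
        if device == "nvme":
            section["nvme_path"] = nvme_path
        zero[key] = section
    return {"zero_optimization": zero}


examples = [
    make_zero3_config(),                                        # 1. ZeRO-3 only
    make_zero3_config("cpu"),                                   # 2. optimizer on CPU
    make_zero3_config("cpu", "cpu"),                            # 3. parameters on CPU too
    make_zero3_config("nvme", "nvme", nvme_path="/nvme_data"),  # 4. both on NVMe
]

for config in examples:
    print(json.dumps(config))
```

The remaining sections, such as `fp16` and `optimizer`, can be merged into the returned dictionary before it is passed to DeepSpeed or written to a file.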
    

Assumptions

DeepSpeed automatically coordinates the collection (i.e., all-gather), partitioning (i.e., scatter), and offloading of parameters at the granularity of (sub)module forward() methods. The backward pass is handled similarly. This strategy has two underlying assumptions:

  1. The forward and backward passes of submodules must individually fit in device memory. If this is not the case, deepspeed.zero.TiledLinear implements memory-centric tiling and works with ZeRO-3 to break linear layers into a sequence of smaller submodules that can fit in memory.

  2. A module’s parameters are only accessed within its own __init__ and forward() methods. Otherwise, DeepSpeed must be instructed to collect and re-partition the parameter. See Manual Parameter Coordination for manually coordinating parameters.
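The coordination described above can be pictured with a toy model in plain Python. This is only an illustration of the gather, compute, and re-partition pattern around a module's forward pass; the classes `ShardedParam` and `ToyLinear` are hypothetical, the "ranks" are simulated with lists, and nothing here reflects DeepSpeed's actual implementation.

```python
class ShardedParam:
    """A flat parameter split across `world_size` simulated ranks."""

    def __init__(self, values, world_size):
        shard = -(-len(values) // world_size)  # ceiling division
        self.shards = [values[i * shard:(i + 1) * shard] for i in range(world_size)]
        self.full = None  # populated only while the parameter is gathered

    def gather(self):
        # Stand-in for an all-gather: concatenate every rank's shard.
        self.full = [v for shard in self.shards for v in shard]

    def release(self):
        # Stand-in for re-partitioning: drop the full copy again.
        self.full = None


class ToyLinear:
    """A dot-product 'layer' whose weight is only whole during forward."""

    def __init__(self, weights, world_size):
        self.weight = ShardedParam(weights, world_size)

    def forward(self, x):
        return sum(w * xi for w, xi in zip(self.weight.full, x))

    def __call__(self, x):
        self.weight.gather()        # before forward: collect the parameter
        try:
            return self.forward(x)
        finally:
            self.weight.release()   # after forward: partition it again


layer = ToyLinear([1.0, 2.0, 3.0, 4.0], world_size=2)
print(layer([1, 1, 1, 1]))   # the weight is complete inside forward
print(layer.weight.full)     # outside forward only the shards remain
```

The second print shows why the second assumption matters: code that reads the weight outside the module's forward pass sees no gathered parameter unless it asks for one explicitly.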

Constructing Massive Models

ZeRO-3 enables massive models whose parameters exceed the size of individual nodes in a system. For the typical case of training without model parallelism, you can simply allocate your model in our context:

with deepspeed.zero.Init():
    model = MyLargeModel()

class deepspeed.zero.Init(module=None, data_parallel_group=None, mem_efficient_linear=True, remote_device=None, pin_memory=False, config_dict_or_path=None, config=None, enabled=True, dtype=None, mpu=None)
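A slightly fuller sketch of the same pattern passes the DeepSpeed configuration through the `config_dict_or_path` argument listed in the signature. The helper name `build_partitioned_model`, the layer sizes, and the batch size are illustrative assumptions; the imports are deferred so the snippet can be read and loaded without PyTorch or DeepSpeed, but the function itself must run inside a job started with a distributed launcher.

```python
ds_config = {
    "train_micro_batch_size_per_gpu": 1,  # illustrative value
    "zero_optimization": {"stage": 3},
    "fp16": {"enabled": True},
}


def build_partitioned_model(hidden_size=1024, num_layers=4, config=ds_config):
    """Hypothetical helper: allocate a model under deepspeed.zero.Init.

    Modules constructed inside the context have their parameters
    partitioned across the data-parallel ranks as they are created,
    so the full model never has to fit on a single device.
    """
    import torch
    import deepspeed

    with deepspeed.zero.Init(config_dict_or_path=config):
        layers = [torch.nn.Linear(hidden_size, hidden_size) for _ in range(num_layers)]
        model = torch.nn.Sequential(*layers)
    return model
```

The returned model would then be handed to `deepspeed.initialize()` with the same configuration, so that construction and training agree on the ZeRO settings.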

Manual Parameter Coordination

Most models require no modification to be trained with ZeRO-3. However, in some cases one may need to access model weights outside of the training loop, or to share weights across submodules during training. DeepSpeed has several mechanisms to coordinate partitioned weights for ZeRO-3.

Gathering Parameters

DeepSpeed provides mechanisms for collecting (or gathering) a partitioned parameter.

Some models partitioned with deepspeed.zero.Init may need to access a module’s weights outside of the class constructor or its forward() method. We refer to these weights as external parameters, since these parameters are accessed outside of the module that created them. To do so, use deepspeed.zero.GatheredParameters or deepspeed.zero.register_external_parameter().

class deepspeed.zero.GatheredParameters(params, modifier_rank=None, fwd_module=None, enabled=True)
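The context manager can be used both to read and to modify a partitioned parameter. The sketch below assumes `layer` is a module that was built under `deepspeed.zero.Init` and that the process group is already initialized; the helper names `zero_weight_on_rank0` and `gathered_weight_norm` are hypothetical. Imports are deferred so the snippet loads without PyTorch or DeepSpeed installed.

```python
def zero_weight_on_rank0(layer):
    """Hypothetical helper: overwrite a partitioned weight from rank 0.

    With modifier_rank=0, the changes made by rank 0 inside the context
    are the ones that become visible on all ranks when it exits.
    """
    import torch
    import deepspeed

    with deepspeed.zero.GatheredParameters(layer.weight, modifier_rank=0):
        if torch.distributed.get_rank() == 0:
            with torch.no_grad():
                layer.weight.zero_()


def gathered_weight_norm(layer):
    """Hypothetical helper: read a value that needs the whole weight.

    modifier_rank is left at its default of None because the parameter
    is only read, not modified.
    """
    import deepspeed

    with deepspeed.zero.GatheredParameters(layer.weight):
        return layer.weight.norm().item()
```

Both helpers are collective in spirit: every rank should enter the context, even if only one of them modifies the parameter.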

Registering External Parameters

ZeRO-3 will automatically collect and partition the model parameters as they are needed during the forward and backward passes. However, in some cases a parameter may be used outside of its module’s forward pass. We call these external parameters. ZeRO-3 can coordinate these parameters if they are registered either automatically or manually.

Note

DeepSpeed version 0.3.15 includes automatic external parameter discovery and registration to support the most common cases. Parameters can still be manually registered if they cannot be automatically detected.

DeepSpeed can automatically detect the following external parameter scenarios:

  1. Parameter access: consider the following pattern common in language models such as GPT:

    The tensor embeddings.weight is used in both embeddings.forward() and compute_logits(). We call embeddings.weight an external parameter because it is used in the training loop outside of its owning module’s forward pass.

    class LanguageModel(torch.nn.Module):
        ...
        def forward(self, inputs):
            embeds = self.embeddings(inputs)
            ...
            logits = compute_logits(output, self.embeddings.weight)
            ...
    
  2. Returning a parameter:

    CustomLinear returns both an output and its own bias parameter. DeepSpeed will detect the external bias parameter and register it with submodules that use CustomLinear.

    class CustomLinear(torch.nn.Linear):
        def forward(self, *input):
            output = super().forward(*input)
            return output, self.bias
    
deepspeed.zero.register_external_parameter(module, parameter)

Instruct DeepSpeed to coordinate parameter’s collection and partitioning in the forward and backward passes of module.

This is used when a parameter is accessed outside of its owning module’s forward(). DeepSpeed must know to collect it from its partitioned state and when to release the memory.

Note

This is only applicable to training with ZeRO stage 3.

Parameters
  • module (torch.nn.Module) – The module that requires parameter in its forward pass.

  • parameter (torch.nn.Parameter) – The parameter to register.

Raises

RuntimeError – If parameter is not of type torch.nn.Parameter.

Examples

  1. Register a weight that is used in another module’s forward pass (line 6). Parameter layer1.weight is used by layer2 (line 11).

     1  class ModuleZ3(torch.nn.Module):
     2      def __init__(self, *args):
     3          super().__init__()
     4          self.layer1 = SomeLayer()
     5          self.layer2 = OtherLayer()
     6          deepspeed.zero.register_external_parameter(self, self.layer1.weight)
     7
     8      def forward(self, input):
     9          x = self.layer1(input)
    10          # self.layer1.weight is required by self.layer2.forward
    11          y = self.layer2(x, self.layer1.weight)
    12          return y
    

Memory-Centric Tiling

To reduce the working memory requirements of DL training for large models, ZeRO-Infinity includes a technique called memory-centric tiling. It exploits the data fetch and release pattern of ZeRO-3 by breaking a large operator into smaller tiles that are executed sequentially. When combined with ZeRO-3, the parameters and gradients of each tile can be fetched and released one at a time, reducing the working memory in proportion to the number of tiles. ZeRO-Infinity can therefore support operators of arbitrary size without refactoring them for model parallelism to fit in limited GPU memory.
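The arithmetic behind tiling a linear layer can be checked in a few lines of plain Python. This sketch only shows that computing a linear layer tile by tile gives the same result as computing it at once; it is not DeepSpeed's implementation. The parameter names `in_splits` and `out_splits` mirror those of `deepspeed.zero.TiledLinear`, while the functions `linear`, `split_ranges`, and `tiled_linear` are hypothetical.

```python
def linear(weight, bias, x):
    """y = W x + b, with the weight given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def split_ranges(size, parts):
    """Split range(size) into `parts` contiguous, near-equal (start, stop) pairs."""
    base, extra = divmod(size, parts)
    ranges, start = [], 0
    for i in range(parts):
        stop = start + base + (1 if i < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def tiled_linear(weight, bias, x, in_splits=1, out_splits=1):
    """Compute the same result as linear(), one weight tile at a time."""
    y = []
    for o_start, o_stop in split_ranges(len(weight), out_splits):
        partial = [0.0] * (o_stop - o_start)
        for i_start, i_stop in split_ranges(len(x), in_splits):
            # Only this tile of the weight has to be resident right now.
            tile = [row[i_start:i_stop] for row in weight[o_start:o_stop]]
            x_tile = x[i_start:i_stop]
            for r, row in enumerate(tile):
                partial[r] += sum(w * xi for w, xi in zip(row, x_tile))
        # Partial sums over the input splits are complete; add the bias.
        y.extend(p + b for p, b in zip(partial, bias[o_start:o_stop]))
    return y


weight = [[r * 6 + c for c in range(6)] for r in range(4)]  # 4 outputs, 6 inputs
bias = [1, 2, 3, 4]
x = [1, 0, 2, 0, 3, 0]

print(linear(weight, bias, x))
print(tiled_linear(weight, bias, x, in_splits=3, out_splits=2))
```

With `in_splits=3` and `out_splits=2`, each tile holds one sixth of the weight, which is the sense in which working memory shrinks with the number of tiles.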

class deepspeed.zero.TiledLinear(in_features, out_features, bias=True, in_splits=1, out_splits=1, input_is_already_split=False, combine_out_splits=True, linear_cls=<class 'torch.nn.modules.linear.Linear'>, init_linear=None, **kwargs)

forward(input_)

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

copy_params_from(other)

Copy the weight and bias data from other.

This is especially useful for reproducible initialization and testing.

Equivalent to:

with torch.no_grad():
    self.weight.copy_(other.weight)
    if self.bias is not None:
        self.bias.copy_(other.bias)

Note

If ZeRO-3 is enabled, this is a collective operation and the updated parameters of data-parallel rank 0 will be visible on all ranks. See deepspeed.zero.GatheredParameters for more information.

Parameters

other (torch.nn.Linear) – the linear layer to copy from.