Optimizers

DeepSpeed offers high-performance implementations of the Adam optimizer on CPU, and of the FusedAdam, FusedLamb, OnebitAdam, and OnebitLamb optimizers on GPU.

Adam (CPU)

class deepspeed.ops.adam.DeepSpeedCPUAdam(model_params, lr=0.001, bias_correction=True, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False, adamw_mode=True, fp32_optimizer_states=True)[source]

Fast vectorized implementation of two variations of the Adam optimizer on CPU: Adam (https://arxiv.org/abs/1412.6980) and AdamW with decoupled weight decay (https://arxiv.org/abs/1711.05101).

DeepSpeed CPU Adam(W) provides a 5x to 7x speedup over torch.optim.Adam(W). To use this optimizer, the model's master parameters (in FP32) must reside in CPU memory.

To train on a heterogeneous system, such as one coordinating CPU and GPU, DeepSpeed offers the ZeRO-Offload technology, which efficiently offloads the optimizer states into CPU memory with minimal impact on training throughput. DeepSpeedCPUAdam plays an important role in minimizing the latency overhead of running the optimizer on CPU. Please refer to the ZeRO-Offload tutorial (https://www.deepspeed.ai/tutorials/zero-offload/) for more information on how to enable this technology.

When calling the step function, there are two options: (1) update the optimizer's states only, or (2) update the optimizer's states and copy the parameters back to the GPU at the same time. We have seen that the second option can bring 30% higher throughput than doing the copy separately as in option one; see the sketch below.
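
Below is a minimal sketch of the two calling patterns. It assumes DeepSpeed's CPU Adam extension can be built and loaded on your system; the fp16_param_groups argument name for option (2) is an assumption based on the description above, so check the step() signature of your DeepSpeed version.

    import torch
    from deepspeed.ops.adam import DeepSpeedCPUAdam

    model = torch.nn.Linear(1024, 1024)  # FP32 master weights kept on CPU
    optimizer = DeepSpeedCPUAdam(model.parameters(), lr=1e-3, adamw_mode=True)

    loss = model(torch.randn(8, 1024)).sum()
    loss.backward()

    # Option (1): update the optimizer's states only.
    optimizer.step()

    # Option (2): update the states and copy the parameters back to the GPU
    # in the same call (argument name assumed; see lead-in above).
    # optimizer.step(fp16_param_groups=my_fp16_param_groups)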

Note

We recommend using our config to allow deepspeed.initialize() to build this optimizer for you.

Parameters:
  • model_params (iterable) – iterable of parameters to optimize or dicts defining parameter groups.
  • lr (float, optional) – learning rate. (default: 1e-3)
  • betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square. (default: (0.9, 0.999))
  • eps (float, optional) – term added to the denominator to improve numerical stability. (default: 1e-8)
  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
  • amsgrad (boolean, optional) – whether to use the AMSGrad variant of this algorithm from the paper On the Convergence of Adam and Beyond (default: False) NOT SUPPORTED in DeepSpeed CPUAdam!
  • adamw_mode – select between Adam and AdamW implementations (default: AdamW)
  • fp32_optimizer_states – creates momentum and variance in full precision regardless of the precision of the parameters (default: True)
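
As the note above recommends, the usual way to obtain this optimizer is through the DeepSpeed config rather than direct construction. The following is a minimal sketch; the model and hyperparameter values are placeholders, and with optimizer offload to CPU enabled, deepspeed.initialize() is expected to build DeepSpeedCPUAdam under the hood.

    import torch
    import deepspeed

    model = torch.nn.Linear(1024, 1024)

    ds_config = {
        "train_batch_size": 8,
        "optimizer": {
            "type": "Adam",
            "params": {"lr": 1e-3, "betas": [0.9, 0.999], "eps": 1e-8, "weight_decay": 0}
        },
        "fp16": {"enabled": True},
        "zero_optimization": {
            "stage": 2,
            "offload_optimizer": {"device": "cpu"}  # keep optimizer states in CPU memory
        }
    }

    # deepspeed.initialize builds the optimizer described in the config.
    model_engine, optimizer, _, _ = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config=ds_config,
    )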

FusedAdam (GPU)

class deepspeed.ops.adam.FusedAdam(params, lr=0.001, bias_correction=True, betas=(0.9, 0.999), eps=1e-08, adam_w_mode=True, weight_decay=0.0, amsgrad=False, set_grad_none=True)[source]

Implements Adam algorithm.

Currently GPU-only.

This version of fused Adam implements two fusions:

  • Fusion of the Adam update’s elementwise operations
  • A multi-tensor apply launch that batches the elementwise updates applied to all the model’s parameters into one or a few kernel launches.

Adam was proposed in Adam: A Method for Stochastic Optimization (https://arxiv.org/abs/1412.6980).

Parameters:
  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups.
  • lr (float, optional) – learning rate. (default: 1e-3)
  • betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square. (default: (0.9, 0.999))
  • eps (float, optional) – term added to the denominator to improve numerical stability. (default: 1e-8)
  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
  • amsgrad (boolean, optional) – whether to use the AMSGrad variant of this algorithm from the paper On the Convergence of Adam and Beyond (default: False) NOT SUPPORTED in FusedAdam!
  • adam_w_mode (boolean, optional) – whether to apply decoupled weight decay (also known as AdamW) instead of L2 regularization; True selects AdamW. (default: True)
  • set_grad_none (bool, optional) – whether to set gradients to None when the zero_grad() method is called. (default: True)
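
FusedAdam can also be constructed directly and used like any other torch optimizer on CUDA parameters (or passed to deepspeed.initialize()). The sketch below assumes a CUDA device and that DeepSpeed's fused Adam op can be built and loaded; the values are placeholders.

    import torch
    from deepspeed.ops.adam import FusedAdam

    model = torch.nn.Linear(1024, 1024).cuda()

    # adam_w_mode=True selects decoupled weight decay (AdamW behavior).
    optimizer = FusedAdam(model.parameters(), lr=1e-3, adam_w_mode=True, weight_decay=0.01)

    loss = model(torch.randn(8, 1024, device="cuda")).sum()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()  # with set_grad_none=True (the default), grads become None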

FusedLamb (GPU)

class deepspeed.ops.lamb.FusedLamb(params, lr=0.001, bias_correction=True, betas=(0.9, 0.999), eps=1e-08, eps_inside_sqrt=False, weight_decay=0.0, max_grad_norm=0.0, max_coeff=10.0, min_coeff=0.01, amsgrad=False)[source]

Implements the LAMB algorithm. Currently GPU-only.

LAMB was proposed in Large Batch Optimization for Deep Learning: Training BERT in 76 minutes (https://arxiv.org/abs/1904.00962).

Parameters:
  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups.
  • lr (float, optional) – learning rate. (default: 1e-3)
  • bias_correction (bool, optional) – bias correction (default: True)
  • betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square. (default: (0.9, 0.999))
  • eps (float, optional) – term added to the denominator to improve numerical stability. (default: 1e-8)
  • eps_inside_sqrt (boolean, optional) – in the ‘update parameters’ step, adds eps to the bias-corrected second moment estimate before evaluating square root instead of adding it to the square root of second moment estimate as in the original paper. (default: False)
  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
  • max_grad_norm (float, optional) – value used to clip global grad norm (default: 0.0)
  • max_coeff (float, optional) – maximum value of the lamb coefficient (default: 10.0)
  • min_coeff (float, optional) – minimum value of the lamb coefficient (default: 0.01)
  • amsgrad (boolean, optional) – NOT SUPPORTED in FusedLamb!
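
FusedLamb is typically selected through the DeepSpeed config. The fragment below is a sketch that assumes the config optimizer type "Lamb" maps to this class; the hyperparameter values are placeholders.

    # DeepSpeed config fragment selecting the LAMB optimizer.
    ds_config = {
        "train_batch_size": 8,
        "optimizer": {
            "type": "Lamb",
            "params": {
                "lr": 1e-3,
                "weight_decay": 0.01,
                "max_coeff": 10.0,   # upper bound on the LAMB trust-ratio coefficient
                "min_coeff": 0.01    # lower bound on the LAMB trust-ratio coefficient
            }
        }
    }
    # Pass ds_config to deepspeed.initialize() as in the CPU Adam sketch above.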

OnebitAdam (GPU)

class deepspeed.runtime.fp16.onebit.adam.OnebitAdam(params, deepspeed=None, lr=0.001, freeze_step=100000, bias_correction=True, betas=(0.9, 0.999), eps=1e-08, eps_inside_sqrt=False, weight_decay=0.0, max_grad_norm=0.0, amsgrad=False, cuda_aware=False, comm_backend_name='nccl')[source]

Implements the 1-bit Adam algorithm. Currently GPU-only. For a usage example, please see https://www.deepspeed.ai/tutorials/onebit-adam/. For technical details, please read https://arxiv.org/abs/2102.02888.

Parameters:
  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups.
  • lr (float, optional) – learning rate. (default: 1e-3)
  • freeze_step (int, optional) – Number of steps for the warmup (uncompressed) stage before we start using compressed communication. (default: 100000)
  • betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square. (default: (0.9, 0.999))
  • eps (float, optional) – term added to the denominator to improve numerical stability. (default: 1e-8)
  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
  • amsgrad (boolean, optional) – whether to use the AMSGrad variant of this algorithm from the paper On the Convergence of Adam and Beyond (default: False) NOT SUPPORTED in 1-bit Adam!
  • eps_inside_sqrt (boolean, optional) – in the ‘update parameters’ step, adds eps to the bias-corrected second moment estimate before evaluating square root instead of adding it to the square root of second moment estimate as in the original paper. (default: False)
  • cuda_aware (boolean, optional) – Set True if the underlying MPI implementation supports CUDA-Aware communication. (default: False)
  • comm_backend_name (string, optional) – Set to ‘mpi’ if needed. (default: ‘nccl’)
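
1-bit Adam is normally enabled through the DeepSpeed config, as shown in the tutorial linked above. The fragment below is a sketch assuming the config optimizer type "OneBitAdam"; the values are placeholders.

    ds_config = {
        "train_batch_size": 8,
        "optimizer": {
            "type": "OneBitAdam",
            "params": {
                "lr": 1e-3,
                "freeze_step": 100000,         # warmup steps with uncompressed communication
                "cuda_aware": False,
                "comm_backend_name": "nccl"
            }
        },
        "fp16": {"enabled": True}
    }
    # Pass ds_config to deepspeed.initialize() together with the model parameters.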

OnebitLamb (GPU)

class deepspeed.runtime.fp16.onebit.lamb.OnebitLamb(params, deepspeed=None, lr=0.001, freeze_step=100000, bias_correction=True, betas=(0.9, 0.999), eps=1e-08, eps_inside_sqrt=False, weight_decay=0.0, max_grad_norm=0.0, max_coeff=10.0, min_coeff=0.01, amsgrad=False, cuda_aware=False, comm_backend_name='nccl', coeff_beta=0.9, factor_max=4.0, factor_min=0.5, factor_threshold=0.1)[source]

Implements the 1-bit Lamb algorithm. Currently GPU-only. For a usage example, please see https://www.deepspeed.ai/tutorials/onebit-lamb/. For technical details, please see our paper https://arxiv.org/abs/2104.06069.

Parameters:
  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups.
  • lr (float, optional) – learning rate. (default: 1e-3)
  • freeze_step (int, optional) – Number of steps for the warmup (uncompressed) stage before we start using compressed communication. (default: 100000)
  • betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square. (default: (0.9, 0.999))
  • eps (float, optional) – term added to the denominator to improve numerical stability. (default: 1e-8)
  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
  • max_coeff (float, optional) – maximum value of the lamb coefficient (default: 10.0)
  • min_coeff (float, optional) – minimum value of the lamb coefficient (default: 0.01)
  • amsgrad (boolean, optional) – whether to use the AMSGrad variant of this algorithm from the paper On the Convergence of Adam and Beyond (default: False) NOT SUPPORTED in 1-bit Lamb!
  • eps_inside_sqrt (boolean, optional) – in the ‘update parameters’ step, adds eps to the bias-corrected second moment estimate before evaluating square root instead of adding it to the square root of second moment estimate as in the original paper. (default: False)
  • cuda_aware (boolean, optional) – Set True if the underlying MPI implementation supports CUDA-Aware communication. (default: False)
  • comm_backend_name (string, optional) – Set to ‘mpi’ if needed. (default: ‘nccl’)
  • coeff_beta (float, optional) – coefficient used for computing running averages of the lamb coefficient (default: 0.9). Note that you may want to increase or decrease this beta depending on the freeze_step you choose, as 1/(1 - coeff_beta) should be smaller than or equal to freeze_step.
  • factor_max (float, optional) – maximum value of scaling factor to the frozen lamb coefficient during compression stage (default: 4.0)
  • factor_min (float, optional) – minimum value of scaling factor to the frozen lamb coefficient during compression stage (default: 0.5)
  • factor_threshold (float, optional) – threshold of how much the scaling factor can fluctuate between steps (default: 0.1)
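
Like 1-bit Adam, 1-bit Lamb is normally enabled through the DeepSpeed config, as shown in the tutorial linked above. The fragment below is a sketch assuming the config optimizer type "OneBitLamb"; the values are placeholders.

    ds_config = {
        "train_batch_size": 8,
        "optimizer": {
            "type": "OneBitLamb",
            "params": {
                "lr": 1e-3,
                "freeze_step": 100000,
                "cuda_aware": False,
                "comm_backend_name": "nccl",
                "coeff_beta": 0.9,
                "factor_max": 4.0,
                "factor_min": 0.5,
                "factor_threshold": 0.1
            }
        },
        "fp16": {"enabled": True}
    }
    # Pass ds_config to deepspeed.initialize() together with the model parameters.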