Optimizers

DeepSpeed offers high-performance implementations of the Adam optimizer on CPU, and of the FusedAdam, FusedLamb, and OneBitAdam optimizers on GPU.

Adam (CPU)

class deepspeed.ops.adam.DeepSpeedCPUAdam(model_params, lr=0.001, bias_correction=True, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False, adamw_mode=True)[source]

Fast vectorized implementation of two variations of the Adam optimizer on CPU: Adam and AdamW.

DeepSpeed CPU Adam(W) provides a 5x to 7x speedup over torch.optim.Adam(W). To use this optimizer, the model's master parameters (in FP32) must reside in CPU memory.

To train on a heterogeneous system, such as coordinating CPU and GPU, DeepSpeed offers the ZeRO-Offload technology, which efficiently offloads the optimizer states into CPU memory with minimal impact on training throughput. DeepSpeedCPUAdam plays an important role in minimizing the optimizer's latency overhead on the CPU. Please refer to the ZeRO-Offload tutorial (https://www.deepspeed.ai/tutorials/zero-offload/) for more information on how to enable this technology.

When calling the step function, there are two options: (1) update the optimizer's states, or (2) update the optimizer's states and copy the parameters back to the GPU at the same time. We have seen that the second option can bring 30% higher throughput than doing the copy separately with option one.
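
A minimal usage sketch of the two step options (the toy model, hyperparameters, and the fp16_param_groups argument shown in the comment are assumptions; check the step() signature of your DeepSpeed version):

    import torch
    from deepspeed.ops.adam import DeepSpeedCPUAdam

    # Master (FP32) parameters stay in CPU memory for this optimizer.
    model = torch.nn.Linear(1024, 1024)                 # CPU-resident parameters
    optimizer = DeepSpeedCPUAdam(model.parameters(), lr=1e-3, weight_decay=0.01)

    loss = model(torch.randn(8, 1024)).sum()
    loss.backward()

    # Option (1): update the optimizer states only.
    optimizer.step()

    # Option (2): update the states and copy the updated parameters back to an
    # FP16 copy on the GPU in the same call (argument name assumed; verify it
    # against your DeepSpeed release):
    # optimizer.step(fp16_param_groups=gpu_fp16_param_groups)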

Note

We recommend using our config to allow deepspeed.initialize() to build this optimizer for you.
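
For example, a configuration along the following lines lets deepspeed.initialize() construct the optimizer together with ZeRO-Offload (a sketch following the public DeepSpeed config schema; the batch size, hyperparameter values, and the model object are placeholders to adapt to your setup):

    import deepspeed

    ds_config = {
        "train_batch_size": 8,
        "optimizer": {
            "type": "Adam",
            "params": {"lr": 1e-3, "betas": [0.9, 0.999], "eps": 1e-8, "weight_decay": 0.01}
        },
        # Offloading the optimizer states to CPU makes DeepSpeed use the CPU Adam(W) kernels.
        "zero_optimization": {
            "stage": 2,
            "offload_optimizer": {"device": "cpu"}
        }
    }

    model_engine, optimizer, _, _ = deepspeed.initialize(
        model=model,                          # your torch.nn.Module
        model_parameters=model.parameters(),
        config=ds_config,
    )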

Parameters:
  • model_params (iterable) – iterable of parameters to optimize or dicts defining parameter groups.
  • lr (float, optional) – learning rate. (default: 1e-3)
  • betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square. (default: (0.9, 0.999))
  • eps (float, optional) – term added to the denominator to improve numerical stability. (default: 1e-8)
  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
  • amsgrad (boolean, optional) – whether to use the AMSGrad variant of this algorithm from the paper On the Convergence of Adam and Beyond (default: False) NOT SUPPORTED in DeepSpeed CPUAdam!
  • adamw_mode – select between Adam and AdamW implementations (default: AdamW)

FusedAdam (GPU)

class deepspeed.ops.adam.FusedAdam(params, lr=0.001, bias_correction=True, betas=(0.9, 0.999), eps=1e-08, adam_w_mode=True, weight_decay=0.0, amsgrad=False, set_grad_none=True)[source]

Implements Adam algorithm.

Currently GPU-only.

This version of fused Adam implements two fusions:

  • Fusion of the Adam update’s elementwise operations
  • A multi-tensor apply launch that batches the elementwise updates applied to all the model’s parameters into one or a few kernel launches.

Adam was proposed in `Adam: A Method for Stochastic Optimization <https://arxiv.org/abs/1412.6980>`_.

Parameters:
  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups.
  • lr (float, optional) – learning rate. (default: 1e-3)
  • betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square. (default: (0.9, 0.999))
  • eps (float, optional) – term added to the denominator to improve numerical stability. (default: 1e-8)
  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
  • amsgrad (boolean, optional) – whether to use the AMSGrad variant of this algorithm from the paper On the Convergence of Adam and Beyond (default: False) NOT SUPPORTED in FusedAdam!
  • adam_w_mode (boolean, optional) – apply L2 regularization or decoupled weight decay; set True for decoupled weight decay (also known as AdamW). (default: True)
  • set_grad_none (bool, optional) – whether to set gradients to None when the zero_grad() method is called. (default: True)
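
A brief construction sketch based on the signature above (GPU-only; the toy model and hyperparameter values are placeholders, and the fused kernels must be available in your DeepSpeed build):

    import torch
    from deepspeed.ops.adam import FusedAdam

    model = torch.nn.Linear(1024, 1024).cuda()          # FusedAdam is GPU-only

    # adam_w_mode=True applies decoupled weight decay (AdamW behavior).
    optimizer = FusedAdam(model.parameters(), lr=1e-3, betas=(0.9, 0.999),
                          eps=1e-8, adam_w_mode=True, weight_decay=0.01)

    loss = model(torch.randn(8, 1024, device="cuda")).sum()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()    # with set_grad_none=True this sets gradients to None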

FusedLamb (GPU)

class deepspeed.ops.lamb.FusedLamb(params, lr=0.001, bias_correction=True, betas=(0.9, 0.999), eps=1e-08, eps_inside_sqrt=False, weight_decay=0.0, max_grad_norm=0.0, max_coeff=10.0, min_coeff=0.01, amsgrad=False)[source]

Implements the LAMB algorithm. Currently GPU-only.

LAMB was proposed in `Large Batch Optimization for Deep Learning: Training BERT in 76 minutes <https://arxiv.org/abs/1904.00962>`_.

Parameters:
  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups.
  • lr (float, optional) – learning rate. (default: 1e-3)
  • bias_correction (bool, optional) – bias correction (default: True)
  • betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square. (default: (0.9, 0.999))
  • eps (float, optional) – term added to the denominator to improve numerical stability. (default: 1e-8)
  • eps_inside_sqrt (boolean, optional) – in the ‘update parameters’ step, adds eps to the bias-corrected second moment estimate before evaluating square root instead of adding it to the square root of second moment estimate as in the original paper. (default: False)
  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
  • max_grad_norm (float, optional) – value used to clip global grad norm (default: 0.0)
  • max_coeff (float, optional) – maximum value of the lamb coefficient (default: 10.0)
  • min_coeff (float, optional) – minimum value of the lamb coefficient (default: 0.01)
  • amsgrad (boolean, optional) – NOT SUPPORTED in FusedLamb!
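
A brief usage sketch based on the signature above (GPU-only; the toy model and hyperparameter values are placeholders):

    import torch
    from deepspeed.ops.lamb import FusedLamb

    model = torch.nn.Linear(1024, 1024).cuda()          # FusedLamb is GPU-only

    optimizer = FusedLamb(model.parameters(), lr=1e-3, betas=(0.9, 0.999),
                          eps=1e-8, weight_decay=0.01,
                          max_coeff=10.0, min_coeff=0.01)

    loss = model(torch.randn(8, 1024, device="cuda")).sum()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()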

OneBitAdam (GPU)