Optimizers

DeepSpeed offers high-performance implementations of the Adam optimizer on CPU, and of the FusedAdam, FusedLamb, OnebitAdam, and OnebitLamb optimizers on GPU.

Adam (CPU)

class deepspeed.ops.adam.DeepSpeedCPUAdam(model_params, lr=0.001, bias_correction=True, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False, adamw_mode=True, fp32_optimizer_states=True)[source]

Fast vectorized implementation of two variations of the Adam optimizer on CPU: Adam (https://arxiv.org/abs/1412.6980) and AdamW with decoupled weight decay (https://arxiv.org/abs/1711.05101).

DeepSpeed CPU Adam(W) provides a 5x to 7x speedup over torch.optim.Adam(W). To use this optimizer, the model's master parameters (in FP32) must reside in CPU memory.

To train on a heterogeneous system, such as one coordinating CPU and GPU, DeepSpeed offers the ZeRO-Offload technology, which efficiently offloads the optimizer states into CPU memory with minimal impact on training throughput. DeepSpeedCPUAdam plays an important role in minimizing the latency overhead of running the optimizer on CPU. Please refer to the ZeRO-Offload tutorial (https://www.deepspeed.ai/tutorials/zero-offload/) for more information on how to enable this technology.

When calling the step function, there are two options: (1) update the optimizer's states only, or (2) update the optimizer's states and copy the parameters back to the GPU at the same time. We have seen that the second option can bring 30% higher throughput than doing the copy separately as in option one; see the sketch below.
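
Below is a minimal sketch of the two calling patterns. It assumes DeepSpeed's CPU Adam extension can be built and loaded on your system; the fp16_param_groups argument name for option (2) is an assumption based on the description above, so check the step() signature of your DeepSpeed version.

    import torch
    from deepspeed.ops.adam import DeepSpeedCPUAdam

    model = torch.nn.Linear(1024, 1024)  # FP32 master weights kept on CPU
    optimizer = DeepSpeedCPUAdam(model.parameters(), lr=1e-3, adamw_mode=True)

    loss = model(torch.randn(8, 1024)).sum()
    loss.backward()

    # Option (1): update the optimizer's states only.
    optimizer.step()

    # Option (2): update the states and copy the parameters back to the GPU
    # in the same call (argument name assumed; see lead-in above).
    # optimizer.step(fp16_param_groups=my_fp16_param_groups)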

Note

We recommend using our config to allow deepspeed.initialize() to build this optimizer for you.

Parameters:
  • model_params (iterable) – iterable of parameters to optimize or dicts defining parameter groups.
  • lr (float, optional) – learning rate. (default: 1e-3)
  • betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square. (default: (0.9, 0.999))
  • eps (float, optional) – term added to the denominator to improve numerical stability. (default: 1e-8)
  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
  • amsgrad (boolean, optional) – whether to use the AMSGrad variant of this algorithm from the paper On the Convergence of Adam and Beyond (default: False) NOT SUPPORTED in DeepSpeed CPUAdam!
  • adamw_mode – select between Adam and AdamW implementations (default: AdamW)
  • fp32_optimizer_states – creates momentum and variance in full precision regardless of the precision of the parameters (default: True)
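
As the note above recommends, the usual way to obtain this optimizer is through the DeepSpeed config rather than direct construction. The following is a minimal sketch; the model and hyperparameter values are placeholders, and with optimizer offload to CPU enabled, deepspeed.initialize() is expected to build DeepSpeedCPUAdam under the hood.

    import torch
    import deepspeed

    model = torch.nn.Linear(1024, 1024)

    ds_config = {
        "train_batch_size": 8,
        "optimizer": {
            "type": "Adam",
            "params": {"lr": 1e-3, "betas": [0.9, 0.999], "eps": 1e-8, "weight_decay": 0}
        },
        "fp16": {"enabled": True},
        "zero_optimization": {
            "stage": 2,
            "offload_optimizer": {"device": "cpu"}  # keep optimizer states in CPU memory
        }
    }

    # deepspeed.initialize builds the optimizer described in the config.
    model_engine, optimizer, _, _ = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config=ds_config,
    )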

FusedAdam (GPU)

class deepspeed.ops.adam.FusedAdam(params, lr=0.001, bias_correction=True, betas=(0.9, 0.999), eps=1e-08, adam_w_mode=True, weight_decay=0.0, amsgrad=False, set_grad_none=True)[source]

Implements Adam algorithm.

Currently GPU-only.

This version of fused Adam implements two fusions:

  • Fusion of the Adam update’s elementwise operations
  • A multi-tensor apply launch that batches the elementwise updates applied to all the model’s parameters into one or a few kernel launches.

Adam was proposed in Adam: A Method for Stochastic Optimization (https://arxiv.org/abs/1412.6980).

Parameters:
  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups.
  • lr (float, optional) – learning rate. (default: 1e-3)
  • betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square. (default: (0.9, 0.999))
  • eps (float, optional) – term added to the denominator to improve numerical stability. (default: 1e-8)
  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
  • amsgrad (boolean, optional) – whether to use the AMSGrad variant of this algorithm from the paper On the Convergence of Adam and Beyond (default: False) NOT SUPPORTED in FusedAdam!
  • adam_w_mode (boolean, optional) – whether to apply decoupled weight decay (also known as AdamW) instead of L2 regularization; True selects AdamW. (default: True)
  • set_grad_none (bool, optional) – whether to set gradients to None when the zero_grad() method is called. (default: True)
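
FusedAdam can also be constructed directly and used like any other torch optimizer on CUDA parameters (or passed to deepspeed.initialize()). The sketch below assumes a CUDA device and that DeepSpeed's fused Adam op can be built and loaded; the values are placeholders.

    import torch
    from deepspeed.ops.adam import FusedAdam

    model = torch.nn.Linear(1024, 1024).cuda()

    # adam_w_mode=True selects decoupled weight decay (AdamW behavior).
    optimizer = FusedAdam(model.parameters(), lr=1e-3, adam_w_mode=True, weight_decay=0.01)

    loss = model(torch.randn(8, 1024, device="cuda")).sum()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()  # with set_grad_none=True (the default), grads become None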

FusedLamb (GPU)

class deepspeed.ops.lamb.FusedLamb(params, lr=0.001, bias_correction=True, betas=(0.9, 0.999), eps=1e-08, eps_inside_sqrt=False, weight_decay=0.0, max_grad_norm=0.0, max_coeff=10.0, min_coeff=0.01, amsgrad=False)[source]

Implements the LAMB algorithm. Currently GPU-only.

LAMB was proposed in Large Batch Optimization for Deep Learning: Training BERT in 76 minutes (https://arxiv.org/abs/1904.00962).

Parameters:
  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups.
  • lr (float, optional) – learning rate. (default: 1e-3)
  • bias_correction (bool, optional) – bias correction (default: True)
  • betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square. (default: (0.9, 0.999))
  • eps (float, optional) – term added to the denominator to improve numerical stability. (default: 1e-8)
  • eps_inside_sqrt (boolean, optional) – in the ‘update parameters’ step, adds eps to the bias-corrected second moment estimate before evaluating square root instead of adding it to the square root of second moment estimate as in the original paper. (default: False)
  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
  • max_grad_norm (float, optional) – value used to clip global grad norm (default: 0.0)
  • max_coeff (float, optional) – maximum value of the lamb coefficient (default: 10.0)
  • min_coeff (float, optional) – minimum value of the lamb coefficient (default: 0.01)
  • amsgrad (boolean, optional) – NOT SUPPORTED in FusedLamb!
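
FusedLamb is typically selected through the DeepSpeed config. The fragment below is a sketch that assumes the config optimizer type "Lamb" maps to this class; the hyperparameter values are placeholders.

    # DeepSpeed config fragment selecting the LAMB optimizer.
    ds_config = {
        "train_batch_size": 8,
        "optimizer": {
            "type": "Lamb",
            "params": {
                "lr": 1e-3,
                "weight_decay": 0.01,
                "max_coeff": 10.0,   # upper bound on the LAMB trust-ratio coefficient
                "min_coeff": 0.01    # lower bound on the LAMB trust-ratio coefficient
            }
        }
    }
    # Pass ds_config to deepspeed.initialize() as in the CPU Adam sketch above.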

OnebitAdam (GPU)

class deepspeed.runtime.fp16.onebit.adam.OnebitAdam(params, deepspeed=None, lr=0.001, freeze_step=100000, bias_correction=True, betas=(0.9, 0.999), eps=1e-08, eps_inside_sqrt=False, weight_decay=0.0, max_grad_norm=0.0, amsgrad=False, cuda_aware=False, comm_backend_name='nccl')[source]

Implements the 1-bit Adam algorithm. Currently GPU-only. For a usage example, please see https://www.deepspeed.ai/tutorials/onebit-adam/. For technical details, please read https://arxiv.org/abs/2102.02888.

Parameters:
  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups.
  • lr (float, optional) – learning rate. (default: 1e-3)
  • freeze_step (int, optional) – Number of steps for the warmup (uncompressed) stage before we start using compressed communication. (default: 100000)
  • betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square. (default: (0.9, 0.999))
  • eps (float, optional) – term added to the denominator to improve numerical stability. (default: 1e-8)
  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
  • amsgrad (boolean, optional) – whether to use the AMSGrad variant of this algorithm from the paper On the Convergence of Adam and Beyond (default: False) NOT SUPPORTED in 1-bit Adam!
  • eps_inside_sqrt (boolean, optional) – in the ‘update parameters’ step, adds eps to the bias-corrected second moment estimate before evaluating square root instead of adding it to the square root of second moment estimate as in the original paper. (default: False)
  • cuda_aware (boolean, optional) – Set True if the underlying MPI implementation supports CUDA-Aware communication. (default: False)
  • comm_backend_name (string, optional) – Set to ‘mpi’ if needed. (default: ‘nccl’)
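
1-bit Adam is normally enabled through the DeepSpeed config, as shown in the tutorial linked above. The fragment below is a sketch assuming the config optimizer type "OneBitAdam"; the values are placeholders.

    ds_config = {
        "train_batch_size": 8,
        "optimizer": {
            "type": "OneBitAdam",
            "params": {
                "lr": 1e-3,
                "freeze_step": 100000,         # warmup steps with uncompressed communication
                "cuda_aware": False,
                "comm_backend_name": "nccl"
            }
        },
        "fp16": {"enabled": True}
    }
    # Pass ds_config to deepspeed.initialize() together with the model parameters.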

OnebitLamb (GPU)

class deepspeed.runtime.fp16.onebit.lamb.OnebitLamb(params, deepspeed=None, lr=0.001, freeze_step=100000, bias_correction=True, betas=(0.9, 0.999), eps=1e-08, eps_inside_sqrt=False, weight_decay=0.0, max_grad_norm=0.0, max_coeff=10.0, min_coeff=0.01, amsgrad=False, cuda_aware=False, comm_backend_name='nccl', coeff_beta=0.9, factor_max=4.0, factor_min=0.5, factor_threshold=0.1)[source]

Implements the 1-bit Lamb algorithm. Currently GPU-only. For a usage example, please see https://www.deepspeed.ai/tutorials/onebit-lamb/. For technical details, please see our paper https://arxiv.org/abs/2104.06069.

Parameters:
  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups.
  • lr (float, optional) – learning rate. (default: 1e-3)
  • freeze_step (int, optional) – Number of steps for the warmup (uncompressed) stage before we start using compressed communication. (default: 100000)
  • betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square. (default: (0.9, 0.999))
  • eps (float, optional) – term added to the denominator to improve numerical stability. (default: 1e-8)
  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
  • max_coeff (float, optional) – maximum value of the lamb coefficient (default: 10.0)
  • min_coeff (float, optional) – minimum value of the lamb coefficient (default: 0.01)
  • amsgrad (boolean, optional) – whether to use the AMSGrad variant of this algorithm from the paper On the Convergence of Adam and Beyond (default: False) NOT SUPPORTED in 1-bit Lamb!
  • eps_inside_sqrt (boolean, optional) – in the ‘update parameters’ step, adds eps to the bias-corrected second moment estimate before evaluating square root instead of adding it to the square root of second moment estimate as in the original paper. (default: False)
  • cuda_aware (boolean, optional) – Set True if the underlying MPI implementation supports CUDA-Aware communication. (default: False)
  • comm_backend_name (string, optional) – Set to ‘mpi’ if needed. (default: ‘nccl’)
  • coeff_beta (float, optional) – coefficient used for computing running averages of the lamb coefficient (default: 0.9). Note that you may want to increase or decrease this beta depending on the freeze_step you choose, as 1/(1 - coeff_beta) should be smaller than or equal to freeze_step.
  • factor_max (float, optional) – maximum value of scaling factor to the frozen lamb coefficient during compression stage (default: 4.0)
  • factor_min (float, optional) – minimum value of scaling factor to the frozen lamb coefficient during compression stage (default: 0.5)
  • factor_threshold (float, optional) – threshold of how much the scaling factor can fluctuate between steps (default: 0.1)
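
Like 1-bit Adam, 1-bit Lamb is normally enabled through the DeepSpeed config, as shown in the tutorial linked above. The fragment below is a sketch assuming the config optimizer type "OneBitLamb"; the values are placeholders.

    ds_config = {
        "train_batch_size": 8,
        "optimizer": {
            "type": "OneBitLamb",
            "params": {
                "lr": 1e-3,
                "freeze_step": 100000,
                "cuda_aware": False,
                "comm_backend_name": "nccl",
                "coeff_beta": 0.9,
                "factor_max": 4.0,
                "factor_min": 0.5,
                "factor_threshold": 0.1
            }
        },
        "fp16": {"enabled": True}
    }
    # Pass ds_config to deepspeed.initialize() together with the model parameters.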