deepspeed.pt package

Submodules

deepspeed.pt.deepspeed_config module

Copyright (c) Microsoft Corporation. Licensed under the MIT license.

class deepspeed.pt.deepspeed_config.DeepSpeedConfig(json_file, mpu=None, param_dict=None)

Bases: object

print(name)
class deepspeed.pt.deepspeed_config.DeepSpeedConfigWriter(data=None)

Bases: object

add_config(key, value)
load_config(filename)
write_config(filename)
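
A minimal sketch of writing and re-reading a DeepSpeed configuration with these two classes. It assumes write_config serializes the collected key/value pairs as JSON that DeepSpeedConfig can parse back; the file name and keys shown are illustrative.

from deepspeed.pt.deepspeed_config import DeepSpeedConfig, DeepSpeedConfigWriter

# Collect a few settings in memory and serialize them to disk.
writer = DeepSpeedConfigWriter()
writer.add_config('train_batch_size', 8)
writer.add_config('gradient_accumulation_steps', 1)
writer.write_config('ds_config.json')

# Parse the file back and print the resulting configuration.
config = DeepSpeedConfig('ds_config.json')
config.print('my job configuration')
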
deepspeed.pt.deepspeed_config.get_allgather_size(param_dict)
deepspeed.pt.deepspeed_config.get_allreduce_always_fp32(param_dict)
deepspeed.pt.deepspeed_config.get_amp_enabled(param_dict)
deepspeed.pt.deepspeed_config.get_amp_params(param_dict)
deepspeed.pt.deepspeed_config.get_disable_allgather(param_dict)
deepspeed.pt.deepspeed_config.get_dump_state(param_dict)
deepspeed.pt.deepspeed_config.get_dynamic_loss_scale_args(param_dict)
deepspeed.pt.deepspeed_config.get_fp16_enabled(param_dict)
deepspeed.pt.deepspeed_config.get_gradient_accumulation_steps(param_dict)
deepspeed.pt.deepspeed_config.get_gradient_clipping(param_dict)
deepspeed.pt.deepspeed_config.get_gradient_predivide_factor(param_dict)
deepspeed.pt.deepspeed_config.get_initial_dynamic_scale(param_dict)
deepspeed.pt.deepspeed_config.get_loss_scale(param_dict)
deepspeed.pt.deepspeed_config.get_memory_breakdown(param_dict)
deepspeed.pt.deepspeed_config.get_optimizer_gradient_clipping(param_dict)
deepspeed.pt.deepspeed_config.get_optimizer_legacy_fusion(param_dict)
deepspeed.pt.deepspeed_config.get_optimizer_name(param_dict)
deepspeed.pt.deepspeed_config.get_optimizer_params(param_dict)
deepspeed.pt.deepspeed_config.get_prescale_gradients(param_dict)
deepspeed.pt.deepspeed_config.get_scheduler_name(param_dict)
deepspeed.pt.deepspeed_config.get_scheduler_params(param_dict)
deepspeed.pt.deepspeed_config.get_sparse_gradients_enabled(param_dict)
deepspeed.pt.deepspeed_config.get_steps_per_print(param_dict)
deepspeed.pt.deepspeed_config.get_tensorboard_enabled(param_dict)
deepspeed.pt.deepspeed_config.get_tensorboard_job_name(param_dict)
deepspeed.pt.deepspeed_config.get_tensorboard_output_path(param_dict)
deepspeed.pt.deepspeed_config.get_train_batch_size(param_dict)
deepspeed.pt.deepspeed_config.get_train_micro_batch_size_per_gpu(param_dict)
deepspeed.pt.deepspeed_config.get_wall_clock_breakdown(param_dict)
deepspeed.pt.deepspeed_config.get_zero_allow_untested_optimizer(param_dict)
deepspeed.pt.deepspeed_config.get_zero_max_elements_per_comm(param_dict)
deepspeed.pt.deepspeed_config.get_zero_optimization(param_dict)
deepspeed.pt.deepspeed_config.get_zero_reduce_scatter(param_dict)

deepspeed.pt.deepspeed_constants module

Copyright (c) Microsoft Corporation. Licensed under the MIT license.

deepspeed.pt.deepspeed_csr_tensor module

Copyright 2020 The Microsoft DeepSpeed Team

Implementation of a compressed sparse row (CSR) tensor. Similar in functionality to TensorFlow’s IndexedSlices implementation.

class deepspeed.pt.deepspeed_csr_tensor.CSRTensor(dense_tensor=None)

Bases: object

Compressed Sparse Row (CSR) Tensor

add(b)
sparse_size()
to_dense()
static type()
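
A minimal sketch of compressing a mostly-zero dense tensor and expanding it back; the interpretation of sparse_size() as a sparse-versus-dense storage comparison is an assumption.

import torch
from deepspeed.pt.deepspeed_csr_tensor import CSRTensor

# A mostly-zero dense tensor, e.g. an embedding gradient where only a few rows were touched.
dense = torch.zeros(4, 8)
dense[1] = 1.0
dense[3] = 2.0

csr = CSRTensor(dense_tensor=dense)   # compress row-wise
print(csr.sparse_size())              # sparse vs. dense storage (assumed meaning)
restored = csr.to_dense()             # expand back to a dense tensor of the original shape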

deepspeed.pt.deepspeed_dataloader module

Copyright 2019 The Microsoft DeepSpeed Team

class deepspeed.pt.deepspeed_dataloader.DeepSpeedDataLoader(dataset, batch_size, pin_memory, local_rank, tput_timer, collate_fn=None, num_local_io_workers=None, data_sampler=None, data_parallel_world_size=None, data_parallel_rank=None)

Bases: object

deepspeed.pt.deepspeed_fused_lamb module

Copyright 2019 The Microsoft DeepSpeed Team

Copyright NVIDIA/apex. This file is adapted from NVIDIA/apex/optimizer/fused_adam and implements the LAMB optimizer.

class deepspeed.pt.deepspeed_fused_lamb.FusedLamb(params, lr=0.001, bias_correction=True, betas=(0.9, 0.999), eps=1e-08, eps_inside_sqrt=False, weight_decay=0.0, max_grad_norm=0.0, max_coeff=10.0, min_coeff=0.01, amsgrad=False)

Bases: torch.optim.Optimizer

Implements the LAMB algorithm. Currently GPU-only. Requires the DeepSpeed-adapted Apex to be installed via python setup.py install --cuda_ext --cpp_ext.

For a usage example, see the DeepSpeed tutorial (TODO); a brief illustrative sketch also follows the parameter list below.

It has been proposed in Large Batch Optimization for Deep Learning: Training BERT in 76 Minutes (https://arxiv.org/abs/1904.00962).

Parameters:
  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups.
  • lr (float, optional) – learning rate. (default: 1e-3)
  • betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square. (default: (0.9, 0.999))
  • eps (float, optional) – term added to the denominator to improve numerical stability. (default: 1e-8)
  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
  • max_coeff (float, optional) – maximum value of the lamb coefficient (default: 10.0)
  • min_coeff (float, optional) – minimum value of the lamb coefficient (default: 0.01)
  • amsgrad (boolean, optional) – whether to use the AMSGrad variant of this algorithm from the paper On the Convergence of Adam and Beyond (default: False). NOT SUPPORTED in FusedLamb!
  • eps_inside_sqrt (boolean, optional) – in the ‘update parameters’ step, adds eps to the bias-corrected second moment estimate before evaluating square root instead of adding it to the square root of second moment estimate as in the original paper. (default: False)
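
As referenced above, a brief illustrative sketch of FusedLamb used as a drop-in torch optimizer. In practice it is usually wrapped by the DeepSpeed engine, with an FP16 optimizer handling mixed precision and loss scaling; the model and data here are placeholders, and the DeepSpeed-adapted Apex CUDA extensions must be installed.

import torch
from deepspeed.pt.deepspeed_fused_lamb import FusedLamb

model = torch.nn.Linear(128, 10).cuda()
optimizer = FusedLamb(model.parameters(),
                      lr=1e-3,
                      weight_decay=0.01,
                      max_coeff=10.0,   # upper bound on the per-layer LAMB trust ratio
                      min_coeff=0.01)   # lower bound on the per-layer LAMB trust ratio

x = torch.randn(32, 128, device='cuda')
loss = model(x).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()
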
get_lamb_coeffs()
step(closure=None, grads=None, output_params=None, scale=1.0, grad_norms=None)

Performs a single optimization step.

Parameters:
  • closure (callable, optional) – A closure that reevaluates the model and returns the loss.
  • grads (list of tensors, optional) – weight gradient to use for the optimizer update. If gradients have type torch.half, parameters are expected to be in type torch.float. (default: None)
  • output_params (list of tensors, optional) – A reduced precision copy of the updated weights written out in addition to the regular updated weights. Must be of the same type as the gradients. (default: None)
  • scale (float, optional) – factor to divide gradient tensor values by before applying to weights. (default: 1)

deepspeed.pt.deepspeed_launch module

Copyright 2020 The Microsoft DeepSpeed Team: deepspeed@microsoft.com

deepspeed.pt.deepspeed_launch.main()
deepspeed.pt.deepspeed_launch.parse_args()

deepspeed.pt.deepspeed_light module

Copyright 2019 The Microsoft DeepSpeed Team

class deepspeed.pt.deepspeed_light.DeepSpeedLight(args, model, optimizer=None, model_parameters=None, training_data=None, lr_scheduler=None, mpu=None, dist_init_required=None, collate_fn=None, config_params=None)

Bases: torch.nn.Module

DeepSpeed engine for training.

all_gather_scalar(value)
allgather_size()
allreduce_always_fp32()
allreduce_and_copy(small_bucket)
allreduce_bucket(bucket)
allreduce_gradients(bucket_size=500000000)
allreduce_no_retain(bucket, numel_per_bucket=500000000)
amp_enabled()
amp_params()
backward(loss, allreduce_gradients=True)

Execute a backward pass on the loss.

Parameters:
  • loss – Torch tensor on which to execute backward propagation
  • allreduce_gradients – If this is False, then gradient averaging will be skipped. Default is True.
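
A minimal training-loop sketch around forward, backward, and step. It assumes args, model, train_dataset, criterion, and data_loader already exist on the client side, and that calling the engine object dispatches to forward(); only the engine calls come from this API.

from deepspeed.pt.deepspeed_light import DeepSpeedLight

engine = DeepSpeedLight(args,
                        model,
                        model_parameters=model.parameters(),
                        training_data=train_dataset)

for batch, labels in data_loader:
    outputs = engine(batch)        # calls forward(*inputs, **kwargs)
    loss = criterion(outputs, labels)
    engine.backward(loss)          # handles loss scaling; allreduces gradients by default
    engine.step()                  # weight update at gradient accumulation boundaries
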
buffered_allreduce_fallback(grads=None, elements_per_buffer=500000000)
clip_fp32_gradients()
csr_all_gather(value)
csr_allreduce(csr)
csr_allreduce_bucket(bucket)
csr_allreduce_no_retain(bucket)
deepspeed_io(dataset, batch_size=None, route='train', pin_memory=True, data_sampler=None, collate_fn=None, num_local_io_workers=None)
dump_state()
dynamic_loss_scale()
dynamic_loss_scale_args()
eval()
forward(*inputs, **kwargs)

Execute forward propagation

Parameters:
  • *inputs – Variable length input list
  • **kwargs – variable length keyword arguments
fp16_enabled()
get_lr()
get_mom()
get_summary_writer(name='DeepSpeedJobName', base='/home/docs/tensorboard')
gradient_accumulation_steps()
gradient_clipping()
gradient_predivide_factor()
initial_dynamic_scale()
is_gradient_accumulation_boundary()
load_checkpoint(load_dir, tag, load_module_strict=True, load_optimizer_states=True, load_lr_scheduler_states=True)

Load training checkpoint

Parameters:
  • load_dir – Required. Directory to load the checkpoint from
  • tag – Required. Checkpoint tag used as a unique identifier for the checkpoint. Ex. Global Step.
  • load_module_strict – Optional. Boolean to strictly enforce that the keys in state_dict of module and checkpoint match.
  • load_optimizer_states – Optional. Boolean to load the training optimizer states from Checkpoint. Ex. ADAM’s momentum and variance
  • load_lr_scheduler_states – Optional. Boolean to load the learning rate scheduler states from the checkpoint.
Returns:

A tuple of (load_path, client_state):
  • load_path – Path of the loaded checkpoint; None if loading the checkpoint failed.
  • client_state – State dictionary used for loading required training states in the client code.

load_module_state_dict(state_dict, strict=True)
loss_scale()
memory_breakdown()
module_state_dict(destination=None, prefix='', keep_vars=False)
optimizer_legacy_fusion()
optimizer_name()
optimizer_params()
postscale_gradients()
save_checkpoint(save_dir, tag, client_state={})

Save training checkpoint

Parameters:
  • save_dir – Required. Directory for saving the checkpoint
  • tag – Required. Checkpoint tag used as a unique identifier for the checkpoint. Ex. Global Step.
  • client_state – Optional. State dictionary used for saving required training states in the client code.
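
A brief sketch of saving a checkpoint and resuming from it, assuming an initialized engine; the directory, tag, and client_state contents are illustrative.

tag = 'global_step1000'
engine.save_checkpoint('checkpoints/', tag, client_state={'custom_step': 1000})

# Later, to resume:
load_path, client_state = engine.load_checkpoint('checkpoints/', tag)
if load_path is None:
    raise RuntimeError('failed to load checkpoint')
custom_step = client_state['custom_step']   # client state round-trips through the checkpoint
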
scheduler_name()
scheduler_params()
sparse_gradients_enabled()
step()

Execute the weight update step after forward and backward propagation on effective_train_batch

steps_per_print()
tensorboard_enabled()
tensorboard_job_name()
tensorboard_output_path()
train()
train_batch_size()
train_micro_batch_size_per_gpu()
wall_clock_breakdown()
zero_allgather_bucket_size()
zero_allgather_partitions()
zero_allow_untested_optimizer()
zero_contiguous_gradients()
zero_grad()

Zero parameter grads.

zero_max_elements_per_comm()
zero_optimization()
zero_optimization_partition_gradients()
zero_optimization_stage()
zero_overlap_comm()
zero_reduce_bucket_size()
zero_reduce_scatter()
deepspeed.pt.deepspeed_light.print_configuration(args, name)
deepspeed.pt.deepspeed_light.split_half_float_double_csr(tensors)

deepspeed.pt.deepspeed_lr_schedules module

Copyright 2019 The Microsoft DeepSpeed Team

Implementation of learning rate schedules.

Taken and modified from PyTorch v1.0.1 source https://github.com/pytorch/pytorch/blob/v1.1.0/torch/optim/lr_scheduler.py

class deepspeed.pt.deepspeed_lr_schedules.LRRangeTest(optimizer: torch.optim.Optimizer, lr_range_test_min_lr: float = 0.001, lr_range_test_step_size: int = 2000, lr_range_test_step_rate: float = 1.0, lr_range_test_staircase: bool = False, last_batch_iteration: int = -1)

Bases: object

Sets the learning rate of each parameter group according to the learning rate range test (LRRT) policy. The policy increases the learning rate from a base value with a constant frequency, as detailed in the paper A disciplined approach to neural network hyper-parameters: Part 1.

The LRRT policy is used for finding the maximum LR that trains a model without divergence, and can be used to configure the LR boundaries for Cyclic LR schedules.

LRRT changes the learning rate after every batch. step should be called after a batch has been used for training.

Parameters:
  • optimizer (Optimizer) – Wrapped optimizer.
  • lr_range_test_min_lr (float or list) – Initial learning rate which is the lower boundary in the range test for each parameter group.
  • lr_range_test_step_size (int) – Interval of training steps to increase learning rate. Default: 2000
  • lr_range_test_step_rate (float) – Scaling rate for range test. Default: 1.0
  • lr_range_test_staircase (bool) – Scale in staircase fashion, rather than continuous. Default: False.
  • last_batch_iteration (int) – The index of the last batch. This parameter is used when resuming a training job. Since step() should be invoked after each batch instead of after each epoch, this number represents the total number of batches computed, not the total number of epochs computed. When last_batch_iteration=-1, the schedule is started from the beginning. Default: -1

Example

>>> optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
>>> scheduler = LRRangeTest(optimizer)
>>> data_loader = torch.utils.data.DataLoader(...)
>>> for epoch in range(10):
>>>     for batch in data_loader:
>>>         train_batch(...)
>>>         scheduler.step()

A disciplined approach to neural network hyper-parameters: Part 1 – learning rate, batch size, momentum, and weight decay: https://arxiv.org/abs/1803.09820

get_lr()
load_state_dict(sd)
state_dict()
step(batch_iteration=None)
class deepspeed.pt.deepspeed_lr_schedules.OneCycle(optimizer, cycle_min_lr, cycle_max_lr, decay_lr_rate=0.0, cycle_first_step_size=2000, cycle_second_step_size=None, cycle_first_stair_count=0, cycle_second_stair_count=None, decay_step_size=0, cycle_momentum=True, cycle_min_mom=0.8, cycle_max_mom=0.9, decay_mom_rate=0.0, last_batch_iteration=-1)

Bases: object

Sets the learning rate of each parameter group according to 1Cycle learning rate policy (1CLR). 1CLR is a variation of the Cyclical Learning Rate (CLR) policy that involves one cycle followed by decay. The policy simultaneously cycles the learning rate (and momentum) between two boundaries with a constant frequency, as detailed in the paper A disciplined approach to neural network hyper-parameters.

1CLR policy changes the learning rate after every batch. step should be called after a batch has been used for training.

This implementation was adapted from the GitHub repo pytorch/pytorch.

Parameters:
  • optimizer (Optimizer) – Wrapped optimizer.
  • cycle_min_lr (float or list) – Initial learning rate which is the lower boundary in the cycle for each parameter group.
  • cycle_max_lr (float or list) – Upper learning rate boundaries in the cycle for each parameter group. Functionally, it defines the cycle amplitude (cycle_max_lr - cycle_min_lr). The lr at any cycle is the sum of cycle_min_lr and some scaling of the amplitude; therefore cycle_max_lr may not actually be reached depending on scaling function.
  • decay_lr_rate (float) – Decay rate for learning rate. Default: 0.
  • cycle_first_step_size (int) – Number of training iterations in the increasing half of a cycle. Default: 2000
  • cycle_second_step_size (int) – Number of training iterations in the decreasing half of a cycle. If cycle_second_step_size is None, it is set to cycle_first_step_size. Default: None
  • cycle_first_stair_count (int) – Number of stairs in the first half of the cycle phase. This means lr/mom are changed in staircase fashion. Default: 0, meaning staircase is disabled.
  • cycle_second_stair_count (int) – Number of stairs in the second half of the cycle phase. This means lr/mom are changed in staircase fashion. Default: 0, meaning staircase is disabled.
  • decay_step_size (int) – Intervals for applying decay in decay phase. Default: 0, means no decay.
  • cycle_momentum (bool) – If True, momentum is cycled inversely to learning rate between ‘cycle_min_mom’ and ‘cycle_max_mom’. Default: True
  • cycle_min_mom (float or list) – Initial momentum which is the lower boundary in the cycle for each parameter group. Default: 0.8
  • cycle_max_mom (float or list) – Upper momentum boundaries in the cycle for each parameter group. Functionally, it defines the cycle amplitude (cycle_max_mom - cycle_min_mom). The momentum at any cycle is the difference of cycle_max_mom and some scaling of the amplitude; therefore cycle_min_mom may not actually be reached depending on scaling function. Default: 0.9
  • decay_mom_rate (float) – Decay rate for momentum. Default: 0.
  • last_batch_iteration (int) – The index of the last batch. This parameter is used when resuming a training job. Since step() should be invoked after each batch instead of after each epoch, this number represents the total number of batches computed, not the total number of epochs computed. When last_batch_iteration=-1, the schedule is started from the beginning. Default: -1

Example

>>> optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
>>> scheduler = OneCycle(optimizer, cycle_min_lr=0.001, cycle_max_lr=0.01)
>>> data_loader = torch.utils.data.DataLoader(...)
>>> for epoch in range(10):
>>>     for batch in data_loader:
>>>         train_batch(...)
>>>         scheduler.step()
get_lr()

Calculates the learning rate at the current batch index, treating self.last_batch_iteration as the last batch index.

If self.cycle_momentum is True, this function has a side effect of updating the optimizer’s momentum.

load_state_dict(sd)
state_dict()
step(batch_iteration=None)
class deepspeed.pt.deepspeed_lr_schedules.WarmupLR(optimizer: torch.optim.Optimizer, warmup_min_lr: float = 0.0, warmup_max_lr: float = 0.001, warmup_num_steps: int = 1000, last_batch_iteration: int = -1)

Bases: object

Increases the learning rate of each parameter group from min lr to max lr over warmup_num_steps steps, and then holds it fixed at max lr.

Parameters:
  • optimizer (Optimizer) – Wrapped optimizer.
  • warmup_min_lr (float or list) – minimum learning rate. Default: 0
  • warmup_max_lr (float or list) – maximum learning rate. Default: 0.001
  • warmup_num_steps (int) – number of steps to warm up from min_lr to max_lr. Default: 1000
  • last_batch_iteration (int) – The index of the last batch. Default: -1.

Example

>>> optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
>>> scheduler = WarmupLR(optimizer)
>>> data_loader = torch.utils.data.DataLoader(...)
>>> for epoch in range(10):
>>>     for batch in data_loader:
>>>         train_batch(...)
>>>         scheduler.step()
get_lr()
load_state_dict(sd)
state_dict()
step(last_batch_iteration=None)
deepspeed.pt.deepspeed_lr_schedules.add_tuning_arguments(parser)
deepspeed.pt.deepspeed_lr_schedules.get_config_from_args(args)
deepspeed.pt.deepspeed_lr_schedules.get_lr_from_config(config)
deepspeed.pt.deepspeed_lr_schedules.get_torch_optimizer(optimizer)
deepspeed.pt.deepspeed_lr_schedules.override_1cycle_params(args, params)
deepspeed.pt.deepspeed_lr_schedules.override_lr_range_test_params(args, params)
deepspeed.pt.deepspeed_lr_schedules.override_params(args, params)
deepspeed.pt.deepspeed_lr_schedules.override_warmupLR_params(args, params)
deepspeed.pt.deepspeed_lr_schedules.parse_arguments()

deepspeed.pt.deepspeed_run module

Copyright 2020 The Microsoft DeepSpeed Team

deepspeed.pt.deepspeed_run.encode_world_info(world_info)
deepspeed.pt.deepspeed_run.fetch_hostfile(hostfile_path)
deepspeed.pt.deepspeed_run.main(args=None)
deepspeed.pt.deepspeed_run.parse_args(args=None)
deepspeed.pt.deepspeed_run.parse_inclusion_exclusion(resource_pool, inclusion, exclusion)
deepspeed.pt.deepspeed_run.parse_resource_filter(host_info, include_str='', exclude_str='')

Parse an inclusion or exclusion string and filter a hostfile dictionary.

String format is NODE_SPEC[@NODE_SPEC …], where
NODE_SPEC = NAME[:SLOT[,SLOT …]].

If :SLOT is omitted, include/exclude all slots on that host.

Examples

include_str="worker-0@worker-1:0,2" will use all slots on worker-0 and slots [0, 2] on worker-1.
exclude_str="worker-1:0" will use all available resources except slot 0 on worker-1.

deepspeed.pt.deepspeed_timer module

Copyright 2019 The Microsoft DeepSpeed Team

class deepspeed.pt.deepspeed_timer.SynchronizedWallClockTimer

Bases: object

Group of timers. Borrowed from Nvidia Megatron code.

class Timer(name)

Bases: object

Timer.

elapsed(reset=True)

Calculate the elapsed time.

reset()

Reset timer.

start()

Start the timer.

stop()

Stop the timer.

log(names, normalizer=1.0, reset=True, memory_breakdown=False)

Log a group of timers.

static memory_usage()
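
A short sketch of timing a region of work. Only Timer.start/stop/elapsed and log() are documented above; fetching a named timer via timers('forward') is an assumption carried over from the Megatron timers this class borrows from, and run_forward_pass() is a placeholder.

from deepspeed.pt.deepspeed_timer import SynchronizedWallClockTimer

timers = SynchronizedWallClockTimer()

timers('forward').start()      # assumed accessor: look up (or create) a timer by name
run_forward_pass()             # placeholder for real work
timers('forward').stop()

timers.log(['forward'])        # print elapsed time for the named timers
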
class deepspeed.pt.deepspeed_timer.ThroughputTimer(batch_size, num_workers, start_step=2, steps_per_output=50, monitor_memory=True, logging_fn=None)

Bases: object

avg_samples_per_sec()
start()
stop(report_speed=True)
update_epoch_count()
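
A short sketch of measuring samples per second with ThroughputTimer; the batch size and worker count are illustrative, and train_one_step() is a placeholder.

from deepspeed.pt.deepspeed_timer import ThroughputTimer

tput = ThroughputTimer(batch_size=32, num_workers=8, steps_per_output=50)

for step in range(1000):
    tput.start()
    train_one_step()               # placeholder for a real training step
    tput.stop()                    # periodically reports speed every steps_per_output steps

print(tput.avg_samples_per_sec())  # average samples/second over the timed steps
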
deepspeed.pt.deepspeed_timer.print_rank_0(message)

deepspeed.pt.deepspeed_utils module

Copyright 2019 The Microsoft DeepSpeed Team

Copyright NVIDIA/Megatron

Helper functions and classes from multiple sources.

class deepspeed.pt.deepspeed_utils.CheckOverflow(param_groups=None, mpu=None, zero_reduce_scatter=False)

Bases: object

Checks for gradient overflow across parallel processes.

check(param_groups=None)
check_using_norm(norm_group)
has_overflow(params)
has_overflow_serial(params)
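
A minimal sketch of the overflow check, assuming check() returns True when any tracked gradient is inf/nan; that return convention and the surrounding optimizer/loss objects are assumptions.

from deepspeed.pt.deepspeed_utils import CheckOverflow

overflow_checker = CheckOverflow(param_groups=optimizer.param_groups)

loss.backward()
if overflow_checker.check():    # assumed: True when an overflow is detected across ranks
    optimizer.zero_grad()       # skip this update, as a loss scaler would
else:
    optimizer.step()
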
deepspeed.pt.deepspeed_utils.get_grad_norm(parameters, norm_type=2, mpu=None)

Computes the gradient norm of an iterable of parameters.

This is adapted from torch.nn.utils.clip_grad.clip_grad_norm_ with added functionality to handle model parallel parameters. Taken from Nvidia Megatron.

Parameters:
  • parameters (Iterable[Tensor] or Tensor) – an iterable of Tensors or a single Tensor whose gradient norm will be computed
  • norm_type (float or int) – type of the used p-norm. Can be 'inf' for infinity norm.
Returns:

Total norm of the parameters (viewed as a single vector).
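
A one-line usage sketch, reading the function (consistent with its name and return value) as computing the total gradient norm rather than clipping; model and loss are assumed.

from deepspeed.pt.deepspeed_utils import get_grad_norm

loss.backward()
total_norm = get_grad_norm(model.parameters(), norm_type=2)   # total L2 norm of all gradients
print(f'grad norm: {total_norm:.4f}')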

deepspeed.pt.deepspeed_utils.get_weight_norm(parameters, norm_type=2, mpu=None)

Computes the norm of an iterable of parameters (i.e., the weight norm).

This is adapted from torch.nn.utils.clip_grad.clip_grad_norm_ with added functionality to handle model parallel parameters. Taken from Nvidia Megatron.

Parameters:
  • parameters (Iterable[Tensor] or Tensor) – an iterable of Tensors or a single Tensor whose norm will be computed
  • norm_type (float or int) – type of the used p-norm. Can be 'inf' for infinity norm.
Returns:

Total norm of the parameters (viewed as a single vector).

deepspeed.pt.deepspeed_utils.is_model_parallel_parameter(p)
deepspeed.pt.deepspeed_utils.see_memory_usage(message)

deepspeed.pt.deepspeed_zero_optimizer module

Copyright 2019 The Microsoft DeepSpeed Team

class deepspeed.pt.deepspeed_zero_optimizer.FP16_DeepSpeedZeroOptimizer(init_optimizer, timers, static_loss_scale=1.0, dynamic_loss_scale=False, dynamic_loss_args=None, verbose=True, contiguous_gradients=True, reduce_bucket_size=500000000, allgather_bucket_size=5000000000, dp_process_group=None, reduce_scatter=True, overlap_comm=False, mpu=None, clip_grad=0.0, allreduce_always_fp32=False, postscale_gradients=True, gradient_predivide_factor=1.0)

Bases: object

DeepSpeedZeroOptimizer is designed to reduce the memory footprint required for training large deep learning models.

For more details, please see ZeRO: Memory Optimizations Toward Training Trillion Parameter Models (https://arxiv.org/abs/1910.02054).

For usage examples, refer to the DeepSpeed tutorial (TODO).

allreduce_and_copy(small_bucket, rank=None, log=None)
allreduce_bucket(bucket, allreduce_always_fp32=False, rank=None, log=None)
allreduce_no_retain(bucket, numel_per_bucket=500000000, rank=None, log=None)
average_tensor(tensor)
backward(loss, retain_graph=False)

backward performs the following steps:

  1. fp32_loss = loss.float()
  2. scaled_loss = fp32_loss*loss_scale
  3. scaled_loss.backward(), which accumulates scaled gradients into the .grad attributes of the model’s fp16 leaves
buffered_reduce_fallback(rank, grads, elements_per_buffer=500000000, log=None)
check_overflow(partition_gradients=True)
copy_grads_in_partition(param)
create_reduce_and_remove_grad_hooks()
cur_scale
flatten_and_print(message, tensors, start=0, n=5)
free_grad_in_param_list(param_list)
get_data_parallel_partitions(tensor)
get_first_param_index(group_id, param_group, partition_id)
get_flat_partition(tensor_list, first_offset, partition_size, dtype, device, return_tensor_list=False)
get_grad_norm_direct(gradients, params, norm_type=2)

Computes the gradient norm of an iterable of gradients.

This is adapted from torch.nn.utils.clip_grad.clip_grad_norm_ with added functionality to handle model parallel parameters.

Parameters:
  • gradients – an iterable of gradient Tensors whose total norm will be computed
  • params – the corresponding parameters, used to handle model parallel parameters
  • norm_type (float or int) – type of the used p-norm. Can be 'inf' for infinity norm.
Returns:

Total norm of the parameters (viewed as a single vector).

get_grads_to_reduce(i, partition_id)
get_param_id(param)
get_partition_info(tensor_list, partition_size, partition_id)
gradient_reduction_w_predivide(tensor)
has_overflow(partition_gradients=True)
has_overflow_partitioned_grads_serial()
has_overflow_serial(params, is_grad_list=False)
independent_gradient_partition_epilogue()
initialize_gradient_partition(i, param_group, partition_id)
initialize_gradient_partitioning_data_structures()
initialize_optimizer_states()
load_state_dict(state_dict, load_optimizer_states=True)

Loads a state_dict created by an earlier call to state_dict(). If fp16_optimizer_instance was constructed from some init_optimizer, whose parameters in turn came from model, it is expected that the user will call model.load_state_dict() before fp16_optimizer_instance.load_state_dict() is called. Example:

model = torch.nn.Linear(D_in, D_out).cuda().half()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
optimizer = FP16_Optimizer(optimizer, static_loss_scale = 128.0)
...
checkpoint = torch.load("saved.pth")
model.load_state_dict(checkpoint['model'])
optimizer.load_state_dict(checkpoint['optimizer'])
loss_scale
overlapping_partition_gradients_reduce_epilogue()
param_groups
print_rank_0(message)
reduce_independent_p_g_buckets_and_remove_grads(param, i)
reduce_ipg_grads()
reduce_ready_partitions_and_remove_grads(param, i)
refresh_fp32_params()
report_ipg_memory_usage(tag, param_elems)
reset_partition_gradient_structures()
sequential_execution(function, message, group=None)
set_none_gradients_to_zero(i, partition_id)
state
state_dict()

Returns a dict containing the current state of this FP16_Optimizer instance. This dict contains attributes of FP16_Optimizer, as well as the state_dict of the contained PyTorch optimizer. Example:

checkpoint = {}
checkpoint['model'] = model.state_dict()
checkpoint['optimizer'] = optimizer.state_dict()
torch.save(checkpoint, "saved.pth")
step(closure=None)

Closures are not supported.

unscale_and_clip_grads(grad_groups_flat, norm_groups)
zero_grad(set_grads_to_None=True)

Zero FP16 parameter grads.

zero_reduced_gradients(partition_id, i)
deepspeed.pt.deepspeed_zero_optimizer.flatten_dense_tensors_aligned(tensor_list, alignment, pg)
deepspeed.pt.deepspeed_zero_optimizer.input(msg)
deepspeed.pt.deepspeed_zero_optimizer.isclose(a, b, rtol=1e-09, atol=0.0)
deepspeed.pt.deepspeed_zero_optimizer.lcm(x, y)
deepspeed.pt.deepspeed_zero_optimizer.move_to_cpu(tensor_list)
deepspeed.pt.deepspeed_zero_optimizer.split_half_float_double(tensors)

deepspeed.pt.fp16_optimizer module

Copyright 2019 The Microsoft DeepSpeed Team

Copyright NVIDIA/apex. This file is adapted from FP16_Optimizer in NVIDIA/apex.

class deepspeed.pt.fp16_optimizer.FP16_Optimizer(init_optimizer, static_loss_scale=1.0, dynamic_loss_scale=False, initial_dynamic_scale=4294967296, dynamic_loss_args=None, verbose=True, mpu=None, clip_grad=0.0, fused_adam_legacy=False, timers=None)

Bases: object

FP16 Optimizer for training fp16 models. Handles loss scaling.

For a usage example, see the DeepSpeed V2 tutorial (TODO); a brief illustrative sketch follows.
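
A brief illustrative sketch of wrapping a standard PyTorch optimizer; the dynamic_loss_args keys are assumed to mirror the DynamicLossScaler constructor documented in deepspeed.pt.loss_scaler, and the model and data are placeholders.

import torch
from deepspeed.pt.fp16_optimizer import FP16_Optimizer

model = torch.nn.Linear(512, 512).cuda().half()
base_optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

optimizer = FP16_Optimizer(base_optimizer,
                           dynamic_loss_scale=True,
                           dynamic_loss_args={'init_scale': 2 ** 16, 'scale_window': 500})

x = torch.randn(64, 512, device='cuda', dtype=torch.half)
loss = model(x).float().sum()

optimizer.zero_grad()
optimizer.backward(loss)   # scales the loss before backpropagation, per the steps listed below
optimizer.step()           # unscales/clips gradients and updates the master weights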

backward(loss)

backward performs the following steps:

  1. fp32_loss = loss.float()
  2. scaled_loss = fp32_loss*loss_scale
  3. scaled_loss.backward(), which accumulates scaled gradients into the .grad attributes of the model’s fp16 leaves
load_state_dict(state_dict, load_optimizer_states=True)

Loads a state_dict created by an earlier call to state_dict(). If fp16_optimizer_instance was constructed from some init_optimizer, whose parameters in turn came from model, it is expected that the user will call model.load_state_dict() before fp16_optimizer_instance.load_state_dict() is called. Example:

model = torch.nn.Linear(D_in, D_out).cuda().half()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
optimizer = FP16_Optimizer(optimizer, static_loss_scale = 128.0)
...
checkpoint = torch.load("saved.pth")
model.load_state_dict(checkpoint['model'])
optimizer.load_state_dict(checkpoint['optimizer'])
log_timers(name_list)
param_groups
refresh_fp32_params()
start_timers(name_list)
state
state_dict()

Returns a dict containing the current state of this FP16_Optimizer instance. This dict contains attributes of FP16_Optimizer, as well as the state_dict of the contained PyTorch optimizer. Example:

checkpoint = {}
checkpoint['model'] = model.state_dict()
checkpoint['optimizer'] = optimizer.state_dict()
torch.save(checkpoint, "saved.pth")
step(closure=None)

Closures are not supported.

step_fused_adam(closure=None)

Closures are not supported.

stop_timers(name_list)
unscale_and_clip_grads(grad_groups_flat, norm_groups, apply_scale=True)
zero_grad(set_grads_to_None=True)

Zero FP16 parameter grads.

deepspeed.pt.fp16_unfused_optimizer module

Copyright 2019 The Microsoft DeepSpeed Team

Copyright NVIDIA/apex. This file is adapted from FP16_Optimizer in NVIDIA/apex.

class deepspeed.pt.fp16_unfused_optimizer.FP16_UnfusedOptimizer(init_optimizer, static_loss_scale=1.0, dynamic_loss_scale=False, dynamic_loss_args=None, verbose=True, mpu=None, clip_grad=0.0, fused_lamb_legacy=False)

Bases: object

FP16 Optimizer without weight fusion, to support the LAMB optimizer.

For a usage example, see the DeepSpeed V2 tutorial (TODO).

backward(loss)

backward performs the following steps:

  1. fp32_loss = loss.float()
  2. scaled_loss = fp32_loss*loss_scale
  3. scaled_loss.backward(), which accumulates scaled gradients into the .grad attributes of the model’s fp16 leaves
load_state_dict(state_dict, load_optimizer_states=True)

Loads a state_dict created by an earlier call to state_dict(). If fp16_optimizer_instance was constructed from some init_optimizer, whose parameters in turn came from model, it is expected that the user will call model.load_state_dict() before fp16_optimizer_instance.load_state_dict() is called. Example:

model = torch.nn.Linear(D_in, D_out).cuda().half()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
optimizer = FP16_Optimizer(optimizer, static_loss_scale = 128.0)
...
checkpoint = torch.load("saved.pth")
model.load_state_dict(checkpoint['model'])
optimizer.load_state_dict(checkpoint['optimizer'])
param_groups
state
state_dict()

Returns a dict containing the current state of this FP16_Optimizer instance. This dict contains attributes of FP16_Optimizer, as well as the state_dict of the contained PyTorch optimizer. Example:

checkpoint = {}
checkpoint['model'] = model.state_dict()
checkpoint['optimizer'] = optimizer.state_dict()
torch.save(checkpoint, "saved.pth")
step(closure=None)

Closures are not supported.

step_fused_lamb(closure=None)

Closures are not supported.

unscale_and_clip_grads(norm_groups, apply_scale=True)
zero_grad(set_grads_to_None=True)

Zero FP16 parameter grads.

deepspeed.pt.loss_scaler module

class deepspeed.pt.loss_scaler.DynamicLossScaler(init_scale=4294967296, scale_factor=2.0, scale_window=1000, min_scale=1, delayed_shift=1, consecutive_hysteresis=False)

Bases: deepspeed.pt.loss_scaler.LossScalerBase

Class that manages dynamic loss scaling. It is recommended to use DynamicLossScaler indirectly, by supplying dynamic_loss_scale=True to the constructor of FP16_Optimizer. However, it’s important to understand how DynamicLossScaler operates, because the default options can be changed using the dynamic_loss_args argument to FP16_Optimizer’s constructor.

Loss scaling is designed to combat the problem of underflowing gradients encountered after long periods of training fp16 networks. Dynamic loss scaling begins by attempting a very high loss scale. Ironically, this may result in OVERflowing gradients. If overflowing gradients are encountered, DynamicLossScaler informs FP16_Optimizer that an overflow has occurred. FP16_Optimizer then skips the update step for this particular iteration/minibatch, and DynamicLossScaler adjusts the loss scale to a lower value. If a certain number of iterations occur without overflowing gradients detected, DynamicLossScaler increases the loss scale once more. In this way DynamicLossScaler attempts to “ride the edge” of always using the highest loss scale possible without incurring overflow.

Parameters:
  • init_scale (float, optional, default=2**32) – Initial loss scale attempted by DynamicLossScaler.
  • scale_factor (float, optional, default=2.0) – Factor used when adjusting the loss scale. If an overflow is encountered, the loss scale is readjusted to loss_scale/scale_factor. If scale_window consecutive iterations take place without an overflow, the loss scale is readjusted to loss_scale*scale_factor.
  • scale_window (int, optional, default=1000) – Number of consecutive iterations without an overflow to wait before increasing the loss scale.
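
An illustrative sketch of the adjustment rule described above; this is not the class’s implementation, only a toy model of the documented scale_factor/scale_window behavior.

class ToyDynamicScaler:
    """Toy model of dynamic loss scaling; not DeepSpeed's implementation."""

    def __init__(self, init_scale=2.0 ** 32, scale_factor=2.0, scale_window=1000):
        self.loss_scale = init_scale
        self.scale_factor = scale_factor
        self.scale_window = scale_window
        self.good_steps = 0

    def update_scale(self, overflow):
        if overflow:
            self.loss_scale /= self.scale_factor       # back off; the optimizer skips this step
            self.good_steps = 0
        else:
            self.good_steps += 1
            if self.good_steps % self.scale_window == 0:
                self.loss_scale *= self.scale_factor   # grow again after a clean window

In practice the scaler is configured indirectly, by passing dynamic_loss_scale=True (and optionally dynamic_loss_args) to FP16_Optimizer, as described above.
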
has_overflow_serial(params)
update_scale(overflow)
class deepspeed.pt.loss_scaler.LossScaler(scale=1)

Bases: deepspeed.pt.loss_scaler.LossScalerBase

Class that manages a static loss scale. This class is intended to interact with FP16_Optimizer, and should not be directly manipulated by the user.

Use of LossScaler is enabled via the static_loss_scale argument to FP16_Optimizer’s constructor.

Parameters:
  • scale (float, optional, default=1.0) – The loss scale.
has_overflow(params)
class deepspeed.pt.loss_scaler.LossScalerBase(cur_scale)

Bases: object

Base class for a loss scaler.

backward(loss, retain_graph=False)
loss_scale
scale_gradient(module, grad_in, grad_out)
update_scale(overflow)
deepspeed.pt.loss_scaler.to_python_float(t)

Module contents