The transformer kernel API in DeepSpeed can be used to create BERT transformer layers for more efficient pre-training and fine-tuning. It covers the transformer layer configuration and the initialization of the transformer layer module.
Here we present the transformer kernel API. Please see the BERT pre-training tutorial for usage details.
DeepSpeed Transformer Config
- class deepspeed.DeepSpeedTransformerConfig(batch_size=-1, hidden_size=-1, intermediate_size=-1, heads=-1, attn_dropout_ratio=-1, hidden_dropout_ratio=-1, num_hidden_layers=-1, initializer_range=-1, layer_norm_eps=1e-12, local_rank=-1, seed=-1, fp16=False, pre_layer_norm=True, normalize_invertible=False, gelu_checkpoint=False, adjust_init_range=True, attn_dropout_checkpoint=False, stochastic_mode=False, return_tuple=False, training=True)
Initialize the DeepSpeed Transformer Config.
batch_size – The maximum batch size used for running the kernel on each GPU
hidden_size – The hidden size of the transformer layer
intermediate_size – The intermediate size of the feed-forward part of transformer layer
heads – The number of heads in the self-attention of the transformer layer
attn_dropout_ratio – The ratio of dropout for the attention’s output
hidden_dropout_ratio – The ratio of dropout for the transformer’s output
num_hidden_layers – The number of transformer layers
initializer_range – BERT model’s initializer range for initializing parameter data
local_rank – Optional: The rank of the GPU running the transformer kernel. It is not required if the model has already set the current device; otherwise, set it so that the transformer kernel runs on the right device
seed – The random seed for the dropout layers
fp16 – Enable half-precision computation
pre_layer_norm – Select between Pre-LN or Post-LN transformer architecture
normalize_invertible – Optional: Enable invertible LayerNorm execution (dropping the input activation), default is False
gelu_checkpoint – Optional: Enable checkpointing of Gelu activation output to save memory, default is False
adjust_init_range – Optional: Set to True (default) if the model adjusts the initial weight values of its self-attention output and layer output; False leaves initializer_range unchanged. The adjustment is:
output_std = self.config.initializer_range / math.sqrt(2.0 * num_layers)
attn_dropout_checkpoint – Optional: Enable checkpointing of attention dropout to save memory, default is False
stochastic_mode – Enable for higher performance. Note that this flag introduces some non-determinism and can produce different results across runs. In our experience, enabling it does not affect pre-training tasks such as BERT, which still reach a high accuracy level. For downstream tasks such as fine-tuning, we recommend turning it off so that results can be reproduced through the regular kernel execution.
return_tuple – Enable if using the return_tuple interface style for sending out the forward results.
training – Enable for training rather than inference.
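The parameters above can be collected into keyword arguments before the config object is built. The following is a minimal sketch, not a reference setup: the BERT-large-like sizes, the seed, and the helper names `make_config_kwargs`, `adjusted_output_std`, and `build_config` are assumptions made for illustration. Only `build_config` touches DeepSpeed, and it imports the library lazily so the rest of the sketch runs without it.

```python
import math


def make_config_kwargs(stochastic_mode: bool) -> dict:
    """Keyword arguments for deepspeed.DeepSpeedTransformerConfig.

    The numeric values are assumed, BERT-large-like placeholders.
    """
    return dict(
        batch_size=8,                # maximum batch size per GPU
        hidden_size=1024,
        intermediate_size=4096,
        heads=16,
        attn_dropout_ratio=0.1,
        hidden_dropout_ratio=0.1,
        num_hidden_layers=24,
        initializer_range=0.02,
        local_rank=-1,               # left unset: the model is assumed to select the device
        seed=1234,
        fp16=True,
        pre_layer_norm=True,
        stochastic_mode=stochastic_mode,
    )


def adjusted_output_std(initializer_range: float, num_layers: int) -> float:
    """The scaling documented for adjust_init_range=True."""
    return initializer_range / math.sqrt(2.0 * num_layers)


def build_config(kwargs: dict):
    """Construct the real config object; requires DeepSpeed to be installed."""
    from deepspeed import DeepSpeedTransformerConfig  # lazy import (hypothetical helper)
    return DeepSpeedTransformerConfig(**kwargs)


# Pre-training can tolerate the non-determinism of stochastic_mode;
# fine-tuning keeps it off for reproducibility, as recommended above.
pretrain_kwargs = make_config_kwargs(stochastic_mode=True)
finetune_kwargs = make_config_kwargs(stochastic_mode=False)

print(adjusted_output_std(pretrain_kwargs["initializer_range"],
                          pretrain_kwargs["num_hidden_layers"]))
```

Keeping the arguments in a plain dictionary makes it easy to derive a fine-tuning configuration from a pre-training one by changing a single flag.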
DeepSpeed Transformer Layer
- class deepspeed.DeepSpeedTransformerLayer(config, initial_weights=None, initial_biases=None)
Initialize the DeepSpeed Transformer Layer.
- Static variable:
layer_id: The layer-index counter starting from 0 and incrementing by 1 every time a layer object is instantiated, e.g. if a model has 24 transformer layers, layer_id goes from 0 to 23.
config – An object of DeepSpeedTransformerConfig
initial_weights – Optional: Only used for unit tests
initial_biases – Optional: Only used for unit tests
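The behavior of the static layer_id counter can be illustrated with a small stand-in class. This is a sketch of the counting pattern only: `CountingLayer` and its `index` attribute are hypothetical names, and the class is not DeepSpeed's implementation.

```python
class CountingLayer:
    """Hypothetical stand-in that mimics a class-level layer_id counter."""

    layer_id = 0  # static counter shared by every instance of the class

    def __init__(self, config=None):
        self.config = config
        # Record the index assigned to this layer, then advance the shared counter.
        self.index = CountingLayer.layer_id
        CountingLayer.layer_id += 1


# Instantiating 24 layers assigns the indices 0 through 23.
layers = [CountingLayer() for _ in range(24)]
ids = [layer.index for layer in layers]
print(ids[0], ids[-1], CountingLayer.layer_id)
```

Because the counter lives on the class rather than on an instance, in this sketch it keeps increasing if more layers are created later in the same process.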