Transformer Kernels

The transformer kernel API in DeepSpeed can be used to create BERT transformer layers for more efficient pre-training and fine-tuning. It covers the transformer layer configuration and the initialization of the transformer layer module.

Here we present the transformer kernel API. Please see the BERT pre-training tutorial for usage details.

DeepSpeed Transformer Config

class deepspeed.DeepSpeedTransformerConfig(batch_size=-1, hidden_size=-1, intermediate_size=-1, heads=-1, attn_dropout_ratio=-1, hidden_dropout_ratio=-1, num_hidden_layers=-1, initializer_range=-1, layer_norm_eps=1e-12, local_rank=-1, seed=-1, fp16=False, pre_layer_norm=True, normalize_invertible=False, gelu_checkpoint=False, adjust_init_range=True, attn_dropout_checkpoint=False, stochastic_mode=False, huggingface=False, training=True)

Initialize the DeepSpeed Transformer Config.

Parameters:
  • batch_size – The maximum batch size used for running the kernel on each GPU
  • max_seq_length – The sequence length of the model being trained with DeepSpeed
  • hidden_size – The hidden size of the transformer layer
  • intermediate_size – The intermediate size of the feed-forward part of the transformer layer
  • heads – The number of heads in the self-attention of the transformer layer
  • attn_dropout_ratio – The ratio of dropout for the attention’s output
  • hidden_dropout_ratio – The ratio of dropout for the transformer’s output
  • num_hidden_layers – The number of transformer layers
  • initializer_range – BERT model’s initializer range for initializing parameter data
  • local_rank – Optional: The rank of the GPU running the transformer kernel. It is not required if the model has already set the current device; otherwise, set it so that the transformer kernel runs on the right device
  • seed – The random seed for the dropout layers
  • fp16 – Enable half-precision computation
  • pre_layer_norm – Select between the Pre-LN and Post-LN transformer architectures
  • normalize_invertible – Optional: Enable invertible LayerNorm execution (dropping the input activation), default is False
  • gelu_checkpoint – Optional: Enable checkpointing of Gelu activation output to save memory, default is False
  • adjust_init_range

    Optional: Set to True (default) if the model adjusts the initial weight values of its self-attention output and layer output; False leaves initializer_range unchanged. The adjustment is:

    output_std = self.config.initializer_range / math.sqrt(2.0 * num_layers)
  • attn_dropout_checkpoint – Optional: Enable checkpointing of attention dropout to save memory, default is False
  • stochastic_mode – Enable for higher performance. This flag introduces some non-determinism, so results can differ from run to run. In our experience, pre-training tasks such as BERT are not affected and still reach a high accuracy level. For downstream tasks such as fine-tuning, we recommend turning it off so that results can be reproduced through the regular kernel execution.
  • huggingface – Enable if using the HuggingFace interface style for returning the forward results.
  • training – Enable for training rather than inference.
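The parameters above can be gathered as in the following minimal sketch. The helper names (`kernel_config_kwargs`, `adjusted_output_std`, `build_config`), the `phase` switch, and all numeric values are assumptions made for illustration and are not part of the DeepSpeed API; only the keyword names come from the signature above. The helpers run without DeepSpeed installed, and `build_config` imports it lazily.

```python
import math


def kernel_config_kwargs(phase, num_hidden_layers=24, local_rank=-1):
    """Collect keyword arguments for DeepSpeedTransformerConfig.

    `phase` ("pretrain" or "finetune") is a hypothetical switch used only in
    this sketch; the numeric values are illustrative, not recommendations.
    """
    if phase not in ("pretrain", "finetune"):
        raise ValueError(f"unknown phase: {phase!r}")
    return dict(
        batch_size=8,                # maximum per-GPU batch size
        hidden_size=1024,
        intermediate_size=4096,      # width of the feed-forward part
        heads=16,
        attn_dropout_ratio=0.1,
        hidden_dropout_ratio=0.1,
        num_hidden_layers=num_hidden_layers,
        initializer_range=0.02,
        local_rank=local_rank,       # -1 is the default from the signature
        seed=1234,
        fp16=True,
        pre_layer_norm=True,
        # Non-deterministic fast path: on for pre-training, off for fine-tuning
        stochastic_mode=(phase == "pretrain"),
    )


def adjusted_output_std(initializer_range, num_layers):
    """Apply the adjust_init_range formula from the parameter list."""
    return initializer_range / math.sqrt(2.0 * num_layers)


def build_config(kwargs):
    """Create the config object; requires DeepSpeed to be installed."""
    import deepspeed  # imported lazily so the helpers above run without it
    return deepspeed.DeepSpeedTransformerConfig(**kwargs)


pretrain_kwargs = kernel_config_kwargs("pretrain")
finetune_kwargs = kernel_config_kwargs("finetune")
print(pretrain_kwargs["stochastic_mode"], finetune_kwargs["stochastic_mode"])
print(adjusted_output_std(0.02, 24))  # std used when adjust_init_range=True
```

With DeepSpeed installed, `build_config(pretrain_kwargs)` would pass these values to the constructor. Keeping the arguments in a plain dictionary makes it easy to change a single flag, such as `stochastic_mode`, between pre-training and fine-tuning runs.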

DeepSpeed Transformer Layer

class deepspeed.DeepSpeedTransformerLayer(config, initial_weights=None, initial_biases=None)

Initialize the DeepSpeed Transformer Layer.

Static variable:
layer_id: The layer-index counter starting from 0 and incrementing by 1 every time a layer object is instantiated, e.g. if a model has 24 transformer layers, layer_id goes from 0 to 23.
Parameters:
  • config – An object of DeepSpeedTransformerConfig
  • initial_weights – Optional: Only used for unit tests
  • initial_biases – Optional: Only used for unit tests
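The layer_id counter described above can be illustrated with a small stand-in class. `CountingLayer` and `build_layer_stack` are hypothetical names used only for this sketch; they mimic the documented counting behavior and say nothing about how DeepSpeed stores the index internally.

```python
class CountingLayer:
    """Hypothetical stand-in (not a DeepSpeed class) that mimics the
    class-level layer_id counter."""

    layer_id = 0  # shared by all instances, starts at 0

    def __init__(self, config):
        self.config = config
        # Record this layer's index, then advance the shared counter
        self.index = CountingLayer.layer_id
        CountingLayer.layer_id += 1


def build_layer_stack(config, num_layers, layer_cls=CountingLayer):
    """Instantiate `num_layers` layers that all share one config object."""
    return [layer_cls(config) for _ in range(num_layers)]


layers = build_layer_stack(config=None, num_layers=24)
indices = [layer.index for layer in layers]
print(indices[0], indices[-1])  # prints: 0 23
```

On a machine with DeepSpeed and a CUDA device, the same helper could presumably be called with `layer_cls=deepspeed.DeepSpeedTransformerLayer` and a `DeepSpeedTransformerConfig` object. Because the counter lives on the class, it keeps incrementing across every layer created in the process, not only within one model.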