Inference Setup

The entrypoint for inference with DeepSpeed is deepspeed.init_inference().

Example usage:

engine = deepspeed.init_inference(model=net)
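The returned engine is itself an nn.Module, so it can stand in for the original model. A minimal continuation of the example above, where net and batch are placeholders for the user's model and its usual inputs:

# `engine` wraps `net`; calling it runs the optimized forward pass.
# `batch` stands for whatever inputs `net` normally accepts.
output = engine(batch)
# The underlying (possibly kernel-injected) model remains reachable as engine.module.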
deepspeed.init_inference(model, mp_size=1, mpu=None, checkpoint=None, module_key='module', dtype=None, injection_policy=None, replace_method='auto', quantization_setting=None)

Initialize the DeepSpeed InferenceEngine.

Parameters:
  • model – Required: the nn.Module to run inference on, before any wrappers are applied.
  • mp_size – Optional: Desired model parallel size, default is 1 meaning no model parallelism.
  • mpu – Optional: A model parallelism unit object that implements get_{model,data}_parallel_{rank,group,world_size}()
  • checkpoint – Optional: Path to deepspeed compatible checkpoint or path to JSON with load policy.
  • dtype – Optional: Desired model data type, will convert model to this type. Supported target types: torch.half, torch.int8, torch.float
  • injection_policy – Optional: Dictionary mapping a client nn.Module to its corresponding injection policy. e.g., {BertLayer : deepspeed.inference.HFBertLayerPolicy}
  • replace_method – Optional: If ‘auto’, DeepSpeed will automatically try to replace model modules with its optimized versions. If an injection_policy is set, it overrides the automatic replacement behavior.
  • quantization_setting – Optional: Quantization settings used for quantizing the model with MoQ. The setting can be a single value or a tuple. If a single value is passed, it is taken as the number of groups used in quantization. A tuple is passed to indicate extra grouping for the MLP part of a Transformer layer; e.g., (True, 8) quantizes the model using 8 groups for the whole network except the MLP part, which uses extra grouping.
Returns:

A deepspeed.InferenceEngine wrapped model.
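Putting the parameters above together, here is a minimal end-to-end sketch. It assumes a CUDA device and the Hugging Face transformers package; the gpt2 checkpoint is only illustrative, and any torch nn.Module can be wrapped the same way:

import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative Hugging Face model and tokenizer.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Wrap the model: single GPU (mp_size=1), fp16 weights, automatic
# replacement of supported modules with DeepSpeed's optimized versions.
engine = deepspeed.init_inference(model,
                                  mp_size=1,
                                  dtype=torch.half,
                                  replace_method='auto')

# The engine is called exactly like the original model.
inputs = tokenizer("DeepSpeed inference", return_tensors="pt").to("cuda")
with torch.no_grad():
    logits = engine(**inputs).logits
print(logits.shape)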