Quantization Utilities¶
Reference Implementation Methods¶
-
template<typename T, layout_t LAYOUT = layout_t::KCX>
void QuantizeGroupwise(const float *src, int K, int C, int X, int G, const float *scales, const std::int32_t *zero_points, T *dst)¶

Quantize floating point data in src to type T. A usage sketch follows the parameter lists below.

- Template Parameters:
    T – output quantized data type (int8_t, uint8_t, and int32_t are supported)
    LAYOUT – layout of the input tensor in src (KCX and KXC are supported). KCX corresponds to KCRS or KCTRS (for weight tensors with a time dimension); KXC corresponds to KRSC or KTRSC (for weight tensors with a time dimension).
- Parameters:
    K – Output channels for weight tensors
    C – Number of channels
    X – R*S or T*R*S
    G – Groups. If G == C, the function performs channelwise quantization; if 1 < G < C, groupwise quantization; if G == 1, per-tensor quantization.
    scales – floating point scales; size should equal G
    zero_points – zero points (should be representable in type T); size should equal G
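A minimal usage sketch follows. The shapes, group count, scale values, and zero points are illustrative assumptions, as is the fbgemm/QuantUtils.h header name; only the QuantizeGroupwise signature itself comes from the documentation above:

    #include <cstdint>
    #include <vector>
    #include "fbgemm/QuantUtils.h" // assumed public header

    void quantize_conv_weight_example() {
      // Illustrative KCRS weight tensor: K=8 output channels, C=4 input
      // channels, R=S=3 spatial taps, so X = R*S = 9.
      const int K = 8, C = 4, X = 9;
      const int G = 4; // G == C, i.e. channelwise quantization
      std::vector<float> src(static_cast<size_t>(K) * C * X, 0.5f);

      // One scale and one zero point per group; zero points must be
      // representable in the output type (int8_t here).
      std::vector<float> scales(G, 0.02f);
      std::vector<std::int32_t> zero_points(G, 0);

      std::vector<std::int8_t> dst(src.size());
      fbgemm::QuantizeGroupwise<std::int8_t, fbgemm::layout_t::KCX>(
          src.data(), K, C, X, G, scales.data(), zero_points.data(),
          dst.data());
    }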
-
template<typename T>
void FusedQuantizeDequantize(const float *src, float *dst, std::int64_t len, const TensorQuantizationParams &qparams, int thread_id = 0, int num_threads = 1, float noise_ratio = 0.0f)¶

Fused integer quantization/dequantization kernel to accelerate quantization-aware training. Quantizes fp32 values in src to (u)int8 using the provided qparams, then dequantizes the quantized integer values back into fp32, as sketched below.
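The following scalar sketch shows the round trip this kernel fuses. QParamsSketch is a hypothetical stand-in for TensorQuantizationParams, and threading (thread_id, num_threads) and noise_ratio are omitted for brevity:

    #include <algorithm>
    #include <cmath>
    #include <cstdint>

    // Scalar sketch of the fused quantize-dequantize ("fake quantization")
    // round trip. QParamsSketch is a stand-in for the library's
    // TensorQuantizationParams; the real layout is not assumed here.
    struct QParamsSketch {
      float scale;
      std::int32_t zero_point;
      int precision; // e.g. 8 for (u)int8
    };

    void fused_quantize_dequantize_ref(const float* src, float* dst,
                                       std::int64_t len,
                                       const QParamsSketch& qp) {
      const std::int32_t qmin = 0;                       // uint8-style range
      const std::int32_t qmax = (1 << qp.precision) - 1; // 255 for 8 bits
      for (std::int64_t i = 0; i < len; ++i) {
        // Quantize: scale, round to nearest, shift by zero point, saturate.
        std::int32_t q = static_cast<std::int32_t>(
                             std::nearbyint(src[i] / qp.scale)) + qp.zero_point;
        q = std::min(qmax, std::max(qmin, q));
        // Dequantize back to fp32; dst - src is the quantization error that
        // quantization-aware training learns to tolerate.
        dst[i] = qp.scale * static_cast<float>(q - qp.zero_point);
      }
    }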
-
template<typename InputType>
void FloatOrHalfToFusedNBitRowwiseQuantizedSBHalf(int bit_rate, const InputType *input, size_t input_rows, int input_columns, std::uint8_t *output)¶

Convert float (fp32 or fp16) inputs to rowwise quantized outputs. bit_rate specifies the number of bits per quantized element. Each row's scale and bias are stored in fp16 and fused at the end of the row itself; a sketch of the resulting row layout follows the parameter list.

- Parameters:
    bit_rate – can be 2, 4, or 8
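A sketch of the fused row layout this description implies: the packed n-bit payload followed by an fp16 scale and an fp16 bias. The byte-rounding below is an assumption for illustration, not taken from the library:

    #include <cstddef>
    #include <cstdint>

    // Bytes occupied by one fused output row: input_columns elements packed
    // at bit_rate bits each, then an fp16 scale and an fp16 bias.
    std::size_t fused_row_bytes(int bit_rate, int input_columns) {
      // Packed payload, rounded up to whole bytes (assumed rounding).
      std::size_t data_bytes =
          (static_cast<std::size_t>(input_columns) * bit_rate + 7) / 8;
      return data_bytes + 2 * sizeof(std::uint16_t); // + fp16 scale and bias
    }

    // Example: 64 columns at bit_rate = 4 pack into 32 payload bytes, so
    // each output row occupies 32 + 4 = 36 bytes.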
AVX-2 Implementation Methods¶
-
uint32_t Xor128(void)¶
Random number generator in [0, 9] based on Marsaglia's xorshift paper; a sketch of the classic xorshift128 recurrence follows.
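A sketch of the classic xorshift128 recurrence, using the example seeds from the paper; the library's actual seeding and state handling may differ:

    #include <cstdint>

    // Classic xorshift128: 128 bits of state advanced with shifts and XORs.
    std::uint32_t xor128_sketch() {
      static std::uint32_t x = 123456789, y = 362436069,
                           z = 521288629, w = 88675123; // paper's seeds
      std::uint32_t t = x ^ (x << 11);
      x = y; y = z; z = w;
      w = w ^ (w >> 19) ^ (t ^ (t >> 8));
      return w;
    }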
-
void FindMinMax(const float *m, float *min, float *max, int64_t len)¶
Find the min and max value in a float matrix.
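A scalar reference of the same reduction, which the AVX2 implementation vectorizes; the len <= 0 behavior is an assumption for illustration:

    #include <cstdint>

    // Single pass over the data, tracking the running min and max.
    void find_min_max_ref(const float* m, float* min, float* max,
                          std::int64_t len) {
      if (len <= 0) { *min = *max = 0.0f; return; } // assumed edge case
      float lo = m[0], hi = m[0];
      for (std::int64_t i = 1; i < len; ++i) {
        lo = m[i] < lo ? m[i] : lo;
        hi = m[i] > hi ? m[i] : hi;
      }
      *min = lo;
      *max = hi;
    }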
-
template<bool A_SYMMETRIC, bool B_SYMMETRIC, QuantizationGranularity Q_GRAN, bool HAS_BIAS, bool FUSE_RELU, typename BIAS_TYPE = std::int32_t, bool DIRECT = false>
void requantizeOutputProcessingAvx2(std::uint8_t *out, const std::int32_t *inp, const block_type_t &block, int ld_out, int ld_in, const requantizationParams_t<BIAS_TYPE> &r)¶

Requantize with AVX2; the bias addition is fused into the kernel. A scalar sketch of the requantization math follows.
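A scalar sketch of the per-element requantization this kernel vectorizes. All parameter names are illustrative stand-ins for fields the real kernel receives via requantizationParams_t, and the zero-point bookkeeping assumes row_sum and col_sum are plain sums of the quantized A row and B column:

    #include <algorithm>
    #include <cmath>
    #include <cstdint>

    std::uint8_t requantize_ref(
        std::int32_t acc,     // raw int32 accumulator from the GEMM
        std::int32_t row_sum, // sum of quantized A over this output's row
        std::int32_t col_sum, // sum of quantized B over this output's column
        std::int32_t k,       // reduction depth of the GEMM
        std::int32_t a_zp,    // activation (A) zero point
        std::int32_t b_zp,    // weight (B) zero point
        std::int32_t bias,    // fused bias term
        float multiplier,     // a_scale * b_scale / c_scale
        std::int32_t c_zp,    // output (C) zero point
        bool fuse_relu) {
      // Expand (Aq - a_zp)·(Bq - b_zp): remove the zero-point cross terms,
      // restore the constant term, then add the fused bias.
      std::int32_t adjusted = acc - a_zp * col_sum - b_zp * row_sum
                                  + k * a_zp * b_zp + bias;
      // Rescale into the output domain, round, shift by the output zero point.
      std::int32_t q = static_cast<std::int32_t>(
                           std::nearbyint(adjusted * multiplier)) + c_zp;
      if (fuse_relu) q = std::max(q, c_zp); // ReLU in the quantized domain
      return static_cast<std::uint8_t>(std::min(255, std::max(0, q)));
    }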
AVX-512 Implementation Methods¶
-
template<bool A_SYMMETRIC, bool B_SYMMETRIC, QuantizationGranularity Q_GRAN, bool HAS_BIAS, bool FUSE_RELU, int C_PER_G, typename BIAS_TYPE = std::int32_t>
void requantizeOutputProcessingGConvAvx512(std::uint8_t *out, const std::int32_t *inp, const block_type_t &block, int ld_out, int ld_in, const requantizationParams_t<BIAS_TYPE> &r)¶

Requantize with AVX-512 (grouped-convolution variant; C_PER_G is the number of channels per group).