Quantization Utilities¶
Reference Implementation Methods¶
-
template<typename T, layout_t LAYOUT = layout_t::KCX>
void QuantizeGroupwise(const float *src, int K, int C, int X, int G, const float *scales, const std::int32_t *zero_points, T *dst)¶

Quantize floating point data in src to type T. A usage sketch follows the parameter lists below.

- Template Parameters:
    T – output quantized data type (int8_t, uint8_t, and int32_t are supported)
    LAYOUT – layout of the input tensor in src (KCX and KXC are supported). KCX corresponds to KCRS or KCTRS (for weight tensors with a time dimension); KXC corresponds to KRSC or KTRSC (for weight tensors with a time dimension).
- Parameters:
    K – Output channels for weight tensors
    C – Number of channels
    X – R*S or T*R*S
    G – Groups. If G == C, the function performs channelwise quantization; if 1 < G < C, groupwise quantization; if G == 1, per-tensor quantization.
    scales – floating point scales; size should equal G
    zero_points – zero points (should be representable in type T); size should equal G
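A minimal usage sketch follows. The shapes, group count, scale values, and zero points are illustrative assumptions, as is the fbgemm/QuantUtils.h header name; only the QuantizeGroupwise signature itself comes from the documentation above:

    #include <cstdint>
    #include <vector>
    #include "fbgemm/QuantUtils.h" // assumed public header

    void quantize_conv_weight_example() {
      // Illustrative KCRS weight tensor: K=8 output channels, C=4 input
      // channels, R=S=3 spatial taps, so X = R*S = 9.
      const int K = 8, C = 4, X = 9;
      const int G = 4; // G == C, i.e. channelwise quantization
      std::vector<float> src(static_cast<size_t>(K) * C * X, 0.5f);

      // One scale and one zero point per group; zero points must be
      // representable in the output type (int8_t here).
      std::vector<float> scales(G, 0.02f);
      std::vector<std::int32_t> zero_points(G, 0);

      std::vector<std::int8_t> dst(src.size());
      fbgemm::QuantizeGroupwise<std::int8_t, fbgemm::layout_t::KCX>(
          src.data(), K, C, X, G, scales.data(), zero_points.data(),
          dst.data());
    }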
-
template<typename T>
void FusedQuantizeDequantize(const float *src, float *dst, std::int64_t len, const TensorQuantizationParams &qparams, int thread_id = 0, int num_threads = 1, float noise_ratio = 0.0f)¶

Fused integer quantization/dequantization kernel to accelerate quantization-aware training. Quantizes fp32 values in src to (u)int8 using the provided qparams, then dequantizes the quantized integer values back into fp32, as sketched below.
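The following scalar sketch shows the round trip this kernel fuses. QParamsSketch is a hypothetical stand-in for TensorQuantizationParams, and threading (thread_id, num_threads) and noise_ratio are omitted for brevity:

    #include <algorithm>
    #include <cmath>
    #include <cstdint>

    // Scalar sketch of the fused quantize-dequantize ("fake quantization")
    // round trip. QParamsSketch is a stand-in for the library's
    // TensorQuantizationParams; the real layout is not assumed here.
    struct QParamsSketch {
      float scale;
      std::int32_t zero_point;
      int precision; // e.g. 8 for (u)int8
    };

    void fused_quantize_dequantize_ref(const float* src, float* dst,
                                       std::int64_t len,
                                       const QParamsSketch& qp) {
      const std::int32_t qmin = 0;                       // uint8-style range
      const std::int32_t qmax = (1 << qp.precision) - 1; // 255 for 8 bits
      for (std::int64_t i = 0; i < len; ++i) {
        // Quantize: scale, round to nearest, shift by zero point, saturate.
        std::int32_t q = static_cast<std::int32_t>(
                             std::nearbyint(src[i] / qp.scale)) + qp.zero_point;
        q = std::min(qmax, std::max(qmin, q));
        // Dequantize back to fp32; dst - src is the quantization error that
        // quantization-aware training learns to tolerate.
        dst[i] = qp.scale * static_cast<float>(q - qp.zero_point);
      }
    }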
-
template<typename InputType>
void FloatOrHalfToFusedNBitRowwiseQuantizedSBHalf(int bit_rate, const InputType *input, size_t input_rows, int input_columns, std::uint8_t *output)¶

Convert float (fp32 or fp16) inputs to rowwise quantized outputs. bit_rate specifies the number of bits per quantized element. Each row's scale and bias are stored in fp16 and fused at the end of the row itself; a sketch of the resulting row layout follows the parameter list.

- Parameters:
    bit_rate – can be 2, 4, or 8
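A sketch of the fused row layout this description implies: the packed n-bit payload followed by an fp16 scale and an fp16 bias. The byte-rounding below is an assumption for illustration, not taken from the library:

    #include <cstddef>
    #include <cstdint>

    // Bytes occupied by one fused output row: input_columns elements packed
    // at bit_rate bits each, then an fp16 scale and an fp16 bias.
    std::size_t fused_row_bytes(int bit_rate, int input_columns) {
      // Packed payload, rounded up to whole bytes (assumed rounding).
      std::size_t data_bytes =
          (static_cast<std::size_t>(input_columns) * bit_rate + 7) / 8;
      return data_bytes + 2 * sizeof(std::uint16_t); // + fp16 scale and bias
    }

    // Example: 64 columns at bit_rate = 4 pack into 32 payload bytes, so
    // each output row occupies 32 + 4 = 36 bytes.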
AVX-2 Implementation Methods¶
-
uint32_t Xor128(void)¶
Random number generator in [0, 9] based on Marsaglia's xorshift paper; a sketch of the classic xorshift128 recurrence follows.
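A sketch of the classic xorshift128 recurrence, using the example seeds from the paper; the library's actual seeding and state handling may differ:

    #include <cstdint>

    // Classic xorshift128: 128 bits of state advanced with shifts and XORs.
    std::uint32_t xor128_sketch() {
      static std::uint32_t x = 123456789, y = 362436069,
                           z = 521288629, w = 88675123; // paper's seeds
      std::uint32_t t = x ^ (x << 11);
      x = y; y = z; z = w;
      w = w ^ (w >> 19) ^ (t ^ (t >> 8));
      return w;
    }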
-
void FindMinMax(const float *m, float *min, float *max, int64_t len)¶
Find the min and max value in a float matrix.
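A scalar reference of the same reduction, which the AVX2 implementation vectorizes; the len <= 0 behavior is an assumption for illustration:

    #include <cstdint>

    // Single pass over the data, tracking the running min and max.
    void find_min_max_ref(const float* m, float* min, float* max,
                          std::int64_t len) {
      if (len <= 0) { *min = *max = 0.0f; return; } // assumed edge case
      float lo = m[0], hi = m[0];
      for (std::int64_t i = 1; i < len; ++i) {
        lo = m[i] < lo ? m[i] : lo;
        hi = m[i] > hi ? m[i] : hi;
      }
      *min = lo;
      *max = hi;
    }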
-
template<bool A_SYMMETRIC, bool B_SYMMETRIC, QuantizationGranularity Q_GRAN, bool HAS_BIAS, bool FUSE_RELU, typename BIAS_TYPE = std::int32_t, bool DIRECT = false>
void requantizeOutputProcessingAvx2(std::uint8_t *out, const std::int32_t *inp, const block_type_t &block, int ld_out, int ld_in, const requantizationParams_t<BIAS_TYPE> &r)¶

Requantize with AVX2; the bias addition is fused into the kernel. A scalar sketch of the requantization math follows.
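A scalar sketch of the per-element requantization this kernel vectorizes. All parameter names are illustrative stand-ins for fields the real kernel receives via requantizationParams_t, and the zero-point bookkeeping assumes row_sum and col_sum are plain sums of the quantized A row and B column:

    #include <algorithm>
    #include <cmath>
    #include <cstdint>

    std::uint8_t requantize_ref(
        std::int32_t acc,     // raw int32 accumulator from the GEMM
        std::int32_t row_sum, // sum of quantized A over this output's row
        std::int32_t col_sum, // sum of quantized B over this output's column
        std::int32_t k,       // reduction depth of the GEMM
        std::int32_t a_zp,    // activation (A) zero point
        std::int32_t b_zp,    // weight (B) zero point
        std::int32_t bias,    // fused bias term
        float multiplier,     // a_scale * b_scale / c_scale
        std::int32_t c_zp,    // output (C) zero point
        bool fuse_relu) {
      // Expand (Aq - a_zp)·(Bq - b_zp): remove the zero-point cross terms,
      // restore the constant term, then add the fused bias.
      std::int32_t adjusted = acc - a_zp * col_sum - b_zp * row_sum
                                  + k * a_zp * b_zp + bias;
      // Rescale into the output domain, round, shift by the output zero point.
      std::int32_t q = static_cast<std::int32_t>(
                           std::nearbyint(adjusted * multiplier)) + c_zp;
      if (fuse_relu) q = std::max(q, c_zp); // ReLU in the quantized domain
      return static_cast<std::uint8_t>(std::min(255, std::max(0, q)));
    }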
AVX-512 Implementation Methods¶
-
template<bool A_SYMMETRIC, bool B_SYMMETRIC, QuantizationGranularity Q_GRAN, bool HAS_BIAS, bool FUSE_RELU, int C_PER_G, typename BIAS_TYPE = std::int32_t>
void requantizeOutputProcessingGConvAvx512(std::uint8_t *out, const std::int32_t *inp, const block_type_t &block, int ld_out, int ld_in, const requantizationParams_t<BIAS_TYPE> &r)¶

Requantize with AVX-512 (grouped-convolution variant; C_PER_G is the number of channels per group).