# Hands-on Tensor Quantization
## Packing for inference
```python
# Converting from the exllamav2 kernel layout to the Marlin kernel layout:
#   auto_gptq.nn_modules.qlinear.qlinear_exllamav2.QuantLinear
#   -> auto_gptq.nn_modules.qlinear.qlinear_marlin.QuantLinear
def convert_to_marlin(...):
    # bits == 4
    # module: auto_gptq.nn_modules.qlinear.qlinear_exllamav2.QuantLinear
    #   module.qweight: [infeatures // 8, outfeatures]
    #   (dequantized float weight: [infeatures, outfeatures])
    # Marlin buffers after repacking:
    #   B (marlin_repacked_weight): [infeatures // 16, outfeatures * 16 // 8]  torch.int
    #   s (scales):                 [infeatures // group_size, outfeatures]    torch.half
    #   workspace:                  [outfeatures // 128 * 16]                  torch.int
```
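
To make the shape bookkeeping concrete, here is a minimal, self-contained sketch of how eight 4-bit values can be packed into one `int32` along the input dimension, plus a check that the exllamav2 layout and the Marlin layout hold the same number of 4-bit elements. The helper name `pack_int4` and the nibble ordering are illustrative assumptions, not the actual auto_gptq or Marlin packing code (Marlin additionally interleaves the weights into tiles, which this sketch ignores).

```python
# Illustrative 4-bit packing sketch; NOT the real auto_gptq / Marlin kernel code.
import torch

def pack_int4(w_int4: torch.Tensor) -> torch.Tensor:
    """Pack a [infeatures, outfeatures] tensor of values in [0, 15] into a
    [infeatures // 8, outfeatures] int32 tensor (8 nibbles per int32).
    The nibble order (row i of each group of 8 goes into bits 4*i..4*i+3)
    is an assumption made for this sketch."""
    infeatures, outfeatures = w_int4.shape
    assert infeatures % 8 == 0
    q = w_int4.to(torch.int32).reshape(infeatures // 8, 8, outfeatures)
    packed = torch.zeros(infeatures // 8, outfeatures, dtype=torch.int32)
    for i in range(8):  # place each of the 8 values into its 4-bit slot
        packed |= q[:, i, :] << (4 * i)
    return packed

infeatures, outfeatures, group_size = 128, 256, 32
w_int4 = torch.randint(0, 16, (infeatures, outfeatures))
qweight = pack_int4(w_int4)  # [infeatures // 8, outfeatures], like module.qweight
s = torch.rand(infeatures // group_size, outfeatures, dtype=torch.half)  # group scales

# Sanity check: both layouts store all infeatures * outfeatures 4-bit values,
# 8 per int32, so the repack only rearranges them:
#   exllamav2: [infeatures // 8, outfeatures]
#   Marlin:    [infeatures // 16, outfeatures * 16 // 8]
assert (infeatures // 8) * outfeatures * 8 == infeatures * outfeatures
assert (infeatures // 16) * (outfeatures * 16 // 8) * 8 == infeatures * outfeatures
```

The shape arithmetic is the point: halving the row count (`// 8` to `// 16`) while doubling the column count (`outfeatures` to `outfeatures * 16 // 8`) keeps the element count fixed, so Marlin's `B` is a pure re-layout of the packed weights into the tile order its kernel reads.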