JobCannon

Model Quantization Compression

Tier 3
Category
Tech
Salary Impact
Complexity
Difficult
Used in
All careers

Quantization is a technique that reduces machine learning model size and inference latency by storing numbers in lower-precision formats. A typical model stores weights as 32-bit floats (float32). Quantization converts weights and activations to 8-bit integers (int8) or 4-bit integers (int4), shrinking the model 4x (int8) to 8x (int4) with minimal accuracy loss. Compressed models run on resource-constrained devices: mobile phones, edge servers, embedded systems. At int4, a 1GB model becomes 125MB, enabling on-device inference without cloud calls.
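The core idea can be sketched in a few lines of NumPy: pick a scale so the largest weight magnitude maps to the int8 range, round, and store the int8 values plus one float scale. This is a minimal illustration of symmetric per-tensor quantization, not any specific library's API; the function names and the random weight tensor are made up for the example.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: map float32 weights to int8."""
    scale = np.abs(w).max() / 127.0  # largest magnitude maps to +/-127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float32 values for use at inference time."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(1_000_000).astype(np.float32)  # stand-in weight tensor

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print(f"size: {w.nbytes / 1e6:.1f} MB -> {q.nbytes / 1e6:.1f} MB")  # 4x smaller
print(f"max abs error: {np.abs(w - w_hat).max():.6f}")
```

Storing int8 instead of float32 is exactly the 4x size reduction described above; the worst-case rounding error per weight is half the scale step, which is why accuracy loss stays small when weights are well distributed. Real frameworks refine this with per-channel scales, zero points for asymmetric ranges, and calibration data for activations.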