Quantization is a technique for reducing machine learning model size and inference latency by storing numbers in lower-precision formats. A typical model uses 32-bit floating point (float32); quantization converts weights and activations to 8-bit integers (int8) or 4-bit integers (int4), shrinking the model roughly 4x or 8x respectively, usually with minimal accuracy loss. Compressed models can run on resource-constrained devices such as mobile phones, edge servers, and embedded systems: a 1 GB float32 model becomes about 250 MB in int8 or 125 MB in int4, enabling on-device inference without round trips to the cloud.
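To make the idea concrete, here is a minimal sketch of affine (scale and zero-point) int8 quantization of a single weight tensor, written with plain NumPy rather than any framework's quantization API. The function names and the choice of per-tensor affine mapping are illustrative assumptions, not a reference implementation; real toolchains typically quantize per-channel and calibrate activations separately.

```python
import numpy as np

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float, int]:
    """Affine per-tensor quantization of a float32 array to int8 (illustrative)."""
    w_min, w_max = float(weights.min()), float(weights.max())
    # Map the observed float range onto the 256 representable int8 values [-128, 127].
    scale = (w_max - w_min) / 255.0 if w_max > w_min else 1.0
    # Choose the zero point so that w_min maps to -128.
    zero_point = int(round(-128 - w_min / scale))
    q = np.clip(np.round(weights / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Recover approximate float32 values from the int8 representation."""
    return (q.astype(np.float32) - zero_point) * scale

# Example: quantize a random weight matrix and measure size and error.
w = np.random.randn(256, 256).astype(np.float32)
q, scale, zp = quantize_int8(w)
w_hat = dequantize(q, scale, zp)
print(f"size: {w.nbytes} B -> {q.nbytes} B ({w.nbytes / q.nbytes:.0f}x smaller)")
print(f"max abs reconstruction error: {np.abs(w - w_hat).max():.5f}")
```

The printed output shows the 4x size reduction directly (float32 to int8 is 4 bytes to 1 byte per value); the reconstruction error stays small because the scale is fitted to the tensor's actual value range.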