llama.cpp is a high-performance inference engine for large language models, written in C/C++ and optimized for CPU inference. It is built on GGML, a lightweight tensor library by Georgi Gerganov, and loads quantized models in the GGUF format (the successor to the original GGML file format), dramatically reducing memory and compute requirements. llama.cpp makes it possible to run billion-parameter models on laptops, servers without GPUs, and embedded devices. It is the foundation for popular local LLM tools such as Ollama and GPT4All, and is widely used by developers building privacy-first, edge-deployed AI applications. The project is open-source and continuously optimized; hardware-accelerated backends (Metal, CUDA, Vulkan, and others) are actively maintained and extended.
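To make the memory savings from quantization concrete, here is a rough back-of-the-envelope sketch. The bits-per-weight figures are approximations (quantization schemes like Q4_K_M keep some tensors at higher precision, and this ignores the KV cache and activations), but they show why a 7B model that needs ~14 GB in FP16 can fit in a few gigabytes once quantized:

```python
# Rough estimate of weight memory at different precisions.
# Bits-per-weight values are approximate averages, not exact
# per-tensor figures; real usage also includes KV cache overhead.
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    return n_params * bits_per_weight / 8 / 1e9

n = 7e9  # a 7B-parameter model
for name, bits in [("FP16", 16), ("Q8_0", 8.5), ("Q4_K_M", 4.5)]:
    print(f"{name:7s} ~{weight_memory_gb(n, bits):.1f} GB")
# FP16    ~14.0 GB
# Q8_0    ~7.4 GB
# Q4_K_M  ~3.9 GB
```

At roughly 4.5 bits per weight, the quantized model fits comfortably in the RAM of an ordinary laptop, which is what makes local CPU inference practical.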