Transformers are a deep learning architecture built around attention mechanisms, introduced in "Attention Is All You Need" (Vaswani et al., 2017). They revolutionized NLP by allowing sequences to be processed in parallel, modeling longer-range context, and learning richer representations than earlier recurrent models. Modern language models (GPT, BERT, Claude) are all built on the Transformer architecture. Core concepts include self-attention (each token attends to every other token in the sequence), multi-head attention (several attention patterns computed in parallel), position-wise feedforward networks, positional encodings, and the empirical scaling laws that relate model size, data, and compute to performance.
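To make the self-attention idea concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention. The function and variable names, shapes, and the random example data are illustrative assumptions, not taken from any particular library or from the original paper's code.

```python
# Minimal sketch of single-head scaled dot-product self-attention (illustrative only).
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_k) projection matrices."""
    q = x @ w_q  # queries
    k = x @ w_k  # keys
    v = x @ w_v  # values
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                  # (seq_len, seq_len) attention logits
    scores -= scores.max(axis=-1, keepdims=True)     # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ v                               # each output is a weighted sum of values

# Hypothetical example: 5 tokens, model dimension 8, projected to d_k = 4.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 4)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # (5, 4)
```

Multi-head attention repeats this computation with several independent projection matrices in parallel and concatenates the per-head outputs, letting the model capture multiple attention patterns at once.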