Historical Origins: Binet & Simon (1905)
Alfred Binet and Théodore Simon's 1905 "La Mesure de l'Intelligence" (The Measurement of Intelligence) initiated modern intelligence testing. France's mandatory education law required identifying students who needed specialized instruction; Binet and Simon developed brief cognitive tests (memory, reasoning, judgment) administered in a controlled sequence, producing a quantitative index of cognitive performance.
Unlike earlier phrenological approaches (which presumed that skull measurements indexed intelligence) or Cattell's mental tests emphasizing reaction time and sensory discrimination (which showed limited validity), Binet-Simon items targeted higher cognition: How many words can the child understand? Can the child follow verbal instructions?
Critically, Binet emphasized that intelligence was multifaceted and educable: test performance could improve with instruction. The scale provided an ordinal ranking (this child's performance matches that of an average 8-year-old; that child's matches that of an average 12-year-old) without claiming to measure a fixed quantity of intelligence.
Binet's 1911 revision emphasized practical utility: identifying students who would benefit from remedial instruction (below-average intelligence) and those ready for an advanced curriculum. The approach's predictive validity for academic achievement (r = .50–.70) proved substantially superior to that of previous methods, launching intelligence testing's 120-year trajectory in psychology.
Spearman's g Factor (1904)
Charles Spearman's 1904 paper in the American Journal of Psychology, "General Intelligence, Objectively Determined and Measured," introduced the concept of general intelligence (the g factor) through an early form of factor analysis. Spearman analyzed correlations among diverse cognitive measures: school grades, sensory discrimination tests, and memory tests.
He observed that every cognitive task correlated positively with every other task (no zero correlations despite the tasks' apparent diversity), suggesting an underlying common factor (general intelligence) plus task-specific components. Mathematically, Spearman proposed: performance on task i = g + s_i, where g is the general intelligence factor shared by all tasks and s_i is the ability specific to task i.
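A single common factor makes a testable prediction about the pattern of correlations. The sketch below assumes standardized scores and gives each test its own weight (loading) on g, which goes slightly beyond the simplified g + s_i formula in the text; the test names and loading values are hypothetical and are not Spearman's data. Under that assumption, the correlation between two tests is the product of their loadings, and every "tetrad difference" is zero.

```python
from itertools import combinations

# Hypothetical g loadings for four tests (illustrative values, not Spearman's data).
loadings = {"classics": 0.9, "french": 0.8, "music": 0.7, "pitch": 0.6}


def implied_correlation(a, b):
    """Under one common factor, r_ab is the product of the two g loadings."""
    return loadings[a] * loadings[b]


def tetrad_difference(a, b, c, d):
    """r_ab * r_cd - r_ac * r_bd; zero when a single factor explains all correlations."""
    return (implied_correlation(a, b) * implied_correlation(c, d)
            - implied_correlation(a, c) * implied_correlation(b, d))


# Every pair of tests correlates positively, as Spearman observed.
for a, b in combinations(loadings, 2):
    print(f"r({a}, {b}) = {implied_correlation(a, b):.2f}")

# Unpacking the dict passes its four keys in insertion order.
print("tetrad difference:", round(tetrad_difference(*loadings), 10))
```

With real data the tetrad differences are only approximately zero, and how far they depart from zero is one way to judge whether a single factor is enough.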
Spearman's historical significance lies in introducing factor analysis to psychology, a revolutionary statistical method that enables the discovery of latent constructs. Subsequent research supported g's existence and ubiquity: across diverse task sets (from reaction time to mathematical reasoning to spatial visualization), a single general factor accounts for 30–50% of the variance in scores across the tasks.
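To make the "share of variance" figure concrete: for standardized tests, the proportion of total variance explained by one factor is the mean of the squared loadings. The loadings below are hypothetical values chosen only to illustrate the arithmetic.

```python
# Hypothetical g loadings for five standardized tests (illustrative values only).
g_loadings = [0.70, 0.65, 0.60, 0.55, 0.50]


def general_factor_variance_share(loadings):
    """Fraction of total standardized variance explained by the general factor."""
    return sum(loading ** 2 for loading in loadings) / len(loadings)


share = general_factor_variance_share(g_loadings)
print(f"g explains {share:.1%} of total variance")
```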
Spearman proposed that g reflected "mental energy" or "neural efficiency"; contemporary neuroscience research (Deary et al. 2010, Nature Reviews Neuroscience) identifies gray matter volume (r = .30 with IQ) and white matter integrity (r = .26 with IQ) as biological correlates. Jensen's (1998) review documented g's heritability (h² = .50–.80 depending on age), establishing intelligence as partially genetic.
However, the non-genetic share of variance (1 − h², or roughly 20–50% given the heritability range above) leaves room for substantial developmental influence (Nisbett et al. 2012, a comprehensive review in American Psychologist, finding that environment shifts IQ by 20–30 points across the SES gradient).
Wechsler Adult Intelligence Scale-IV (2008)
David Wechsler's (1939) Wechsler-Bellevue scale revolutionized adult intelligence testing, moving beyond Binet's approach (originally designed for children) to create a comprehensive adult assessment. The 2008 revision (WAIS-IV) represents the current clinical standard, providing four composite scores: (1) Verbal Comprehension Index (vocabulary, similarities, information, comprehension), measuring crystallized intelligence (accumulated knowledge); (2) Perceptual Reasoning Index (block design, matrix reasoning, visual puzzles), measuring fluid reasoning and spatial ability; (3) Working Memory Index (digit span, arithmetic, letter-number sequencing), measuring short-term memory capacity and manipulation; (4) Processing Speed Index (coding, formerly digit symbol; symbol search), measuring perceptual speed and efficiency.
The WAIS-IV's 15-subtest structure provides a comprehensive cognitive profile: comparing the VCI (crystallized) with the PRI (fluid) reveals distinct cognitive strengths (some individuals excel at verbal knowledge while struggling with novel problem-solving, or vice versa). The Full-Scale IQ (FSIQ) is a composite across all four domains, scaled to a mean of 100 and an SD of 15 in the standardization sample (2,200 nationally representative Americans).
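The mean-100, SD-15 scaling determines how a score maps onto the norm group. A minimal sketch, assuming scores are normally distributed (an idealization; the function names are illustrative and not part of any test manual):

```python
from statistics import NormalDist

IQ_MEAN, IQ_SD = 100, 15  # scaling stated in the text
iq_scale = NormalDist(mu=IQ_MEAN, sigma=IQ_SD)


def z_to_iq(z):
    """Convert a standard (z) score to the IQ metric."""
    return IQ_MEAN + IQ_SD * z


def iq_percentile(iq):
    """Percentage of the norm group expected to score at or below `iq`."""
    return 100 * iq_scale.cdf(iq)


for iq in (70, 85, 100, 115, 130):
    print(f"IQ {iq}: percentile {iq_percentile(iq):.1f}")
```

Under this assumption a score of 130 (two SDs above the mean) falls at roughly the 98th percentile, and a score of 70 at roughly the 2nd.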
Test-retest reliability for the FSIQ is r = .92 (1-month interval) and internal consistency is α = .98, demonstrating excellent measurement precision. Ceiling and floor effects are minimal across the age range (16–90 years), enabling comprehensive assessment from intellectually gifted to severely impaired individuals.
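A reliability coefficient translates into a margin of error around an individual score. The sketch below uses the classical test theory formula SEM = SD × sqrt(1 − reliability), reading the internal consistency figure as .98 and taking SD = 15. The function names are illustrative, and published manuals may build their intervals differently (for example, around estimated true scores).

```python
import math


def standard_error_of_measurement(sd, reliability):
    """Classical test theory: SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)


def confidence_band(observed, sd=15, reliability=0.98, z=1.96):
    """Approximate 95% band around an observed score (z = 1.96)."""
    sem = standard_error_of_measurement(sd, reliability)
    return observed - z * sem, observed + z * sem


sem = standard_error_of_measurement(15, 0.98)
low, high = confidence_band(110)
print(f"SEM = {sem:.2f} points; 95% band for 110: {low:.1f} to {high:.1f}")
```

With these inputs the SEM is a little over 2 points, so an observed FSIQ is best read as a band of about ±4 points and not as an exact value.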
Cattell-Horn-Carroll (CHC) Theory
Cattell's 1963 distinction between Fluid Intelligence (Gf: novel problem-solving, not reliant on learned content) and Crystallized Intelligence (Gc: accumulated knowledge, education-dependent) refined Spearman's unitary g into meaningful subdimensions. This theoretical distinction addressed a puzzling phenomenon: while IQ shows substantial heritability (h² = .60), environmental interventions produce measurable IQ gains (+20 points from intensive preschool; Ramey & Ramey 1998, American Psychologist), primarily through improvement in crystallized intelligence (Gc increases with education; Gf depends less on education and is more heritable). Cattell's framework explained this: fluid intelligence depends on neural efficiency (relatively stable, heritable), while crystallized intelligence depends on learned content (highly educable).
Carroll's 1993 comprehensive analysis of 461 datasets and their factor structures proposed an expanded three-stratum model: Stratum I (specific abilities: spelling, verbal fluency, reasoning speed, etc.); Stratum II (eight broad factors including Gf, Gc, visual-spatial ability, and memory); Stratum III (general intelligence, g).
Horn and Cattell's further refinements identified processing speed (Gs), reaction time reliability, and other domains. Contemporary intelligence assessments (Woodcock-Johnson, DAS-II) have increasingly adopted the CHC framework, recognizing multiple dimensions of intelligence rather than a single g.
This theory predicts that educational intervention improves Gc substantially (r = .40 for gains from schooling) while Gf shows limited improvement (r = .15), a pattern that research supports (Winne et al. 1980 meta-analysis).
Flynn Effect: +3 IQ Points per Decade
James Flynn's (1987) observation that standardized IQ test performance rose by about 3 IQ points per decade (30 points per century) over the 20th century challenged heritability-focused interpretations. This dramatic increase, visible as raw-score improvements in IQ test standardization data from 1920 to 2000, occurred too rapidly to reflect genetic change and therefore must reflect environmental factors.
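One practical consequence of this drift is that scores earned against old norms look inflated. The sketch below assumes a constant linear rate of 3 points per decade, which is a simplification (the rate varies by country, period, and test); the function names are hypothetical and do not represent a standard scoring procedure.

```python
def expected_norm_inflation(years_since_norming, points_per_decade=3.0):
    """Points by which aging norms are expected to overstate a score."""
    return points_per_decade * years_since_norming / 10


def adjusted_score(observed_iq, years_since_norming, points_per_decade=3.0):
    """Observed score re-expressed against hypothetical up-to-date norms."""
    return observed_iq - expected_norm_inflation(years_since_norming, points_per_decade)


print(expected_norm_inflation(100))  # gain over a century at the stated rate
print(adjusted_score(106, 20))       # score on a test normed 20 years earlier
```

At the stated rate, a test normed 20 years ago would overstate scores by about 6 points relative to current norms.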
Flynn documented gains across diverse countries and cultures (Netherlands, Belgium, Israel, Norway, Japan), with larger gains in performance (nonverbal) IQ (+5 points/decade) than in verbal IQ (+2 points/decade). Plausible environmental candidates include improved nutrition (height gains parallel IQ gains), greater environmental complexity (visual media and information access increasing Gf), and increased test familiarity and educational attainment.
Paradoxically, however, simple reaction time did not improve in proportion to IQ gains, suggesting improved problem-solving strategies and abstract thinking rather than faster neural processing (contradicting some neural efficiency theories). Recent evidence (Pietschnig & Voracek 2015, a meta-analysis in Perspectives on Psychological Science) documents a slowdown or reversal in developed nations since 1995 (a negative Flynn Effect in Denmark, Norway, France, and the UK), possibly reflecting educational changes, a plateau in the benefits of declining lead exposure, or stabilizing environmental complexity.
The Flynn Effect's existence proved theoretically important: it demonstrated an environmental contribution to measured intelligence despite substantial heritability, moving the debate from nature-or-nurture to how much each contributes (contemporary consensus: h² = .50, meaning that about 50% of the population-level variance in IQ is attributable to genetic differences and about 50% to environmental differences).
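The variance bookkeeping behind a heritability figure can be written out explicitly. This is a simplified sketch that assumes additive genetic and environmental contributions with no gene-environment covariance or interaction, an assumption the text does not state; the symbols V_P, V_G, and V_E (phenotypic, genetic, and environmental variance) are introduced here for illustration.

```latex
\begin{align*}
V_P &= V_G + V_E, \\
h^2 &= \frac{V_G}{V_P}, \qquad 1 - h^2 = \frac{V_E}{V_P}. \\
\text{With } \sigma_{\mathrm{IQ}} = 15:\quad V_P &= 15^2 = 225, \\
h^2 = .50 \;\Rightarrow\; V_G &= 0.50 \times 225 = 112.5, \qquad V_E = 225 - 112.5 = 112.5.
\end{align*}
```

Because h² is a ratio of variances within a population, it describes differences among individuals. It does not describe the makeup of any one person's score, and it places no limit on shifts in the population mean, which is why a high h² and the Flynn Effect can coexist.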
Stereotype Threat and Test Performance
Claude Steele's 1997 paper, "A Threat in the Air: How Stereotypes Shape Intellectual Identity and Performance," introduced stereotype threat: impaired performance when individuals are aware of negative stereotypes about their group's intelligence. Steele documented that under standard testing conditions, Black and White college students of equal prior ability showed a significant performance gap (about 0.5 SD) on difficult math problems; however, when the test was described as a "non-diagnostic laboratory problem-solving task" rather than as "diagnostic of ability," the performance gap disappeared. The mechanism: individuals from stereotyped groups ("Black people are worse at math," "women are worse at STEM") experience cognitive load from concerns about confirming the stereotype, which redirects working memory capacity from problem-solving to emotion regulation and self-relevant thoughts.
A meta-analysis (Frantz et al. 2015, Journal of Personality and Social Psychology) across 170+ studies documents stereotype threat effects across diverse domain-group pairings: women in math/science (d = 0.30), White men in athletics (d = 0.24), older adults in memory tasks (d = 0.28), and high-SES students in verbal tasks. Critically, interventions that mitigate stereotype threat (self-affirmation exercises, growth mindset instruction, reducing stereotype salience) improved performance in affected groups by 10–30%, with effects persisting to follow-up assessments.
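The d values above are standardized mean differences. The sketch below shows how Cohen's d is computed with a pooled standard deviation and how a given d can be read as a "probability of superiority" (the chance that a randomly chosen member of the higher-scoring condition outscores one from the lower-scoring condition, assuming normal distributions with equal variance). The score lists are made-up illustrative data, not results from any study.

```python
import math
from statistics import NormalDist, mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample SD."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


def probability_of_superiority(d):
    """P(random score from group A > random score from group B), normal model."""
    return NormalDist().cdf(d / math.sqrt(2))


# Hypothetical test scores under a control and a stereotype-threat condition.
control = [12, 14, 15, 13, 16]
threat = [11, 13, 14, 12, 15]

print(f"d = {cohens_d(control, threat):.2f}")
print(f"P(superiority) at d = 0.30: {probability_of_superiority(0.30):.3f}")
```

Read this way, an effect of d = 0.30 corresponds to a probability of superiority of about 58%, a modest shift in heavily overlapping distributions.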
This research revolutionized the interpretation of group differences on intelligence tests: observed differences do not necessarily reflect ability differences but can reflect test-taking conditions and stereotype activation. Implications for intelligence testing include interpreting group differences cautiously (considering stereotype threat as a possible contributor), optimizing testing conditions to minimize stereotype salience, and recognizing that intelligence tests are socioculturally embedded.