
Buyer's guide · Multilingual · Psychometric quality

Guide to multilingual assessment localisation quality for career platforms.

ITC Test Adaptation Guidelines (2018), AERA / APA / NCME Standards Chapter 9 expectations, equivalence testing, and per-language psychometric review for serious multilingual deployments.

In Brief

This guide covers the psychometric localisation work that distinguishes a serious multilingual career-assessment deployment from a translated-only deployment. It explains the difference between translation and psychometric localisation — the latter requires equivalence testing, item-level differential-functioning analysis, and per-language documentation that the assessment produces psychometric properties (reliability, validity, factor structure) equivalent to the source language. It walks through the International Test Commission's Guidelines for Translating and Adapting Tests (2018 edition) and the AERA / APA / NCME Standards for Educational and Psychological Testing Chapter 9, the two authoritative international frameworks. It explains the three-phase equivalence-testing sequence — configural, metric, scalar — and the data requirements that make proper localisation expensive and slow. It discloses JobCannon's honest current state: English shipped, Spanish in active build, Portuguese / Arabic / Ukrainian sponsorable, other languages conditional on partnership commitment, and explains why honest disclosure is preferable to wide-coverage claims with thin evidence. It walks through a six-checkpoint buyer evaluation framework and closes with a five-component defensible deployment plan.

Chapters in this guide

A reading map for institutional buyers considering multilingual deployments.

Translation vs psychometric localisation
Why translation alone produces unreliable results, and what equivalence testing requires.
ITC and AERA / APA / NCME frameworks
The two authoritative international standards and what they expect of test developers and users.
Three-phase equivalence testing
Configural, metric, scalar equivalence and the data requirements at each phase.
Honest current state and roadmap
JobCannon’s English shipped, Spanish in build, sponsorable-language model with full disclosure.

Assessment battery in production: English

Spanish in active build. Portuguese, Arabic, Ukrainian sponsorable.

Career orientation
Well-validated cross-culturally
Personality and traits
Big Five: the strongest cross-cultural evidence

Compared to multilingual-deployed assessment platforms

For an institutional deployment serving 10,000 participants across multiple languages

$120-300K/yr
Pearson TalentLens multilingual
Per-instrument per-language licensing
$80-200K/yr
SHL Global Assessment Library
Per-language per-test licensing
$60-150K/yr
Hogan multilingual
Coach-tier per-language licensing
$0
JobCannon
Unlimited, forever

What this guide covers

Translation versus psychometric localisation
ITC Test Adaptation Guidelines (2018)
AERA / APA / NCME Standards Chapter 9 expectations
Three-phase equivalence testing (configural, metric, scalar)
Differential item functioning analysis
JobCannon multilingual current state and roadmap
Six-checkpoint buyer evaluation framework
Five-component defensible deployment plan

Related on JobCannon

This guide is one of twenty in the JobCannon for Business reading library; localisation buyers reading the equivalence-validation framing here also read the refugee employment skill-mapping guide for cross-language deployment patterns and the NGO grant reporting guide for how multilingual data flows into common-set funder outcome measures.

For the operational landing where multilingual deployment matters most, see our out-of-school-youth vertical, where ESL-heavy WIOA Title I cohorts and bilingual workforce programmes drive most localisation requests.

Pricing for multilingual deployments

English-shipped assessments stay free under an institutional partnership. Spanish ships when equivalence validation completes. Sponsorable languages (Portuguese, Arabic, Ukrainian) are available under a partnership engagement that underwrites the localisation work.

Starter

Try it with a micro-team

$0
  • 5 invites (one-time, not recurring)
  • All 50+ assessments
  • Basic individual reports
  • Share link via email or Slack
  • No credit card required
Request free access

Coach

For independent coaches and therapists

$29/mo
or $290/yr (save 17%)
  • 30 invites per month
  • All 50+ assessments
  • Detailed individual reports
  • Coach notes per client
  • PDF export (client-ready)
  • Session prep recommendations
Get Coach access
Most Popular

Team

For startups, teams and HR

$79/mo
or $790/yr (save 17%)
  • 100 invites per month
  • Everything in Coach
  • Team DNA dashboard
  • Compatibility matrix
  • Conflict-pattern detection
  • Compare 2-3 team members
Get Team access
Recommended

Business

For agencies, L&D and scale-ups

$199/mo
or $1990/yr (save 17%)
  • 500 invites per month
  • Everything in Team
  • White-label PDF reports (your logo)
  • API access (read-only results)
  • Custom assessment builder (beta)
  • Bulk CSV import/export
Get Business access

Enterprise

For 200+ person companies

From $5k/yr
  • Unlimited invites
  • Everything in Business
  • SSO (SAML, Google Workspace)
  • SLA (99.9% uptime)
  • Data residency options (EU/US)
  • Dedicated Customer Success
Talk to us

All plans currently activated manually via the contact form — we review each request within 24 hours and provision access the same day. Self-serve checkout coming once we've heard from the first wave of teams.

Talk to a localisation specialist

Tell us your role, your population's languages, and the stakes of the assessment use. We respond within one business day with an honest evidence-and-roadmap answer.

We reply within 24 hours. No spam, no per-seat pitches.

FAQ

What does psychometric localisation actually require, and why is it different from translation?

Translation is the process of rendering text in a target language so that meaning is preserved. Psychometric localisation is the much stronger requirement that an assessment instrument administered in the target language produces psychometric properties — reliability, validity, factor structure, item-level functioning — equivalent to the source-language version. The difference matters because items that are translated literally often function differently across languages even when the meaning is preserved. An item asking about “standing up for myself in conflict” works psychometrically in American English where it captures a recognisable construct; the literal translation in Japanese may capture a different and more culturally loaded construct because the cultural framing of conflict differs. Psychometric localisation therefore requires both translation quality and equivalence testing, with adjustment of items that fail equivalence in the target language. The authoritative international guidance is the International Test Commission’s Guidelines for Translating and Adapting Tests, currently in its second edition (2018) and broadly adopted as best practice by major test publishers and academic researchers. The ITC guidelines specify pre-condition, test-development, confirmation, administration, score-interpretation, and documentation phases with detailed procedural recommendations. Other authoritative frameworks include the AERA / APA / NCME Standards for Educational and Psychological Testing (2014), particularly Chapter 9 on the testing of individuals of diverse linguistic backgrounds. A platform that ships an assessment in multiple languages without psychometric localisation work has produced translations, not localised assessments; results in those languages should be interpreted with substantial caution.
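One of the simplest per-language checks the answer above alludes to is internal-consistency reliability. A minimal sketch, assuming Likert-style item scores; the function is standard Cronbach's alpha, and the two response matrices are entirely synthetic illustrations, not JobCannon data:

```python
from statistics import variance

def cronbach_alpha(item_scores):
    """Cronbach's alpha for rows of respondents (each row = one person's item scores)."""
    k = len(item_scores[0])                      # number of items
    columns = list(zip(*item_scores))            # transpose to per-item columns
    item_vars = sum(variance(col) for col in columns)
    total_var = variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Synthetic 5-item responses for two language versions (illustrative only).
english = [[4, 4, 5, 4, 4], [2, 3, 2, 2, 3], [5, 5, 4, 5, 5],
           [3, 3, 3, 2, 3], [1, 2, 1, 1, 2], [4, 5, 4, 4, 4]]
spanish = [[4, 3, 5, 2, 4], [2, 4, 1, 5, 3], [5, 2, 4, 3, 5],
           [3, 5, 2, 4, 1], [1, 3, 4, 2, 5], [4, 1, 3, 5, 2]]

alpha_en = cronbach_alpha(english)   # items hang together -> high alpha
alpha_es = cronbach_alpha(spanish)   # noisy translated items -> low alpha
# A large per-language gap in alpha is a red flag that the translated items
# do not measure the construct the way the source-language items do.
```

Equivalent reliability is necessary but not sufficient; the equivalence-testing sequence described in the next answer is what establishes that the two versions measure the same construct.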

How is equivalence between language versions actually tested?

Equivalence testing has multiple components, with the choice of methods depending on the construct, the item types, and the available data. The standard sequence has three phases. First, configural equivalence — testing that the same factor structure holds across languages, typically through multi-group confirmatory factor analysis. The Big Five, for example, should produce five correlated factors in each language version with similar item-to-factor loading patterns. Failure of configural equivalence indicates that the target-language version is measuring a different construct or a differently-organised version of the construct. Second, metric equivalence — testing that the loadings of items on factors are equivalent across languages. Failure of metric equivalence indicates that items vary in their relationship to the construct in different languages. Third, scalar equivalence — testing that the item intercepts are equivalent across languages, which is required for direct comparison of mean scores. Failure of scalar equivalence indicates that items have different difficulty levels in different languages, making mean comparisons problematic but not invalidating within-language interpretation. Item-level analyses including differential item functioning (DIF) testing using Mantel-Haenszel or item-response-theory approaches identify specific items that function differently across languages; these items can be removed, modified, or kept with caution. The data requirements for proper equivalence testing are substantial — typically several hundred respondents per language for confirmatory factor analysis, more for IRT-based DIF — which makes proper localisation expensive and slow. Career-assessment platforms supporting many languages without sample sizes adequate for equivalence testing should disclose this limitation; the alternative is shipping translations dressed as localisations.
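The Mantel-Haenszel DIF procedure mentioned above can be sketched compactly. This is a minimal illustration of the standard common-odds-ratio statistic with synthetic counts; the strata and thresholds are textbook conventions, not JobCannon outputs:

```python
import math

def mantel_haenszel_or(strata):
    """Common odds ratio across score strata.
    Each stratum: (ref_right, ref_wrong, focal_right, focal_wrong)."""
    num = den = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue
        num += a * d / n
        den += b * c / n
    return num / den

# Synthetic strata matched on total score (illustrative only): at each
# ability level the reference-language group answers this item correctly
# more often than the focal-language group, so the item may be flagged.
strata = [
    (30, 70, 18, 82),   # low scorers
    (55, 45, 38, 62),   # mid scorers
    (80, 20, 65, 35),   # high scorers
]
alpha_mh = mantel_haenszel_or(strata)
delta_mh = -2.35 * math.log(alpha_mh)   # ETS delta scale
# |delta| >= 1.5 is the conventional threshold for large ("C-level") DIF.
```

Items flagged this way are candidates for removal or rewording before the target-language version is compared to the source.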

What is JobCannon’s honest current state on multilingual assessment, and what is in the build pipeline?

JobCannon ships in English in production. The English versions of the assessment battery are based on published psychometric instruments — RIASEC drawing on Holland and the substantial subsequent literature, Big Five drawing on the Costa and McCrae NEO framework and the OCEAN literature, Multiple Intelligences drawing on Howard Gardner with caveats about the framework’s controversial standing, Maslach Burnout Inventory framework for burnout-risk, and similar published foundations for other instruments. Spanish is in active build, with translation work completed and equivalence-validation work in progress. Portuguese, Arabic, and Ukrainian are sponsorable language additions — the technical infrastructure for adding language versions is in place, but the cost of professional translation, equivalence testing, and ongoing psychometric maintenance has not yet been funded for these languages. A sponsorable language is one where an institutional partner (a workforce board, NGO, or large employer with significant population in that language) can underwrite the localisation work in exchange for early access. Other languages — French, German, Mandarin, Hindi, and others — are not currently in the pipeline. Buyers considering JobCannon for multilingual deployments should treat English as production-ready, Spanish as in build with timeline depending on equivalence-testing data accumulation, and other languages as conditional on partnership-level commitment. This honest disclosure matters because the alternative — shipping translations dressed as localised assessments — produces poor user experience, reduced validity in non-English deployments, and reputational risk for both platform and buyer when results in unsupported languages turn out to be unreliable.

How do buyers evaluate the multilingual quality of competing platforms?

A defensible evaluation has six checkpoints. First, language coverage claim — which languages does the platform claim to support, and what is the basis of the claim? Translation alone, professional translation with back-translation, professional localisation with equivalence testing, fully validated localisation with published psychometric documentation. Each level represents a different quality claim and should be supported by different evidence. Second, evidence depth per language — for each claimed-supported language, what evidence does the platform provide of the localisation work? Sample sizes, equivalence-testing methodology, item-level DIF analyses, internal consistency reliability per language, factor structure documentation. Platforms claiming wide language coverage with thin evidence per language are typically shipping translations rather than localised instruments. Third, cultural-context adaptation — beyond the language itself, has the platform adapted item content for cultural context where appropriate? Some constructs (achievement motivation, individualism / collectivism dimensions, social anxiety) are culturally loaded and direct translation produces poor results in some target cultures even with accurate language rendering. Fourth, ongoing maintenance — localised instruments require ongoing psychometric monitoring as language usage shifts, particularly in rapidly evolving cultural contexts. Platforms with one-time localisation work and no ongoing programme produce instruments that drift over time. Fifth, support infrastructure in target language — user support, instructions, help text, error messages, and result content all need to be in the target language. A localised assessment with English-only support infrastructure is operationally awkward. 
Sixth, reading-level alignment — instruments designed for an adult population with high-school reading level in source language may not translate to the same reading level in target languages, particularly for languages with significantly different vocabulary structure.
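The six checkpoints lend themselves to a simple per-language scorecard. A sketch under the assumption that each checkpoint reduces to a yes/no evidence flag; the class name, field names, and example vendor are hypothetical:

```python
from dataclasses import dataclass, fields

@dataclass
class LanguageEvidence:
    """Per-language vendor evidence, one flag per checkpoint (illustrative)."""
    equivalence_tested: bool      # 1: coverage claim backed by equivalence evidence
    dif_analysis_published: bool  # 2: item-level DIF documentation per language
    culturally_adapted: bool      # 3: content adapted, not just translated
    ongoing_monitoring: bool      # 4: psychometric maintenance programme exists
    localised_support: bool       # 5: help text and support in the target language
    reading_level_checked: bool   # 6: target-language reading level verified

def checkpoint_score(ev: LanguageEvidence) -> int:
    """Count of checkpoints with supporting evidence, 0-6."""
    return sum(1 for f in fields(ev) if getattr(ev, f.name))

# A hypothetical vendor claiming "Spanish supported" with only a translation on file:
translation_only = LanguageEvidence(False, False, False, False, True, False)
# A low score suggests a translation dressed as a localisation.
```

In practice the checkpoints are graded rather than binary, but even this coarse tally separates wide-coverage claims from evidenced localisation.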

How do AERA / APA / NCME Standards Chapter 9 expectations affect deployment decisions?

Chapter 9 of the AERA / APA / NCME Standards for Educational and Psychological Testing (2014 edition) addresses the testing of individuals of diverse linguistic backgrounds. The chapter’s core expectations have eight components. First, the test developer should disclose the linguistic, cultural, and educational background of the populations on which the test was developed and validated. Second, when a test is administered to individuals in a non-source language, the test developer or user should provide evidence that the test functions appropriately in that language. Third, when scores are interpreted across language groups, the test developer or user should document the equivalence evidence supporting the cross-language interpretation. Fourth, when accommodations are provided to test-takers with limited proficiency in the test language, the accommodations should be supported by evidence and documented. Fifth, the test developer should consider the linguistic complexity of items and avoid unnecessary linguistic load that introduces construct-irrelevant variance. Sixth, score reports should be provided in the language most appropriate for the test-taker. Seventh, the test administration procedures should accommodate linguistically diverse populations. Eighth, the test developer should document evidence supporting the validity of the test for the intended use with the diverse linguistic population. For a buyer evaluating a multilingual career-assessment platform, the Standards Chapter 9 framework provides a structured set of questions to ask the vendor: what evidence exists per language, how scores are interpreted across language groups, whether accommodations are supported, whether reading-level and linguistic complexity are managed, whether reports are provided in the test-taker’s preferred language. 
The level of evidence appropriate depends on the stakes of the assessment use — low-stakes career exploration tolerates less evidence than high-stakes selection or accountability use.

What does a defensible multilingual deployment plan look like?

A defensible plan has five components. First, scope decision — which languages does the deployment actually need? Many institutional buyers default to a longer language list than the population actually requires, producing localisation cost and complexity without proportional benefit. Population data — home-language data from school enrollment, primary-language data from workforce-board intake, language-of-business preference for employer deployments — supports a focused decision. Second, evidence requirements per language — what level of evidence does the deployment require, given the stakes of the assessment use? Career-exploration use can tolerate professional translation with quality-assurance review where validated localisation is not yet available; selection or accountability use requires full validated localisation. Third, data-collection plan for languages where validated localisation is in build — the deployment should support data collection that contributes to ongoing equivalence testing. Career-assessment platforms with multiple deployments in the same language can pool data for psychometric maintenance work that no single deployment could fund. Fourth, transparency plan — how are limitations communicated to users? Users completing assessments in languages with thin evidence should see, at result-page or methodology level, language about the limitations. Hiding limitations produces backlash when issues surface. Fifth, escalation plan — what happens when a result in a non-source language seems unreliable or when a user surfaces a localisation issue? A clear feedback mechanism and a clear path to remediation support the deployment's integrity over time. JobCannon's production posture supports this plan with English shipped, Spanish in active build, and an explicit sponsorable-language framework that surfaces the work required for additional languages rather than claiming universal coverage.
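Components one and two of the plan can be captured as a small per-language policy table. A sketch only — the tier names, disclosure strings, and stakes mapping are illustrative assumptions, not a JobCannon configuration:

```python
# Hypothetical per-language deployment scope: each entry records the evidence
# tier the deployment accepts and how limitations are disclosed to users.
deployment_plan = {
    "en": {"evidence": "validated_localisation", "disclosure": None},
    "es": {"evidence": "translation_plus_qa",    # validated version in build
           "disclosure": "results-page caveat + methodology note"},
}

def allowed_uses(evidence_tier: str) -> set:
    """Stakes an evidence tier can support (illustrative mapping)."""
    if evidence_tier == "validated_localisation":
        return {"exploration", "selection", "accountability"}
    if evidence_tier == "translation_plus_qa":
        return {"exploration"}    # low-stakes only until equivalence holds
    return set()                  # unknown tier: no supported uses
```

Encoding the stakes-to-evidence mapping explicitly makes the transparency and escalation components auditable: any proposed use outside the allowed set is a policy violation rather than a judgment call.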

Author

Peter Kolomiets

Founder & Lead Researcher, JobCannon

Peter is the founder of JobCannon and leads the assessment validation, knowledge graph, and B2B partnerships. He has 10+ years of experience working with NGO and educational career programmes globally.