The present study examined the reliability of the reading, listening, speaking, and writing section scores for the TOEFL iBT® test and their interrelationships to collect empirical evidence supporting, respectively, the generalization inference and the explanation inference in the TOEFL iBT validity argument (Chapelle, Enright, & Jamieson, 2008). Data from four operational TOEFL iBT test administrations were analyzed, using a combination of Haberman’s (2008) subscore analysis and confirmatory factor analysis (CFA), for all examinees and for three major native language (L1) groups (Arabic, Korean, and Spanish). Key results were consistent across the forms and samples. First, Haberman’s (2008) subscore analysis suggested that the reliabilities of the section scores were generally satisfactory, although the reliability of the writing section was relatively low. Second, Haberman’s subscore analysis and CFA offered different degrees of support for the distinctness of the TOEFL iBT section scores. A subsequent multiple-group CFA based on a correlated four-factor model generally supported measurement invariance across the L1 groups in terms of factor loadings, indicator residuals, and intercepts, despite the population heterogeneity indicated by the partial invariance of the latent factor variances and by differences in the latent factor means across the groups. In addition, Haberman’s subscore analysis suggested that the speaking section score offered value-added information owing to its generally high reliability and relative distinctness from the other three section scores. This finding is relevant to the utilization inference in the validity argument from the perspective of the psychometric quality of the TOEFL iBT section scores.