Item nonresponses in educational and psychological measurement
The question of how to handle missing responses in psychological and educational measurement has been debated repeatedly and controversially for decades. Even in highly respected international large-scale assessments such as PISA (Programme for International Student Assessment), TIMSS (Third International Mathematics and Science Study), and PIRLS (Progress in International Reading Literacy Study), a generally accepted methodology for missing data is still lacking. Since the late 1990s, multidimensional item response theory (MIRT) models for item nonresponses have been developed. These models quickly become complex in application and rest upon assumptions that are usually not critically addressed. Although the statistical theory of missing data suggests that missing responses must be handled adequately to avoid inefficient and biased parameter estimation, there is empirical evidence that IRT-based parameter estimation is fairly robust against missing responses. This calls into question the need for sophisticated IRT model-based approaches. For that reason, this thesis consists of two major parts. After an introduction to missing data theory in the context of educational and psychological measurement, the first part examines the impact of item nonresponses on item and person parameter estimates. In the second part, existing approaches to handling missing responses are scrutinized and further developed. The different methods are critically compared, and recommendations are given as to which approaches are appropriate. The considerations are confined to dichotomous items, which are still common in many tests and assessments. The impact of missing responses on item and person parameter estimates is shown analytically and empirically using simulated data. The results clearly show that ignoring systematic missing data leads to biased item and person parameter estimates in IRT models.
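The direction of such bias can be illustrated with a small simulation. The following is only a minimal sketch under assumptions that are not taken from the thesis: responses follow a Rasch model, the probability of answering an item depends on a latent response propensity correlated with ability (`rho`), and the classical proportion correct per item stands in for an IRT item parameter estimate. All function names, parameter values, and the sample size are hypothetical choices made for illustration.

```python
import numpy as np


def simulate_responses(n_persons=20000, n_items=10, rho=0.7, seed=0):
    """Simulate dichotomous Rasch responses with ability-related nonresponse.

    Assumed setup (illustrative only): ability and response propensity are
    bivariate standard normal with correlation rho.
    """
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    latent = rng.multivariate_normal([0.0, 0.0], cov, size=n_persons)
    ability, propensity = latent[:, 0], latent[:, 1]

    # Rasch model for the item responses.
    difficulty = np.linspace(-1.5, 1.5, n_items)
    p_correct = 1.0 / (1.0 + np.exp(-(ability[:, None] - difficulty[None, :])))
    y = (rng.random((n_persons, n_items)) < p_correct).astype(float)

    # Logistic model for the response indicators: True means "answered".
    # The intercept 1.0 is an arbitrary choice that keeps most responses observed.
    p_observed = 1.0 / (1.0 + np.exp(-(propensity[:, None] + 1.0)))
    observed = rng.random((n_persons, n_items)) < p_observed
    return y, observed


def proportion_correct(y, observed):
    """Return per-item proportion correct under three treatments of missing data."""
    complete = y.mean(axis=0)  # benchmark: no responses missing
    available = np.nanmean(np.where(observed, y, np.nan), axis=0)  # missing ignored
    ias = np.where(observed, y, 0.0).mean(axis=0)  # missing scored as incorrect
    return complete, available, ias


if __name__ == "__main__":
    y, observed = simulate_responses()
    complete, available, ias = proportion_correct(y, observed)
    for j in range(y.shape[1]):
        print(f"item {j + 1}: complete={complete[j]:.3f} "
              f"ignored={available[j]:.3f} scored-incorrect={ias[j]:.3f}")
```

In this setup, persons with low ability answer fewer items, so ignoring the missing responses makes every item look easier than it is, while scoring missing responses as incorrect makes every item look harder. Setting `rho=0.0` removes the dependence between ability and missingness, in which case ignoring the missing responses no longer distorts the proportions systematically.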
The findings highlight the need for appropriate methods to handle item nonresponses. It is shown that simple ad hoc methods such as incorrect answer substitution (IAS) or partially correct scoring (PCS) cannot be justified theoretically and threaten test fairness and the validity of test results. The nominal response model (NRM) for item nonresponses is examined as an alternative approach. In this model, item nonresponses are regarded as an additional response category. However, the NRM rests upon strong assumptions, and its applicability is therefore limited. MIRT models for missing responses rank among the modern model-based approaches. The underlying rationale of these models is outlined in detail. It is shown that MIRT models for item nonresponses are special cases of selection models and pattern mixture models for latent trait models with particular assumptions. Different MIRT models are discussed in the literature and are typically regarded as equivalent. Two classes of MIRT models can be distinguished: between-item and within-item multidimensional IRT models. In this thesis it is shown that these models are not equivalent per se. Typically, the question of model equivalence is considered with respect to model fit. In models for item nonresponses, however, model fit alone is insufficient to judge the equivalence of alternative models. Equivalence in the construction of the latent variable of interest and the bias reduction achieved are additional criteria that need to be considered. A common framework of IRT models for item nonresponses is presented. Different between- and within-item multidimensional IRT models are rationally developed, taking the issue of model equivalence into account. Between-item multidimensional models are easy to specify and interpret and are recommended as the models of choice. The disadvantage of MIRT models for item nonresponses is their complexity.
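For orientation, a between-item multidimensional model of the kind referred to above can be sketched as follows. The notation is an assumption made here for illustration and is not taken from the thesis: $Y_i$ is the dichotomous response to item $i$, $D_i$ indicates whether item $i$ was answered, $\xi$ is the latent variable of interest, $\theta$ is a latent variable governing the tendency to respond (assumed unidimensional here), and Rasch-type response functions with item parameters $\beta_i$ and $\gamma_i$ are assumed.

```latex
\begin{align}
P(Y_i = 1 \mid \xi) &= \frac{\exp(\xi - \beta_i)}{1 + \exp(\xi - \beta_i)}, \\
P(D_i = 1 \mid \theta) &= \frac{\exp(\theta - \gamma_i)}{1 + \exp(\theta - \gamma_i)}, \\
(\xi, \theta)^{\top} &\sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma}).
\end{align}
```

In this sketch each manifest variable depends on exactly one latent dimension, which is what makes the structure between-item multidimensional. The dependence between missingness and the latent variable of interest is carried entirely by the covariance of $\xi$ and $\theta$ in $\boldsymbol{\Sigma}$.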
Besides the latent variables of theoretical interest, a latent response propensity is introduced to model the missing data mechanism. In applications, unidimensionality of the latent response propensity is typically assumed. This is a strong and often untested assumption. It is demonstrated that MIRT models fail to correct for missing data if multidimensionality of the latent response propensity is not taken into account. Hence, the number of manifest and latent variables can become fairly large in MIRT models for item nonresponses. Unfortunately, high-dimensional MIRT models are still computationally challenging. For that reason, more parsimonious and less demanding latent regression IRT models and multiple group IRT models are derived as alternatives. The relationship between these models and the MIRT models is demonstrated. Finally, it is shown that missing responses due to omitted items and missing responses due to not-reached items have different properties, which suggests treating them differently in IRT measurement models. Whereas omitted responses can be handled appropriately by MIRT models, not-reached items need to be taken into account by latent regression models. Since real data sets typically suffer from both omitted and not-reached items, a joint model is introduced that accounts for both types of missing responses. The thesis ends with a final discussion in which the findings are summarized and recommendations for real applications are given. Unsolved problems and remaining research questions are outlined.