1
Nie L, Xu P, Hu D. Multidimensional IRT for forced choice tests: A literature review. Heliyon 2024; 10:e26884. PMID: 38449643; PMCID: PMC10915382; DOI: 10.1016/j.heliyon.2024.e26884.
Abstract
The Multidimensional Forced Choice (MFC) test is frequently utilized in non-cognitive evaluations because of its effectiveness in reducing the response biases commonly associated with conventional Likert scales. Nonetheless, it is critical to recognize that the MFC test generates ipsative data, a type of measurement that has been criticized due to its limited applicability for comparing individuals. Multidimensional item response theory (MIRT) models have recently sparked renewed interest among academics and professionals, largely due to the development of several models that make it easier to collect normative data from forced-choice tests. The paper introduces a modeling framework made up of three key components: response format, measurement model, and decision theory. Under this framework, four IRT models are presented as examples. A comprehensive study then compares and characterizes the parameter estimation techniques used in MFC-IRT models. The work next examines empirical research in three distinct domains: parameter invariance testing, computerized adaptive testing (CAT), and validity investigation. Finally, it is recommended that future research follow four distinct paths: modeling, parameter invariance testing, forced-choice CAT, and validity studies.
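As orientation for the models surveyed in this collection, the Thurstonian IRT model (a recurring reference point in the entries below) gives the probability of preferring statement $i$ over statement $k$ within a block pair as follows; this is the standard formulation from the literature, not notation taken from this abstract:

```latex
% Thurstonian IRT pairwise-comparison probability (standard formulation)
P\{y_{ik} = 1 \mid \eta_a, \eta_b\}
  = \Phi\!\left(
      \frac{-\gamma_{ik} + \lambda_i \eta_a - \lambda_k \eta_b}
           {\sqrt{\psi_i^2 + \psi_k^2}}
    \right)
```

where $\Phi$ is the standard normal CDF, $\gamma_{ik}$ is the pair threshold, $\lambda_i$ and $\lambda_k$ are the statements' factor loadings on traits $\eta_a$ and $\eta_b$, and $\psi_i^2$, $\psi_k^2$ are the statements' uniquenesses.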
Affiliation(s)
- Lei Nie
- School of Public Administration, East China Normal University, China
- Peiyi Xu
- Department of Educational Psychology, Faculty of Education, East China Normal University, China
- Di Hu
- School of Education and Social Policy, Northwestern University, USA
2
Sun L, Qin Z, Wang S, Tian X, Luo F. Contributions to Constructing Forced-Choice Questionnaires Using the Thurstonian IRT Model. Multivariate Behavioral Research 2024; 59:229-250. PMID: 37776890; DOI: 10.1080/00273171.2023.2248979.
Abstract
Forced-choice questionnaires involve presenting items in blocks and asking respondents to provide a full or partial ranking of the items within each block. To prevent involuntary or voluntary response distortions, blocks are usually formed of items that possess similar levels of desirability. Assembling forced-choice blocks is not a trivial process, because in addition to desirability, both the direction and magnitude of relationships between items and the traits being measured (i.e., factor loadings) need to be carefully considered. Based on simulations and empirical studies using item pairs, we provide recommendations on how to construct item pairs matched by desirability. When all pairs contain items keyed in the same direction, score reliability is improved by maximizing within-block loading differences. Higher reliability is obtained when even a small number of pairs consist of unequally keyed items.
Affiliation(s)
- Luning Sun
- The Psychometrics Centre, University of Cambridge
- Zijie Qin
- Faculty of Psychology, Beijing Normal University
- Shan Wang
- Faculty of Psychology, Beijing Normal University
- Xuetao Tian
- Faculty of Psychology, Beijing Normal University
- Fang Luo
- Faculty of Psychology, Beijing Normal University
3
Zheng C, Liu J, Li Y, Xu P, Zhang B, Wei R, Zhang W, Liu B, Huang J. A 2PLM-RANK multidimensional forced-choice model and its fast estimation algorithm. Behav Res Methods 2024. PMID: 38409459; DOI: 10.3758/s13428-023-02315-x.
Abstract
High-stakes non-cognitive tests frequently employ forced-choice (FC) scales to deter faking. To mitigate the resulting score ipsativity, many scoring models have been devised. Among them, the multi-unidimensional pairwise preference (MUPP) framework is highly flexible and commonly used. However, the original MUPP model was developed for the unfolding response process and can only handle paired comparisons. The present study proposes the 2PLM-RANK as a generalization of the MUPP model to accommodate dominance RANK-format responses. In addition, an improved stochastic EM (iStEM) algorithm is devised for more stable and efficient parameter estimation. Simulation results generally supported the efficiency and utility of the new algorithm in estimating the 2PLM-RANK when applied to both triplets and tetrads across various conditions. An empirical illustration with responses to a 24-dimensional personality test further supported the practicality of the proposed model. To further aid in the application of the new model, a user-friendly R package is also provided.
Affiliation(s)
- Chanjin Zheng
- Department of Educational Psychology, Faculty of Education, East China Normal University, Shanghai, China.
- Juan Liu
- Beijing Insight Online Management Consulting Co., Ltd., Beijing, China
- Yaling Li
- Beijing Insight Online Management Consulting Co., Ltd., Beijing, China
- Peiyi Xu
- Department of Educational Psychology, Faculty of Education, East China Normal University, Shanghai, China
- Beijing Insight Online Management Consulting Co., Ltd., Beijing, China
- Bo Zhang
- School of Labor and Employment Relations and Department of Psychology, University of Illinois Urbana-Champaign, Champaign, USA
- Ran Wei
- Beijing Insight Online Management Consulting Co., Ltd., Beijing, China
- Wenqing Zhang
- Department of Educational Psychology, Faculty of Education, East China Normal University, Shanghai, China
- Beijing Insight Online Management Consulting Co., Ltd., Beijing, China
- Boyang Liu
- Beijing Insight Online Management Consulting Co., Ltd., Beijing, China
- Jing Huang
- Educational Psychology and Research Methodology, Purdue University, West Lafayette, IN, USA
4
Wang Q, Zheng Y, Liu K, Cai Y, Peng S, Tu D. Item selection methods in multidimensional computerized adaptive testing for forced-choice items using Thurstonian IRT model. Behav Res Methods 2024; 56:600-614. PMID: 36750522; DOI: 10.3758/s13428-022-02037-6.
Abstract
Multidimensional computerized adaptive testing for forced-choice items (MFC-CAT) combines the benefits of multidimensional forced-choice (MFC) items and computerized adaptive testing (CAT) in that it eliminates response biases and reduces administration time. Previous studies that explored designs of MFC-CAT only discussed item selection methods based on the Fisher information (FI), which is known to perform unstably at early stages of CAT. This study proposes a set of new Kullback-Leibler (KL) information-based item selection methods for MFC-CAT (namely, MFC-KI, MFC-KB, and MFC-KLP) under the Thurstonian IRT (TIRT) model. Three simulation studies, including one based on real data, were conducted to compare the performance of the proposed KL-based item selection methods against the existing FI-based methods in three- and five-dimensional MFC-CAT scenarios with various test lengths and inter-trait correlations. Results demonstrate that the proposed KL-based item selection methods are feasible for MFC-CAT and generate acceptable trait estimation accuracy and uniformity of item pool usage. Among the three proposed methods, MFC-KB and MFC-KLP outperformed the existing FI-based item selection methods and resulted in the most accurate trait estimation and relatively even utilization of the item pool.
Affiliation(s)
- Qin Wang
- Jiangxi Normal University, Nanchang, China
- Yi Zheng
- Arizona State University, Tempe, AZ, USA
- Kai Liu
- Jiangxi Normal University, Nanchang, China
- Yan Cai
- Jiangxi Normal University, Nanchang, China
- Siwei Peng
- Jiangxi Normal University, Nanchang, China
- Dongbo Tu
- Jiangxi Normal University, Nanchang, China
5
Qiu X, de la Torre J. A dual process item response theory model for polytomous multidimensional forced-choice items. The British Journal of Mathematical and Statistical Psychology 2023; 76:491-512. PMID: 36967236; DOI: 10.1111/bmsp.12303.
Abstract
The use of multidimensional forced-choice (MFC) items to assess non-cognitive traits such as personality, interests and values in psychological tests has a long history, because MFC items show strengths in preventing response bias. Recently, there has been a surge of interest in developing item response theory (IRT) models for MFC items. However, nearly all of the existing IRT models have been developed for MFC items with binary scores. Real tests use MFC items with more than two categories; such items are more informative than their binary counterparts. This study developed a new IRT model for polytomous MFC items based on the cognitive model of choice, which describes the cognitive processes underlying humans' preferential choice behaviours. The new model is unique in its ability to account for the ipsative nature of polytomous MFC items, to assess individual psychological differentiation in interests, values and emotions, and to compare the differentiation levels of latent traits between individuals. Simulation studies were conducted to examine the parameter recovery of the new model with existing computer programs. The results showed that both statement parameters and person parameters were well recovered when the sample size was sufficient. The more complete the linking of the statements was, the more accurate the parameter estimation was. This paper provides an empirical example of a career interest test using four-category MFC items. Although some aspects of the model (e.g., the nature of the person parameters) require additional validation, our approach appears promising.
Affiliation(s)
- Xuelan Qiu
- Institute for Learning Sciences & Teacher Education, Australian Catholic University, Brisbane, Queensland, Australia
6
Tu N, Joo S, Lee P, Stark S. Comparison of parameter estimation approaches for multi-unidimensional pairwise preference tests. Behav Res Methods 2023; 55:2764-2786. PMID: 35931936; DOI: 10.3758/s13428-022-01927-z.
Abstract
Multidimensional forced-choice (MFC) testing has been proposed as a way of reducing response biases in noncognitive measurement. Although early item response theory (IRT) research focused on illustrating that person parameter estimates with normative properties could be obtained using various MFC models and formats, more recent attention has been devoted to exploring the processes involved in test construction and how that influences MFC scores. This research compared two approaches for estimating multi-unidimensional pairwise preference model (MUPP; Stark et al., 2005) parameters based on the generalized graded unfolding model (GGUM; Roberts et al., 2000). More specifically, we compared the efficacy of statement and person parameter estimation based on a "two-step" process, developed by Stark et al. (2005), with a more recently developed "direct" estimation approach (Lee et al., 2019) in a Monte Carlo study that also manipulated test length, test dimensionality, sample size, and the correlations between generating person parameters for each dimension. Results indicated that the two approaches had similar scoring accuracy, although the two-step approach had better statement parameter recovery than the direct approach. Limitations, implications for MFC test construction and scoring, and recommendations for future MFC research and practice are discussed.
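For context, the MUPP framework compared in this study models the probability of preferring statement $s$ (measuring trait $d_s$) over statement $t$ (measuring trait $d_t$) from the statements' independent endorsement probabilities, where each $P_x(1)$ is given by the GGUM; this is the standard Stark et al. (2005) formulation, reproduced here for orientation rather than drawn from the abstract:

```latex
% MUPP pairwise preference probability (Stark et al., 2005)
P(s \succ t \mid \theta_{d_s}, \theta_{d_t})
  = \frac{P_s(1)\,P_t(0)}{P_s(1)\,P_t(0) + P_s(0)\,P_t(1)}
```

Here $P_s(1)$ and $P_s(0)$ denote the GGUM probabilities of endorsing and not endorsing statement $s$ given $\theta_{d_s}$, and likewise for $t$.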
Affiliation(s)
- Naidan Tu
- Department of Psychology, University of South Florida, Tampa, FL, USA.
- Sean Joo
- Department of Educational Psychology, University of Kansas, Lawrence, KS, USA
- Philseok Lee
- Department of Psychology, George Mason University, Fairfax, VA, USA
- Stephen Stark
- Department of Psychology, University of South Florida, Tampa, FL, USA
7
Kreitchmann RS, Sorrel MA, Abad FJ. On Bank Assembly and Block Selection in Multidimensional Forced-Choice Adaptive Assessments. Educational and Psychological Measurement 2023; 83:294-321. PMID: 36866066; PMCID: PMC9972126; DOI: 10.1177/00131644221087986.
Abstract
Multidimensional forced-choice (FC) questionnaires have been consistently found to reduce the effects of socially desirable responding and faking in noncognitive assessments. Although FC has been considered problematic for providing ipsative scores under the classical test theory, item response theory (IRT) models enable the estimation of nonipsative scores from FC responses. However, while some authors indicate that blocks composed of opposite-keyed items are necessary to retrieve normative scores, others suggest that these blocks may be less robust to faking, thus impairing the assessment validity. Accordingly, this article presents a simulation study to investigate whether it is possible to retrieve normative scores using only positively keyed items in pairwise FC computerized adaptive testing (CAT). Specifically, the simulation addressed the effect of (a) different bank assembly methods (a randomly assembled bank, an optimally assembled bank, and blocks assembled on-the-fly considering every possible pair of items) and (b) block selection rules (i.e., the T-rule and the Bayesian D- and A-rules) on the estimation accuracy, ipsativity, and overlap rates. Moreover, different questionnaire lengths (30 and 60) and trait structures (independent or positively correlated) were studied, and a nonadaptive questionnaire was included as a baseline in each condition. In general, very good trait estimates were retrieved, despite using only positively keyed items. Although the best trait accuracy and lowest ipsativity were found using the Bayesian A-rule with questionnaires assembled on-the-fly, the T-rule under this method led to the worst results. This points to the importance of considering both aspects when designing FC CAT.
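As a sketch of the block selection rules this study compares, a (Bayesian) D-rule picks the next block $B$ that maximizes the determinant of the posterior information at the interim trait estimate; the following is written in generic notation and the article's exact formulation may differ:

```latex
% Generic Bayesian D-rule for adaptive block selection
B^{*} = \arg\max_{B}\;
  \det\!\left( \mathbf{I}_{B}(\hat{\boldsymbol{\theta}})
             + \sum_{b \in \text{administered}} \mathbf{I}_{b}(\hat{\boldsymbol{\theta}})
             + \boldsymbol{\Sigma}_{0}^{-1} \right)
```

where $\mathbf{I}_{B}(\hat{\boldsymbol{\theta}})$ is the Fisher information of candidate block $B$ at the current trait estimate, the sum accumulates information from blocks already administered, and $\boldsymbol{\Sigma}_{0}$ is the prior trait covariance; the A-rule instead minimizes the trace of the inverse of the same matrix.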
8
Lin Y, Brown A, Williams P. Multidimensional Forced-Choice CAT With Dominance Items: An Empirical Comparison With Optimal Static Testing Under Different Desirability Matching. Educational and Psychological Measurement 2023; 83:322-350. PMID: 36866068; PMCID: PMC9972128; DOI: 10.1177/00131644221077637.
Abstract
Several forced-choice (FC) computerized adaptive tests (CATs) have emerged in the field of organizational psychology, all of them employing ideal-point items. However, even though most items developed historically follow dominance response models, research on FC CAT using dominance items is limited; existing work is heavily dominated by simulations and lacks empirical deployment. This empirical study trialed an FC CAT with dominance items described by the Thurstonian Item Response Theory model with research participants. This study investigated important practical issues such as the implications of adaptive item selection and social desirability balancing criteria on score distributions, measurement accuracy, and participant perceptions. Moreover, nonadaptive but optimal tests of similar design were trialed alongside the CATs to provide a baseline for comparison, helping to quantify the return on investment when converting an otherwise-optimized static assessment into an adaptive one. Although the benefit of adaptive item selection in improving measurement precision was confirmed, results also indicated that at shorter test lengths CAT had no notable advantage compared with optimal static tests. Taking a holistic view incorporating both psychometric and operational considerations, implications for the design and deployment of FC assessments in research and practice are discussed.
Affiliation(s)
- Yin Lin
- University of Kent, Canterbury, UK
- SHL, Thames Ditton, Surrey, UK
9
Joo SH, Lee P, Stark S. Modeling Multidimensional Forced Choice Measures with the Zinnes and Griggs Pairwise Preference Item Response Theory Model. Multivariate Behavioral Research 2023; 58:241-261. PMID: 34370564; DOI: 10.1080/00273171.2021.1960142.
Abstract
This research developed a new ideal point-based item response theory (IRT) model for multidimensional forced choice (MFC) measures. We adapted the Zinnes and Griggs (ZG; 1974) IRT model and the multi-unidimensional pairwise preference (MUPP; Stark et al., 2005) model, henceforth referred to as ZG-MUPP. We derived the information function to evaluate the psychometric properties of MFC measures and developed a model parameter estimation algorithm using Markov chain Monte Carlo (MCMC). To evaluate the efficacy of the proposed model, we conducted a simulation study under various experimental conditions such as sample sizes, number of items, and ranges of discrimination and location parameters. The results showed that the model parameters were accurately estimated when the sample size was as low as 500. The empirical results also showed that the scores from the ZG-MUPP model were comparable to those from the MUPP model and the Thurstonian IRT (TIRT) model. Practical implications and limitations are further discussed.
10
Huang HY. Diagnostic Classification Model for Forced-Choice Items and Noncognitive Tests. Educational and Psychological Measurement 2023; 83:146-180. PMID: 36601255; PMCID: PMC9806518; DOI: 10.1177/00131644211069906.
Abstract
The forced-choice (FC) item formats used for noncognitive tests typically develop a set of response options that measure different traits and instruct respondents to make judgments among these options in terms of their preference, in order to control the response biases that are commonly observed in normative tests. Diagnostic classification models (DCMs) can provide information regarding the mastery status of test takers on latent discrete variables and are more commonly used for cognitive tests employed in educational settings than for noncognitive tests. The purpose of this study is to develop a new class of DCM for FC items under the higher-order DCM framework to meet the practical demands of simultaneously controlling for response biases and providing diagnostic classification information. By conducting a series of simulations and calibrating the model parameters with a Bayesian estimation, the study shows that, in general, the model parameters can be recovered satisfactorily with the use of long tests and large samples. More attributes improve the precision of the second-order latent trait estimation in a long test, but decrease the classification accuracy and the estimation quality of the structural parameters. When statements are allowed to load on two distinct attributes in paired comparison items, the specific-attribute condition produces better parameter estimation than the overlap-attribute condition. Finally, an empirical analysis related to work-motivation measures is presented to demonstrate the applications and implications of the new model.
Affiliation(s)
- Hung-Yu Huang
- University of Taipei, Taiwan
- Hung-Yu Huang, Distinguished Professor, Department of Psychology and Counseling, University of Taipei, No. 1, Ai-Guo West Road, Taipei 10048, Taiwan.
11
Qiu XL, de la Torre J, Ro S, Wang WC. Computerized Adaptive Testing for Ipsative Tests with Multidimensional Pairwise-Comparison Items: Algorithm Development and Applications. Applied Psychological Measurement 2022; 46:255-272. PMID: 35601264; PMCID: PMC9118927; DOI: 10.1177/01466216221084209.
Abstract
A computerized adaptive testing (CAT) solution for tests with multidimensional pairwise-comparison (MPC) items, aiming to measure career interest, values, and personality, is rare. This paper proposes new item selection and exposure control methods for CAT with dichotomous and polytomous MPC items and presents results from simulation studies. The results show that the procedures are effective in selecting items and controlling within-person statement exposure with no loss of efficiency. Implications are discussed in two applications of the proposed CAT procedures: a work attitude test with dichotomous MPC items and a career interest assessment with polytomous MPC items.
12
Chen CW, Wang WC, Mok MMC, Scherer R. A Lognormal Ipsative Model for Multidimensional Compositional Items. Front Psychol 2021; 12:573252. PMID: 34712161; PMCID: PMC8545823; DOI: 10.3389/fpsyg.2021.573252.
Abstract
Compositional items – a form of forced-choice items – require respondents to allocate a fixed total number of points to a set of statements. To describe the responses to these items, the Thurstonian item response theory (IRT) model was developed. Despite its prominence, the model requires that the items, composed of parts of statements, result in a factor loading matrix with full rank. Without this requirement, the model cannot be identified, and the latent trait estimates would be seriously biased. Besides, the estimation of the Thurstonian IRT model often results in convergence problems. To address these issues, this study developed a new version of the Thurstonian IRT model for analyzing compositional items – the lognormal ipsative model (LIM) – that would be sufficient for tests using items with all statements positively phrased and with equal factor loadings. We developed an online value test following Schwartz’s values theory using compositional items and collected response data from N = 512 participants aged 13 to 51 years. The results showed that our LIM had an acceptable fit to the data, and that the reliabilities exceeded 0.85. A simulation study resulted in good parameter recovery, a high convergence rate, and sufficient estimation precision across various conditions of inter-trait covariance matrices, test lengths, and sample sizes. Overall, our results indicate that the proposed model can overcome the problems of the Thurstonian IRT model when all statements are positively phrased and factor loadings are similar.
Affiliation(s)
- Chia-Wen Chen
- Centre for Educational Measurement, University of Oslo, Oslo, Norway
- Wen-Chung Wang
- Assessment Research Centre, The Education University of Hong Kong, Tai Po, Hong Kong SAR, China
- Magdalena Mo Ching Mok
- Assessment Research Centre, The Education University of Hong Kong, Tai Po, Hong Kong SAR, China
- Graduate Institute of Educational Information and Measurement, National Taichung University of Education, Taichung, Taiwan
- Ronny Scherer
- Centre for Educational Measurement, University of Oslo, Oslo, Norway
13
A genetic algorithm for optimal assembly of pairwise forced-choice questionnaires. Behav Res Methods 2021; 54:1476-1492. PMID: 34505277; PMCID: PMC9170671; DOI: 10.3758/s13428-021-01677-4.
Abstract
The use of multidimensional forced-choice questionnaires has been proposed as a means of improving validity in the assessment of non-cognitive attributes in high-stakes scenarios. However, the reduced precision of trait estimates in this questionnaire format is an important drawback. Accordingly, this article presents an optimization procedure for assembling pairwise forced-choice questionnaires while maximizing posterior marginal reliabilities. This procedure is performed through the adaptation of a known genetic algorithm (GA) for combinatorial problems. In a simulation study, the efficiency of the proposed procedure was compared with that of a quasi-brute-force (BF) search. For this purpose, five-dimensional item pools were simulated to emulate the real problem of generating a forced-choice personality questionnaire under the five-factor model. Three factors were manipulated: (1) the length of the questionnaire, (2) the relative item pool size with respect to the questionnaire’s length, and (3) the true correlations between traits. The recovery of the person parameters for each assembled questionnaire was evaluated through the squared correlation between estimated and true parameters, the root mean square error between the estimated and true parameters, the average difference between the estimated and true inter-trait correlations, and the average standard error for each trait level. The proposed GA offered more accurate trait estimates than the BF search within a reasonable computation time in every simulation condition. Such improvements were especially important when measuring correlated traits and when the relative item pool sizes were larger. A user-friendly online implementation of the algorithm was made available to users.
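The combinatorial idea behind such a GA can be sketched schematically. The following is a minimal, self-contained illustration (partition an item pool into pairs and evolve the pairing), not the authors' implementation: the fitness function is a crude stand-in for the posterior-marginal-reliability objective, and all names are hypothetical.

```python
import random

def random_pairing(items, rng):
    """Randomly partition an even-sized item list into pairs (one candidate questionnaire)."""
    shuffled = items[:]
    rng.shuffle(shuffled)
    return [tuple(sorted(shuffled[i:i + 2])) for i in range(0, len(shuffled), 2)]

def fitness(pairing, loading):
    """Stand-in objective: reward pairs whose items measure different traits
    and differ in loading magnitude (a crude proxy for informative blocks)."""
    score = 0.0
    for a, b in pairing:
        trait_a, load_a = loading[a]
        trait_b, load_b = loading[b]
        score += abs(load_a - load_b) + (1.0 if trait_a != trait_b else 0.0)
    return score

def mutate(pairing, rng):
    """Swap one item between two randomly chosen pairs, keeping a valid partition."""
    new = [list(p) for p in pairing]
    i, j = rng.sample(range(len(new)), 2)
    new[i][1], new[j][1] = new[j][1], new[i][1]
    return [tuple(sorted(p)) for p in new]

def ga_assemble(items, loading, generations=200, pop_size=20, seed=1):
    """Elitist GA: keep the fittest half each generation, refill with mutants."""
    rng = random.Random(seed)
    population = [random_pairing(items, rng) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda p: fitness(p, loading), reverse=True)
        survivors = population[: pop_size // 2]  # truncation selection
        population = survivors + [mutate(rng.choice(survivors), rng) for _ in survivors]
    return max(population, key=lambda p: fitness(p, loading))
```

A real implementation would score candidates with IRT-based reliability estimates and typically add crossover; truncation selection plus swap mutation is enough here to show the search loop.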
14
Qiu XL, Wang WC. Assessment of Differential Statement Functioning in Ipsative Tests With Multidimensional Forced-Choice Items. Applied Psychological Measurement 2021; 45:79-94. PMID: 33627915; PMCID: PMC7876635; DOI: 10.1177/0146621620965739.
Abstract
Ipsative tests with multidimensional forced-choice (MFC) items have been widely used to assess career interest, values, and personality while preventing response biases. Recently, there has been a surge of interest in developing item response theory models for MFC items. In reality, a statement in an MFC item may have different utilities for different groups, which is referred to as differential statement functioning (DSF). However, few studies have investigated methods for detecting DSF, owing to the challenges posed by the features of ipsative tests. In this study, three methods were adapted for DSF assessment in MFC items: equal-mean-utility (EMU), all-other-statement (AOS), and constant-statement (CS). Simulation studies were conducted to evaluate the recovery of parameters and the performance of the proposed methods. Results showed that statement parameters and DSF parameters were well recovered for all three methods when the test did not contain any DSF statement. When the test contained one or more DSF statements, only the CS method yielded accurate estimates. With respect to DSF assessment, both the EMU method using the bootstrap standard error and the AOS method performed appropriately so long as the test did not contain any DSF statement. The CS method performed well in cases where one or more DSF-free statements were chosen as an anchor. The longer the anchor statement set, the higher the power of DSF detection.
Affiliation(s)
- Xue-Lan Qiu
- The University of Hong Kong, Pok Fu Lam, Hong Kong
- Wen-Chung Wang
- The Education University of Hong Kong, New Territories, Hong Kong
15
Adaptive testing with the GGUM-RANK multidimensional forced choice model: Comparison of pair, triplet, and tetrad scoring. Behav Res Methods 2020; 52:761-772. PMID: 31342469; DOI: 10.3758/s13428-019-01274-6.
Abstract
Likert-type measures have been criticized in psychological assessment because they are vulnerable to response biases, including central tendency, acquiescence, leniency, halo, and socially desirable responding. As an alternative, multidimensional forced choice (MFC) testing has been proposed to address these concerns. A number of researchers have developed item response theory (IRT) models for MFC data and have examined latent trait estimation with tests of different dimensionality and length. Research has also explored the advantages of computerized adaptive testing (CAT) with MFC pair tests having as many as 25 dimensions, but there have been no published studies on CAT with MFC triplets or tetrads. Thus, in this research we aimed to address that issue. We used recently developed item information functions for an MFC ranking model to compare the benefits of CAT with MFC pair, triplet, and tetrad tests. A simulation study showed that CAT substantially outperformed nonadaptive testing for latent trait estimation across MFC formats. More importantly, CAT with MFC pairs provided estimation accuracy similar to or better than that from tests of equivalent numbers of nonadaptive MFC triplets. On the basis of these findings, implications and recommendations are further discussed for constructing MFC measures to use in psychological contexts.
16
Lee P, Joo SH, Stark S. Detecting DIF in Multidimensional Forced Choice Measures Using the Thurstonian Item Response Theory Model. Organizational Research Methods 2020. DOI: 10.1177/1094428120959822.
Abstract
Although modern item response theory (IRT) methods of test construction and scoring have overcome ipsativity problems historically associated with multidimensional forced choice (MFC) formats, there has been little research on MFC differential item functioning (DIF) detection, where item refers to a block, or group, of statements presented for an examinee’s consideration. This research investigated DIF detection with three-alternative MFC items based on the Thurstonian IRT (TIRT) model, using omnibus Wald tests on loadings and thresholds. We examined constrained and free baseline model comparisons strategies with different types and magnitudes of DIF, latent trait correlations, sample sizes, and levels of impact in an extensive Monte Carlo study. Results indicated the free baseline strategy was highly effective in detecting DIF, with power approaching 1.0 in the large sample size and large magnitude of DIF conditions, and similar effectiveness in the impact and no-impact conditions. This research also included an empirical example to demonstrate the viability of the best performing method with real examinees and showed how a DIF and a DTF effect size measure can be used to assess the practical significance of MFC DIF findings.
17
Ng V, Lee P, Ho MHR, Kuykendall L, Stark S, Tay L. The Development and Validation of a Multidimensional Forced-Choice Format Character Measure: Testing the Thurstonian IRT Approach. J Pers Assess 2020; 103:224-237. [DOI: 10.1080/00223891.2020.1739056] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/24/2022]
Affiliation(s)
- Vincent Ng, Department of Psychology, University of Houston, Houston, Texas
- Philseok Lee, Department of Psychology, George Mason University, Fairfax, Virginia
- Moon-Ho Ringo Ho, School of Humanities and Social Sciences, Nanyang Technological University, Singapore
- Lauren Kuykendall, Department of Psychology, George Mason University, Fairfax, Virginia
- Stephen Stark, Department of Psychology, University of South Florida, Tampa, Florida
- Louis Tay, Department of Psychological Sciences, Purdue University, West Lafayette, Indiana
18
Bürkner PC, Schulte N, Holling H. On the Statistical and Practical Limitations of Thurstonian IRT Models. EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 2019; 79:827-854. [PMID: 31488915 PMCID: PMC6713979 DOI: 10.1177/0013164419832063] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/15/2023]
Abstract
Forced-choice questionnaires have been proposed to avoid common response biases typically associated with rating scale questionnaires. To overcome ipsativity issues of trait scores obtained from classical scoring approaches of forced-choice items, advanced methods from item response theory (IRT) such as the Thurstonian IRT model have been proposed. For convenient model specification, we introduce the thurstonianIRT R package, which uses Mplus, lavaan, and Stan for model estimation. Based on practical considerations, we establish that items within one block need to be equally keyed to achieve similar social desirability, which is essential for creating forced-choice questionnaires that have the potential to resist faking intentions. According to extensive simulations, measuring up to five traits using blocks of only equally keyed items does not yield sufficiently accurate trait scores or inter-trait correlation estimates, for either frequentist or Bayesian estimation methods. As a result, persons' trait scores remain partially ipsative and, thus, do not allow for valid comparisons between persons. However, we demonstrate that trait scores based on only equally keyed blocks can be improved substantially by measuring a sizable number of traits. More specifically, in our simulations of 30 traits, scores based on only equally keyed blocks were non-ipsative and highly accurate. We conclude that in high-stakes situations where persons are motivated to give fake answers, Thurstonian IRT models should only be applied to tests measuring a sizable number of traits.
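The Thurstonian IRT model at the center of this entry can be sketched for the simplest (pairwise) case. In standard TIRT notation, each statement's latent utility is t = mu + lambda * theta + error, and the probability that statement i is preferred to statement j is a probit function of the utility difference. The function below is an illustrative sketch in that notation, not code from the thurstonianIRT package (which is written in R); parameter names are assumptions.

```python
from math import erf, sqrt

def tirt_pair_probability(mu_i, mu_j, lam_i, lam_j, theta_a, theta_b,
                          psi2_i, psi2_j):
    """Probability that statement i is preferred to statement j in a
    Thurstonian IRT pair.

    mu_i, mu_j       : statement means (intercepts)
    lam_i, lam_j     : factor loadings of the two statements
    theta_a, theta_b : the latent traits each statement measures
    psi2_i, psi2_j   : independent error (uniqueness) variances

    The preference is Phi(z), the standard normal CDF of the scaled
    difference of the two latent utilities.
    """
    z = (mu_i - mu_j + lam_i * theta_a - lam_j * theta_b) / sqrt(psi2_i + psi2_j)
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))
```

When both statements have equal means, loadings, and trait levels, the model predicts indifference: the preference probability is exactly 0.5.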
19
Chen C, Wang W, Chiu MM, Ro S. Item Selection and Exposure Control Methods for Computerized Adaptive Testing with Multidimensional Ranking Items. JOURNAL OF EDUCATIONAL MEASUREMENT 2019. [DOI: 10.1111/jedm.12252] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/27/2022]
20
Walton KE, Cherkasova L, Roberts RD. On the Validity of Forced Choice Scores Derived From the Thurstonian Item Response Theory Model. Assessment 2019; 27:706-718. [PMID: 31007043 DOI: 10.1177/1073191119843585] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/07/2023]
Abstract
Forced choice (FC) measures may be a desirable alternative to single stimulus (SS) Likert items, which are easier to fake and can have associated response biases. However, classical methods of scoring FC measures lead to ipsative data, which have a number of psychometric problems. A Thurstonian item response theory (TIRT) model has been introduced as a way to overcome these issues, but few empirical validity studies have been conducted to ensure its effectiveness. This was the goal of the current three studies, which used FC measures of domains from popular personality frameworks including the Big Five and HEXACO, and both statement and adjective item stems. We computed TIRT and ipsative scores and compared their validity estimates. Convergent and discriminant validity of the scores were evaluated by correlating them with SS scores, and test-criterion validity evidence was evaluated by examining their relationships with meaningful outcomes. In all three studies, there was evidence for the convergent and test-criterion validity of the TIRT scores, though at times this was on par with the validity of the ipsative scores. The discriminant validity of the TIRT scores was problematic and was often worse than that of the ipsative scores.
Affiliation(s)
- Richard D Roberts, Research and Assessment Design (RAD): Science Solution, Philadelphia, PA, USA
21
Wang C, Weiss DJ. Multivariate Hypothesis Testing Methods for Evaluating Significant Individual Change. APPLIED PSYCHOLOGICAL MEASUREMENT 2018; 42:221-239. [PMID: 29881123 PMCID: PMC5985704 DOI: 10.1177/0146621617726787] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
The measurement of individual change has been an important topic in both education and psychology. For instance, teachers are interested in whether students have significantly improved (e.g., learned) from instruction, and counselors are interested in whether particular behaviors have been significantly changed after certain interventions. Although classical test methods have been unable to adequately resolve the problems in measuring change, recent approaches for measuring change have begun to use item response theory (IRT). However, all prior methods mainly focus on testing whether growth is significant at the group level. The present research targets a key research question: Is the "change" in latent trait estimates for each individual significant across occasions? Many researchers have addressed this research question assuming that the latent trait is unidimensional. This research generalizes their earlier work and proposes four hypothesis testing methods to evaluate individual change on multiple latent traits: a multivariate Z-test, a multivariate likelihood ratio test, a multivariate score test, and a Kullback-Leibler test. Simulation results show that these tests hold promise for detecting individual change with low Type I error and high power. A real-data example from an educational assessment illustrates the application of the proposed methods.
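The first of the four tests listed above, the multivariate Z-test, has a simple Wald-type form: the squared standardized difference between the two occasions' trait estimates, which under the no-change hypothesis is asymptotically chi-square with d degrees of freedom. The sketch below is an illustration of that general form, not the authors' code; the function name and argument conventions are assumptions.

```python
import numpy as np

def wald_change_statistic(theta1, theta2, cov1, cov2):
    """Wald-type statistic (multivariate Z-test) for whether one person's
    latent-trait vector changed between two measurement occasions.

    theta1, theta2 : trait estimates at occasions 1 and 2 (length d)
    cov1, cov2     : their estimated covariance matrices, e.g. the inverse
                     Fisher information evaluated at each estimate

    Under H0 (no change) the statistic is asymptotically chi-square with
    d degrees of freedom.
    """
    diff = np.asarray(theta2, float) - np.asarray(theta1, float)
    # Standardize the difference by the combined uncertainty of both
    # occasions' estimates.
    return float(diff @ np.linalg.inv(np.asarray(cov1) + np.asarray(cov2)) @ diff)
```

The statistic is then compared with a chi-square critical value, e.g. 5.991 for d = 2 traits at alpha = .05: with estimates (0, 0) and (1, 1) and covariance 0.5 * I at each occasion, the statistic is 2.0, so that change would not be flagged as significant.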
Affiliation(s)
- Chun Wang, University of Minnesota, Minneapolis, MN, USA