Biased Bivariate Correlations in Combined Survey Data Measured With Different Instruments
Authors
Abstract
Social scientists increasingly form composite datasets using data from different survey programs, which often use different single-question instruments to measure the same latent construct. This creates an obstacle when we want to run analyses on the combined data, since scores measured with different instruments are not necessarily comparable. In this paper, I explore one consequence of such comparability problems. Specifically, I examine the case where instruments measuring the same construct have different item difficulties. This means that, if we applied the instruments to the same population, we would obtain different mean responses. If such mean differences are not mitigated before combining data, we introduce a mean bias into our composite data. Such mean bias has direct consequences for analyses based on the combined data: in data drawn from the same population, it introduces error variance; in data drawn from different populations, it biases or even inverts true population differences. Beyond these effects, I demonstrate that mean bias can also distort bivariate correlations when one or both variables in a composite dataset are affected. If differences in item difficulty are not mitigated before combining data, we introduce a variant of Simpson’s paradox into our data: the bivariate correlation in each source survey can differ substantially from the correlation in the composite dataset. In a set of systematic simulations, I demonstrate this correlation bias and show how it changes depending on the mean biases in each variable and the strength of the underlying true correlation.
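The mechanism described in the abstract can be illustrated with a minimal simulation. The sketch below is not taken from the paper's own simulation study; it simply assumes standardized normal scores, two equally sized source surveys drawn from the same population with a true correlation of 0.3, and a hypothetical mean bias of one standard deviation on both variables in the second survey. Pooling the data then inflates the correlation, while an opposing bias on the two variables attenuates it toward zero.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000          # respondents per source survey
true_r = 0.3         # true within-survey correlation
cov = [[1.0, true_r], [true_r, 1.0]]

# Two surveys drawn from the same population, same true correlation.
a = rng.multivariate_normal([0.0, 0.0], cov, size=n)
b = rng.multivariate_normal([0.0, 0.0], cov, size=n)

# Hypothetical instrument difference: survey B's instruments yield
# scores shifted by +1 SD on both variables (mean bias, same direction).
b_same = b + np.array([1.0, 1.0])
# Alternative: mean biases in opposite directions on the two variables.
b_opposed = b + np.array([1.0, -1.0])

r_a = np.corrcoef(a.T)[0, 1]
r_pooled_same = np.corrcoef(np.vstack([a, b_same]).T)[0, 1]
r_pooled_opposed = np.corrcoef(np.vstack([a, b_opposed]).T)[0, 1]

print(f"within-survey r:           {r_a:.3f}")
print(f"pooled r, same-sign bias:  {r_pooled_same:.3f}")
print(f"pooled r, opposed bias:    {r_pooled_opposed:.3f}")
```

Under these assumptions the pooled covariance gains a between-survey component of 0.25 on each variable (group means 0 and 1, so the variance of the group means is 0.25), giving a pooled correlation of about (0.3 + 0.25) / 1.25 = 0.44 in the same-sign case and about (0.3 - 0.25) / 1.25 = 0.04 in the opposed case, even though the correlation within each survey remains 0.3.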