Biased Bivariate Correlations in Combined Survey Data Measured With Different Instruments

Social scientists increasingly form composite datasets using data from different survey programs, which often use different single-question instruments to measure the same latent construct. This creates an obstacle when we want to run analyses using the combined data, since the scores measured with different instruments are not necessarily comparable. In this paper, we explore one consequence of such comparability problems. Specifically, we examine the case where instruments measuring the same construct have different item difficulties. This means that if we applied the instruments to the same population, we would get different mean responses. If such mean differences are not mitigated before combining data, we introduce a mean bias into our composite data. Such mean bias has direct consequences for analyses based on the combined data. In data drawn from the same population, mean bias introduces error variance. In data drawn from different populations, it can bias or even invert true population differences. However, in this paper I demonstrate that mean bias can also bias bivariate correlations if one or both variables in a composite dataset are subject to mean bias. If differences in item difficulty are not mitigated before combining data, we introduce a variant of Simpson’s paradox into our data: The bivariate correlation in each source survey might differ substantially from the correlation in the composite dataset. In a set of systematic simulations, I demonstrate this correlation bias effect.

Surveys in the social sciences often use single-question instruments to measure latent constructs, such as attitudes, values, interests, or emotions (Tourangeau et al., 2000). Furthermore, the same construct is often measured with different instruments in different surveys (Tomescu-Dubrow & Slomczynski, 2016). Instruments might differ in their wording, response option labels, number of response options, or other design aspects. This instrument diversity is challenging when we want to combine data from different survey sources for a joint analysis (Singh, 2020; Tomescu-Dubrow & Slomczynski, 2016). Such so-called ex-post harmonization research projects are becoming increasingly common: from international comparative research (Dubrow & Tomescu-Dubrow, 2016; Durand et al., 2021; May et al., 2021), to integrative meta-analyses (Hussong et al., 2013), to research projects integrating national data on specific substantive topics (Schulz et al., 2022).
To mitigate comparability issues due to instrument diversity, such research projects must employ ex-post harmonization techniques (Granda et al., 2010). In other words, researchers have to carefully select, prepare, transform, and combine source data to create a homogeneous target dataset. For example, when researchers aim to harmonize single-question instruments for the same latent construct, they have to ensure that the instruments do in fact measure the same construct, and they have to assess the reliability (i.e., robustness against random error) of each instrument to avoid introducing bias through attenuation into their harmonized dataset (Kolen & Brennan, 2014; Singh, 2022).
However, even if two instruments were perfect, error-free measures of the same construct, researchers would still face the challenge of aligning the units of measurement (i.e., scales) of their different instruments (Kolen & Brennan, 2014; Price, 2017). Instruments tend to differ in the numerical scale with which they represent a construct in the source data. This is not an error, per se. Latent constructs have no natural units, and we can use arbitrary scales to represent latent construct intensity (i.e., respondents' positions on a latent dimension) numerically (Price, 2017). This is easiest to see when we compare instruments with a different number of response options (i.e., scale points). If we measure the same population with a four-point scale and an eleven-point scale, we will most likely measure a higher average response and standard deviation with the eleven-point scale. This is because we scale (or map) the same construct intensities onto a different numerical scheme. However, the number of scale points is only one factor of many. The measurement units also depend on the question wording, the response labels, the visual layout, or any number of other design characteristics (Price, 2017; Tourangeau et al., 2000).
This paper aims to demonstrate how insufficient harmonization efforts can cause substantive and complex bias in subsequent analyses using the combined data. Specifically, we explore the last link in a chain of biases. If we do not properly align the measurement units (i.e., scales) of different measurement instruments before combining the data, we often incur a mean bias in the combined data (Kolen & Brennan, 2014). By mean bias, I mean that two instruments applied to the same population would result in systematically different average response scores. This is problematic, because it might introduce spurious population differences and needless error variance in analyses with the combined data. However, in this paper I set out to demonstrate that such mean biases have another, less intuitive consequence: They also bias correlations based on the combined data. Using a three-dimensional matrix of simulated bivariate correlations with varying degrees of mean bias, the paper sounds out the extent of this bias. Since the simulation matrix varies the mean bias in both variables separately, as well as the underlying unbiased correlation, we can also explore how the resulting correlation bias depends on these three factors. In sum, the results of these simulations inform harmonization practitioners about the potential extent and shape of this often-overlooked form of bias in combined (survey) data.

Mean Bias
A substantive problem caused by incomparable and insufficiently harmonized measurement units is mean bias (Kolen & Brennan, 2014; Singh, 2022). Mean bias is best understood through the following thought experiment. Imagine applying two instruments X and Y to sufficiently large random samples A and B of the same population. If ex-post harmonization was successful, we would expect the mean responses to be approximately equal: $\bar{X} \approx \bar{Y}$. This is because the average true score should be approximately the same in two random samples of the same population: $\bar{\tau}_A \approx \bar{\tau}_B$ (Kolen & Brennan, 2014; Singh, 2020). However, without adequate ex-post harmonization, we might find that the average responses to two congeneric instruments for the same construct differ by some constant d: $\bar{X} = \bar{Y} + d$ (Price, 2017; Raykov & Marcoulides, 2011). In other words, combining data across the two instruments introduces a mean bias d. This can easily occur if two instruments for the same construct have different item difficulties (Moosbrugger & Kelava, 2012). Respondents may find one instrument wording easier to agree to than the alternative, and thus the same population of respondents would choose a higher mean response on one instrument than on the other.
In practical terms, this means that even after (insufficient) harmonization, an average respondent from our measured population would be represented by different numerical scores in our combined data (Kolen & Brennan, 2014). Of course, mean bias can also occur if instruments X and Y are applied to different populations. However, in such cases we cannot easily isolate the bias for single-question instruments, because the difference between $\bar{X}$ and $\bar{Y}$ is then a composite of the true construct difference $\bar{\tau}_A - \bar{\tau}_B$ and the bias. Again, in practical terms, this means that the mean differences between the populations are either over- or underestimated by an amount proportional to the mean bias (Kolen & Brennan, 2014).
For a concrete example, consider two very similar measures of political interest. In Germany, the International Social Survey Programme (ISSP) is fielded together with the German General Social Survey (ALLBUS). In 2014, both asked respondents about their level of political interest. The ALLBUS asked "How strongly are you interested in politics" with a five-point scale (GESIS-Leibniz-Institut Für Sozialwissenschaften, 2018), and the ISSP asked "How interested would you say you are in politics?" with a four-point scale (ISSP Research Group, 2016). Due to the different number of response options alone, we would expect a different mean response. And indeed, the mean response differed by Cohen's d = 0.69 (Singh, 2022). To align these differences in scale points (and thus scale range), harmonization practitioners often apply linear stretching (Cohen et al., 1999; de Jonge et al., 2017; Durand et al., 2021). This means that the scale ranges (i.e., maximum score minus minimum score) are aligned by setting the minimum responses and the maximum responses as equal across instruments and then stretching all scores equidistantly in between. In our example, we would linearly stretch the scores 1, 2, 3, and 4 of the four-point ISSP instrument to 1, 2.33, 3.67, and 5.
However, after this transformation the average responses still differed by d = 0.38 between the two instruments (Singh, 2022). As it turns out, the two instruments differed in more than their scale range. The two instruments also have different item difficulties: P = 43 for the ALLBUS instrument and P = 33 for the ISSP instrument (Singh, 2022). In other words, the average respondent chose a score that was 43% along the range from 1 to 5 in the ALLBUS instrument but chose a score that was only 33% along the range from 1 to 4 in the ISSP instrument (Moosbrugger & Kelava, 2012; Singh, 2022). Linear stretching, however, only aligns the scale ranges but not the position of the average response within the scale range. Thus, differences in item difficulty between two instruments remain untouched and can cause a substantive remaining mean bias when data from the two instruments are combined.
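The stretching step and the item difficulty measure described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in the cited studies: the function names are hypothetical, item difficulty is computed as the position of the mean within the scale range (as described in the text), and the mean of 1.99 is a made-up value chosen only so that P is about 33.

```python
def linear_stretch(score, src_min, src_max, tgt_min, tgt_max):
    """Map a score from the source scale range onto the target scale range."""
    return tgt_min + (score - src_min) * (tgt_max - tgt_min) / (src_max - src_min)


def item_difficulty(mean_score, scale_min, scale_max):
    """Position of the mean response within the scale range, in percent."""
    return 100 * (mean_score - scale_min) / (scale_max - scale_min)


# Stretch the four-point scores onto the five-point range, as in the text.
stretched = [round(linear_stretch(s, 1, 4, 1, 5), 2) for s in (1, 2, 3, 4)]
print(stretched)  # [1.0, 2.33, 3.67, 5.0]

# Hypothetical mean on the four-point scale (assumption, not a reported value).
mean_four_point = 1.99
p_before = item_difficulty(mean_four_point, 1, 4)
p_after = item_difficulty(linear_stretch(mean_four_point, 1, 4, 1, 5), 1, 5)
print(round(p_before, 1), round(p_after, 1))  # 33.0 33.0
```

Because linear stretching is an affine map of the whole scale range, the relative position of the mean within that range is the same before and after stretching, which is why a difference in item difficulty survives the transformation.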
Such differences in difficulty between single-question instruments can be mitigated if we apply a more suitable harmonization method than linear stretching. One example is observed score equating in a random groups design: OSE-RG (Kolen & Brennan, 2014; Singh, 2022). Alternatively, such difficulty differences can also be mitigated via multiple imputation. However, in this paper, we want to explore what happens if we fail to mitigate mean bias or, in terms of our example, what would have happened if we had only used linear stretching. After all, harmonization practitioners might be unaware of the limitations of linear stretching. Or they may find the more suitable harmonization techniques unfeasible in their projects, because both approaches have data requirements that are not always easy to meet. OSE-RG, for example, requires random groups data, that is, samples for both instruments drawn randomly from the same population. Harmonizing two instruments for the same construct with multiple imputation, meanwhile, requires a calibration sample (Siddique et al., 2015), that is, a sample in which each respondent answered both instruments, but in a way that does not lead to question order effects.
The first consequence of mean bias in composite data is straightforward: Under mean bias, scores derived from different instruments are biased by an additive constant. In data drawn from the same population, this introduces error variance. Even more worrisome, in data from different populations measured with different instruments, the mean bias is mingled with true population differences. Thus, if two populations differ on average, we can no longer be certain that this is a true population difference. Instead, it could be a methodological artifact.
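The error variance point can be made concrete with a short worked equation. It is a sketch under assumptions that are not stated in the text: both samples come from the same population, both instruments yield the same within-sample variance $s^2$, and a share $p$ of the combined cases comes from the instrument whose scores are shifted by $d$. The symbols $X_C$ (the combined variable) and $p$ are introduced here for illustration only.

$$\operatorname{Var}(X_C) = s^2 + p\,(1 - p)\,d^2$$

With equally large samples ($p = .5$), standardized scores ($s = 1$), and the remaining mean bias of $d = 0.38$ from the example above, this would give $1 + .25 \cdot 0.38^2 \approx 1.036$, that is, roughly 3.6% additional variance that reflects the instruments rather than the respondents.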

Biased Bivariate Correlations due to Mean Bias
In this paper, however, I will demonstrate with systematic simulations that bivariate correlations between two variables in a harmonized dataset can be biased as well, if one or both variables are subject to mean bias. This might seem surprising, because Pearson product-moment correlations are unaffected by linear transformations of variable scores. Specifically, an additive constant d would not change the correlation coefficient r, because adding a constant to each score changes the arithmetic mean by the same constant (adapted from Gill, 2008):
$$\frac{1}{n}\sum_{i=1}^{n}(x_i + d) = \bar{x} + d$$
Thus, the Pearson product-moment correlation formula (Gill, 2008) remains unchanged by an additive constant applied to all values of x (or y):
$$r = \frac{\sum_{i=1}^{n}\big((x_i + d) - (\bar{x} + d)\big)(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}\big((x_i + d) - (\bar{x} + d)\big)^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$
However, this intuition is misleading, because in harmonization we combine data from different sources. In our combined dataset, a vector of responses for a construct would thus be composed of scores derived from different instruments. Consequently, mean bias does not add a constant to the whole variable, as above. Instead, mean bias adds different additive biases to different segments of the combined variable. The argument above no longer holds if we add a constant d to some $x_i$, but not all.
Imagine the following simplified combined dataset with data from surveys A and B. Both surveys measured the constructs P and Q. However, the surveys used different instruments for each construct, leading to a total of four instruments. If we combine these data, we arrive at the following combined dataset structure: a dataset with two variables, one for construct P and one for construct Q.
Crucially, each construct variable (i.e., vector) is a composite of values from two surveys and thus two instruments. In summary, there are three vectors for construct P: $P_A$ from survey A, $P_B$ from survey B, and $P_C$ as the composite vector in the joint dataset with data from two different instruments combined. Analogously, for construct Q there are the vectors $Q_A$, $Q_B$, and $Q_C$. This composite structure of variables in the combined dataset is crucial for understanding the impact of mean bias. After all, mean bias introduced by instrument differences does not affect the whole composite variable. Instead, it would be as if we added a constant to only half of the composite variable. Figure 1 shows the data structure schematically on the left. In this example, we assume that the scale of instrument $P_B$ assigns scores that are on average $d_P = -2$ lower than the scores that instrument $P_A$ would assign for respondents with the same true score in construct P. Instrument $Q_B$, meanwhile, assigns scores that are on average $d_Q = 2$ higher than the scores that instrument $Q_A$ would assign for respondents with the same true score in construct Q.
If we now plot this combined dataset but differentiate by source surveys (and thus source instruments), we observe a surprising pattern in Figure 1 on the right. Within the data from each survey, constructs P and Q are perfectly correlated with $r_A = r_B = 1.0$ (pink and green trend lines). However, if we correlate $P_C$ and $Q_C$ across the combined data, the correlation drops to a mere $r_C = .11$ (blue trend line).
What has happened? The conditional correlations in the two groups are identical, but the correlation across both groups is substantially different. Through the mean bias in P and in Q, we have introduced a version of Simpson's paradox into our combined data (Rücker & Schumacher, 2008). In general, Simpson's paradox describes an empirical pattern where we observe the same relationship between two variables in two (or more) groups separately, but a substantially different relationship in analyses across the groups. Of course, Simpson's paradox is not an actual paradox. Instead it is "a form of bias, resulting from heterogeneity in the data [that has not been] accounted for" (Rücker & Schumacher, 2008). If we decompose the problem, we see that the overall correlation $r_C$ is a composite of the individual source survey (and thus source instrument) correlations, $r_A$ and $r_B$, on the one hand, and of the spurious correlation introduced by the mean biases ($d_P$ and $d_Q$ in red) on the other. We can visualize this mean-bias-induced competing correlation as a trend line through the two bivariate group means in surveys A and B (the plus signs on the trend lines of surveys A and B). In other words, if we do not account for mean bias, ideally by removing it with a suitable harmonization procedure, we bias correlations by $\Delta r = r_C - r_u$, where $r_u$ denotes the unbiased correlation.
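The pattern can be reproduced with a handful of made-up scores. The sketch below uses hypothetical toy data (the scores 1 to 4 for both constructs in survey A); these are not the data behind Figure 1, which the text does not report. It contrasts a constant added to the whole variable with the same constants added only to the survey B half, using the example's $d_P = -2$ and $d_Q = 2$.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation of two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical toy scores for survey A (assumption, not the Figure 1 data).
p_a = [1, 2, 3, 4]
q_a = [1, 2, 3, 4]

# Survey B: same respondents' true scores, shifted by the mean biases.
d_p, d_q = -2, 2
p_b = [v + d_p for v in p_a]
q_b = [v + d_q for v in q_a]

# A constant added to every value leaves the correlation untouched.
r_all_shifted = pearson_r([v + d_p for v in p_a], q_a)

# Within each survey the correlation is perfect ...
r_a = pearson_r(p_a, q_a)
r_b = pearson_r(p_b, q_b)

# ... but the composite vectors P_C and Q_C tell a different story.
r_c = pearson_r(p_a + p_b, q_a + q_b)

print(r_all_shifted, r_a, r_b, round(r_c, 2))  # 1.0 1.0 1.0 0.11
```

With these particular toy values the combined correlation comes out at about .11, the same magnitude as the value shown for Figure 1, even though both within-survey correlations are exactly 1.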
This simple example showed that, in principle, mean bias between source instruments can result in biased correlations. However, under what circumstances does this correlation bias occur, and with what intensity? In this paper, I set out to map the landscape of this bias with a series of systematic simulations. Specifically, the simulations will demonstrate that the correlation bias due to mean bias in composite data is determined by the interaction of three factors. The first factor is the mean bias in the composite variable $P_C$. Graphically, in Figure 1 above, we would shift the data of survey B left or right. The second factor is the mean bias in the composite variable $Q_C$. Graphically, we would shift the data of survey B up or down. The third factor is the strength of the unbiased correlation $r_u$ between the constructs P and Q. By unbiased correlation, I mean the correlation we would expect in the absence of mean biases: $r_u = r_C \mid d_P = d_Q = 0$. Graphically, lower or higher values of $r_u$ would mean that the data points of A and B would fluctuate more or less around the diagonal trend lines for each survey.
By systematically varying all plausible combinations of those three factors, we can map out plausible boundaries of the mean-bias-induced correlation bias $\Delta r$. With these simulations, I aim to provide practical insights into the following questions:
1. How large is the maximum potential bias for plausible mean biases $-1 \le d \le 1$?
2. How do different combinations of mean biases $d_P$ and $d_Q$ impact the correlation bias?
3. How do different unbiased correlations $r_u$ impact the correlation bias?
4. Under which conditions are absolute empirical effect sizes $|r_C|$ over- or underestimations of the unbiased absolute effect size $|r_u|$?
5. Can mean bias cause correlations to change direction?
The overarching goal of this paper is to clearly map out the extent and shape of a bias in survey data harmonization that practitioners might not have previously considered. Armed with this intuition, harmonization practitioners can better anticipate the risk of incurring a correlation bias in their analyses if substantive variables composed of different source variables are not adequately harmonized.

Method

Software
All simulations, analyses, and plots were done in R (R Core Team, 2021) within the RStudio IDE (RStudio Team, 2022). The tidyverse package collection (Wickham et al., 2019) was used for data transformation, automation, and data visualization. The pairs of correlated variables were simulated using the faux package (DeBruine, 2021).

Simulation
To answer the research questions, a three-dimensional matrix of simulations was computed. The matrix contains simulated estimates of $\Delta r$ for different bias configurations, that is, combinations of plausible values of the mean biases $d_P$ and $d_Q$ in the composite variables and of the unbiased correlation $r_u$. This bias configuration matrix allows us to systematically map out the resulting correlation biases for all combinations of mean biases and unbiased bivariate correlations. This is crucial because, as we will see in the results, the correlation bias changes drastically depending on these three factors. In the following, I first describe the basic assumptions, then the process of generating a single correlation simulation for a single bias configuration, and finally the structure of the whole bias configuration matrix.

Simulation Parameters
To clearly isolate the effects of mean bias, the simulation uses continuous, standard normally distributed pairs of variables (i.e., $\bar{x} = 0$, $s = 1$) with a predefined covariance and thus a predefined correlation $r_u$. These variables were randomly generated using the faux package (DeBruine, 2021). Each simulated vector had a length of 10,000 elements. Since each vector was then duplicated (see next section), each simulation created a combined dataset C with two variables and a sample size of N = 20,000. Mean bias is introduced by adding a constant to every value of a variable. This creates pure mean bias in the sense that the mean is the only distribution parameter that changes. Furthermore, since all source variables have the same standard deviation of s = 1, raw mean differences can be directly interpreted as Cohen's d values.

Simulated Mean and Correlation Bias
For this paper, I simulate many bivariate correlations with different parameters. Each correlation is determined by three parameters: first, the mean bias $d_P$ in the composite variable $P_C$; second, the mean bias $d_Q$ in the composite variable $Q_C$; and third, the unbiased correlation coefficient $r_u$ between constructs P and Q. Based on those three parameters, a simple harmonized dataset is simulated analogous to the one presented in Figure 1.
The algorithm works like this. First, two vectors of simulated data are created. Both vectors contain standard normally distributed continuous values. The vectors exhibit a bivariate correlation that reflects the unbiased correlation $r_u$ that we aim for in this specific simulation. This setup represents data from survey A with variables $P_A$ and $Q_A$, measured by survey A's instruments for the constructs P and Q. Second, we duplicate the simulated vectors for survey A and modify the copies with the mean bias parameters $d_P$ and $d_Q$. This creates a simulated survey B with variables $P_B$ and $Q_B$ that is identical to survey A except for the different measurement units: For example, a respondent who chose a response x in $P_A$ would have chosen a response $x + d_P$ in $P_B$. This also ensures that the correlations in each survey are the same and equal to the unbiased correlation we aim for: $r_A = r_B = r_u$. Third, we combine the simulated data for survey A and survey B to generate a simulated harmonized dataset C. This means we append $P_B$ to $P_A$ to form $P_C$ and we append $Q_B$ to $Q_A$ to form $Q_C$. Figure 2 below summarizes this process.
Based on these combined vectors $P_C$ and $Q_C$, we can calculate the biased correlation $r_C$. The correlation bias $\Delta r$ can then be calculated as the difference between the correlation biased by mean bias in P and Q and the unbiased correlation: $\Delta r = r_C - r_u$. The correlation bias $\Delta r$ can be interpreted as follows: Positive $\Delta r$ values mean that the biased correlation is numerically higher than the unbiased correlation, and negative $\Delta r$ values mean that the biased correlation is numerically lower than the unbiased correlation. Please note that this is not the same as the absolute correlation effect size being stronger or weaker. For example, a positive $\Delta r$ in the context of a strong negative unbiased correlation means that the correlation is weaker (e.g., $r_u = -1$; $r_C = -0.6$; $\Delta r = 0.4$).
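The three steps can be sketched for a single bias configuration as follows. This is an illustrative Python translation under stated assumptions, not the author's R code: the correlated pairs are generated by mixing two independent standard normal draws instead of calling the faux package, the function names and the seed are hypothetical, and $\Delta r$ is computed against the nominal $r_u$ rather than the sample correlation in survey A.

```python
import random
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation of two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def simulate_survey_a(n, r_u, rng):
    """Step 1: n pairs of standard normal scores with expected correlation r_u."""
    weight = sqrt(max(0.0, 1 - r_u ** 2))  # guard against tiny negative values
    p_a, q_a = [], []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        p_a.append(z1)
        q_a.append(r_u * z1 + weight * z2)
    return p_a, q_a


def simulate_correlation_bias(d_p, d_q, r_u, n=10_000, seed=1):
    """Return delta_r = r_C - r_u for one bias configuration."""
    rng = random.Random(seed)
    p_a, q_a = simulate_survey_a(n, r_u, rng)
    # Step 2: survey B is a copy of survey A shifted by the mean biases.
    p_b = [v + d_p for v in p_a]
    q_b = [v + d_q for v in q_a]
    # Step 3: append B to A to form the composite vectors P_C and Q_C.
    r_c = pearson_r(p_a + p_b, q_a + q_b)
    return r_c - r_u


delta_r = simulate_correlation_bias(d_p=1.0, d_q=1.0, r_u=-1.0)
print(round(delta_r, 2))
```

For this configuration the result should land near the upper end of the bias range that the results report for mean biases of d = 1, with small deviations due to sampling noise.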

Bias Configuration Matrix
The simulation described above covers one possible bias configuration, defined by the mean bias $d_P$ in variable $P_C$, the mean bias $d_Q$ in variable $Q_C$, and the unbiased correlation $r_u$. For each such configuration, we get one correlation bias $\Delta r$. To systematically demonstrate how the correlation bias depends on the interaction of these three simulation parameters, all three parameters were varied from -1 to 1 in 41 discrete steps (i.e., -1, -0.95, -0.90 … 0 … 0.90, 0.95, 1). This means the simulations in this paper cover the mean bias range $-1 \le d \le 1$ in variables $P_C$ and $Q_C$ and the unbiased correlation range $-1 \le r_u \le 1$. Then, all possible combinations of the three discrete parameter steps were formed. This resulted in a three-dimensional matrix of parameters with $41^3 = 68{,}921$ parameter combinations. This parameter matrix was then populated with the correlation bias $\Delta r$ by running the simulation described above for each of these parameter combinations. The end result was a matrix of correlation biases where each simulated correlation bias was determined by a specific combination of the three parameters. Thus, the simulation parameters formed a coordinate system, where a specific bias estimate can be retrieved via its bias coordinate $(d_P, d_Q, r_u)$. Every data view reported in the results section is thus a specific subset of this matrix of correlation biases.
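The coordinate-system idea can be sketched as a dictionary keyed by $(d_P, d_Q, r_u)$. The example below is a scaled-down Python illustration, not the original R implementation: it uses 5 steps per parameter and 2,000 cases per survey so that it runs quickly in pure Python, whereas the paper uses 41 steps and 10,000 cases. All function names, the seed, and the reduced sizes are assumptions made for this sketch.

```python
import random
from itertools import product
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation of two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_bias(d_p, d_q, r_u, n, rng):
    """Simulate one configuration and return delta_r = r_C - r_u."""
    weight = sqrt(max(0.0, 1 - r_u ** 2))
    p_a = [rng.gauss(0, 1) for _ in range(n)]
    q_a = [r_u * p + weight * rng.gauss(0, 1) for p in p_a]
    p_c = p_a + [v + d_p for v in p_a]  # survey B appended to survey A
    q_c = q_a + [v + d_q for v in q_a]
    return pearson_r(p_c, q_c) - r_u


def grid_steps(n_steps):
    """Equally spaced values from -1 to 1 (41 steps give increments of 0.05)."""
    return [round(-1 + 2 * k / (n_steps - 1), 10) for k in range(n_steps)]


def bias_matrix(n_steps=5, n=2_000, seed=1):
    """Map every (d_p, d_q, r_u) coordinate to its simulated correlation bias."""
    rng = random.Random(seed)
    steps = grid_steps(n_steps)
    return {
        (d_p, d_q, r_u): correlation_bias(d_p, d_q, r_u, n, rng)
        for d_p, d_q, r_u in product(steps, repeat=3)
    }


matrix = bias_matrix()
print(len(matrix))                            # 125 configurations for 5 steps
print(round(matrix[(1.0, 1.0, -1.0)], 2))     # look up one bias coordinate
```

Calling `bias_matrix(n_steps=41, n=10_000)` would cover the full set of 68,921 configurations described above, at the cost of a much longer runtime in pure Python.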

Correlation Bias as a Function of the Combined Bivariate Mean Bias Strength
Our first question was: How large is the maximum potential bias for plausible mean biases $-1 \le d \le 1$? To answer this, we can calculate a measure of bivariate mean bias. Specifically, I calculated the Euclidean distance between the two points defined by the means of the two variables in surveys A and B. In Figure 2 above, this would be the distance between the green and pink plus signs signifying the bivariate means in surveys A and B. Since the bivariate mean biases are two-dimensional, we can apply the Pythagorean theorem (Gill, 2008) to calculate the Euclidean distance as follows:
$$\text{distance} = \sqrt{d_P^2 + d_Q^2}$$
If both composite variables are subject to a mean bias of $d_P = d_Q = 1$, then the distance would be $\sqrt{2} = 1.41$, for example. Meanwhile, if one mean bias is zero, then the distance is equal to the other mean bias. And, of course, all other combinations of mean biases work as well. For example, if $d_P = 0.5$ and $d_Q = 1$, then the distance is 1.12. Figure 3 below now shows all simulations plotted by their bivariate mean distance (x-axis) and their correlation bias $\Delta r$ (y-axis). For easier interpretation, I have added two x-axis scales.
The bottom scale shows the raw distance measure. The top x-axis, meanwhile, gives an example for the special case where both mean biases have the same absolute value. So a distance of 0.71 can, for example, be the result of the mean biases $d_P = d_Q = 0.5$.
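The distance measure is easy to check numerically. The following Python lines are a minimal sketch with a hypothetical function name; they reproduce the distances mentioned in the text.

```python
from math import hypot


def bivariate_mean_distance(d_p, d_q):
    """Euclidean distance between the bivariate means of surveys A and B."""
    return hypot(d_p, d_q)


print(round(bivariate_mean_distance(1.0, 1.0), 2))  # 1.41
print(round(bivariate_mean_distance(0.5, 1.0), 2))  # 1.12
print(round(bivariate_mean_distance(0.5, 0.5), 2))  # 0.71
print(round(bivariate_mean_distance(0.0, 0.8), 2))  # 0.8 (one bias is zero)
```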

Figure 3
Correlation Bias as a Function of Bivariate Mean Bias
The graph illustrates that the range of possible correlation biases increases as mean biases increase. Specifically, the range of possible correlation biases increases quadratically.
To show this, I have selected only the highest correlation biases for each distance and fitted a linear model which regressed the correlation bias on distance and quadratic distance. The black trend line shows the result. The numbers represent the maximum positive correlation bias at selected distances. If both variables have a mean bias of d = 0.5, then we would expect a bias range of $-.12 \le \Delta r \le .12$, or in other words a span of .24. If both mean biases are d = 1, then this range increases to $-.4 \le \Delta r \le .4$, or a span of .8. At the same time, the graph intuitively shows that the correlation bias depends on more than just the distance. In fact, even for the highest distance, there are still cases without any correlation bias. This is because correlation bias depends on the direction of the mean bias in relation to the direction of the unbiased correlation $r_u$ as well as on the strength of the unbiased correlation $r_u$. In the following sections, we will unravel these interactions step by step.

Correlation Bias as a Function of the Two Mean Biases Separately
To get a feeling for the interaction between the mean biases $d_P$ and $d_Q$ in both variables as well as the unbiased correlation $r_u$, let us consider a series of mean bias grids in Figure 4. Each grid varies the mean biases systematically along the x and y axes. The colors, meanwhile, indicate the correlation bias for each mean bias combination. The five panels each show the correlation bias pattern for a different underlying unbiased correlation $r_u$. From left to right, $r_u = -1$, $-.5$, $0$, $.5$, and $1$. We can thus interpret each panel as a two-dimensional cross-section of the three-dimensional simulation matrix for a given value of $r_u$. Let us first consider the panel in the middle. Here, the two constructs are uncorrelated with $r_u = 0$. However, mean bias introduces spurious correlations. Since correlation bias is calculated as $\Delta r = r_C - r_u$, we see that if the mean biases are both positive or both negative, then we create a spurious positive correlation. In the extreme cases (the upper-right and lower-left corners), we see that $d_P = d_Q = 1$ and $d_P = d_Q = -1$ result in a spurious correlation of $r_C = .2$. In the opposite case, where $d_P = -1$ and $d_Q = 1$ or vice versa, we see a negative spurious correlation of $r_C = -.2$.
In other words, the positive absolute unbiased effect size. In Figure 5 below we see the familiar mean bias grids, but this time colors represent areas of reduced effect sizes in blue, inflated effect sizes in yellow, and areas with negligible bias ($|\Delta r| < .01$) in green. The unbiased $r_u$ values range from zero to one, because negative values are just mirrored along the x-axis. We see that when $r_u = 0$, every corner inflates absolute correlations. However, as soon as $r_u$ has a direction, we have inflation in sectors where the mean bias direction aligns with the correlation direction, and reduction in sectors where the mean bias direction is opposite to the correlation direction. There are two further patterns worth noting. First, as $r_u$ increases in strength, the inflated sector contracts and the reduction sector expands. Second, as $r_u$ increases, bias that reduces the effect size gains in intensity and bias that inflates the effect size loses intensity.
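The three color categories can be expressed as a small classification rule. The following Python sketch is an assumption about how such a rule could look, not the rule used to draw Figure 5: the function name is hypothetical, the negligible band is applied to the absolute value of the correlation bias, and inflation versus reduction is decided by comparing absolute correlations. The example values are taken from cases mentioned in the text.

```python
def classify_effect_size_bias(r_u, r_c, tolerance=0.01):
    """Label a biased correlation relative to the unbiased correlation."""
    delta_r = r_c - r_u
    if abs(delta_r) < tolerance:
        return "negligible"
    # Compare absolute effect sizes; exact ties fall into "reduced".
    return "inflated" if abs(r_c) > abs(r_u) else "reduced"


print(classify_effect_size_bias(r_u=0.0, r_c=0.2))     # inflated
print(classify_effect_size_bias(r_u=-1.0, r_c=-0.6))   # reduced
print(classify_effect_size_bias(r_u=0.5, r_c=0.505))   # negligible
```

The second call illustrates the earlier caveat that a positive correlation bias can still mean a weaker absolute effect size when the unbiased correlation is negative.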

Can Mean Bias Change the Correlation Direction?
A last aspect of practical importance is whether mean bias can invert the direction of correlations. Some researchers are more interested in the direction of effects than in the absolute effect size. In such cases, it would be fatal if mean bias changed the direction of an effect. And indeed, there are simulations where the direction of $r_u$ is inverted. The minimum absolute distance at which this occurred was 0.57. In other words, in the worst case, an inverted correlation can already occur if both variables are biased with d = 0.4. However, such a change in direction can only occur for unbiased correlations $|r_u| < .25$ as long as the mean biases remain between $-1 \le d \le 1$. Supplementary Figure B shows the mean bias areas with inverted correlations in detail (see Singh, 2024b).
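The upper bound of .25 can be made plausible with a short worked derivation. It is not taken from the paper but follows, as a sketch, from the idealized simulation design: two equally large halves, unit variances within each half, a within-half correlation of $r_u$, and survey B shifted by $d_P$ and $d_Q$. Sampling noise is ignored. Under these assumptions the moments of the composite variables are

$$\operatorname{Var}(P_C) = 1 + \frac{d_P^2}{4}, \qquad \operatorname{Var}(Q_C) = 1 + \frac{d_Q^2}{4}, \qquad \operatorname{Cov}(P_C, Q_C) = r_u + \frac{d_P\, d_Q}{4}$$

so that the expected combined correlation is

$$r_C = \frac{r_u + \frac{d_P\, d_Q}{4}}{\sqrt{\left(1 + \frac{d_P^2}{4}\right)\left(1 + \frac{d_Q^2}{4}\right)}}$$

Because the denominator is always positive, the sign of $r_C$ is the sign of $r_u + d_P d_Q / 4$. An inversion therefore requires that $d_P d_Q$ has the opposite sign of $r_u$ and that $|d_P d_Q| / 4 > |r_u|$. With $|d_P| \le 1$ and $|d_Q| \le 1$, the left-hand side cannot exceed $.25$, which is in line with the bound observed in the simulations. As a plausibility check, the same expression gives $r_C = .25 / 1.25 = .2$ for $r_u = 0$ and $d_P = d_Q = 1$, matching the spurious correlation reported for the corners of the middle panel in Figure 4.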

Discussion
When we combine data from different source surveys on the same latent constructs, we might incur a mean bias in our combined data. This usually happens if two instruments have different item difficulties, meaning they assign different positions along their scale range to average respondents of a specific population. In practical terms, if the two instruments were applied to the same population, we would get a higher mean value with one instrument than with the other. Such mean biases in combined data should ideally be removed by adequate harmonization techniques, such as equating or multiple imputation. However, if the combined data have not been fully harmonized, such as data that have only been linearly stretched, then mean biases might remain. Such mean biases are problematic in themselves, of course. If the instruments were applied in different populations, mean bias means that we might over- or underestimate the true population differences. However, the simulations in this paper demonstrate that mean bias can also lead to biased bivariate correlations.
The simulations show that mean bias can lead to substantive correlation biases. For mean biases with $-1 \le d \le 1$, we observed correlation biases between $-.4 \le \Delta r \le .4$. For a specific unbiased correlation $r_u$, we observed a range of correlation biases of $\max \Delta r - \min \Delta r \approx .4$. And even if mean biases are weaker, this still implies a worrisome range of biases.
Furthermore, the simulations demonstrate how complex the interaction between the mean biases in each variable and the unbiased correlation is. In some configurations, even very large mean biases lead to little if any correlation bias. However, these cases mainly occur either where the unbiased correlation is zero or where it approaches a perfect negative or positive correlation. In more realistic correlation ranges, all substantive mean bias configurations lead to correlation biases, even in cases where only one of the two composite variables is subject to mean bias.
The simulations also revealed two additional patterns of practical interest. First, mean biases can either inflate or reduce the absolute effect size. If we obtain an empirical correlation from a harmonized dataset and suspect that mean bias might be present, this means that the unbiased correlation might be higher or lower than the empirical correlation. Second, if the unbiased correlation $r_u$ is low, then its effect direction may be inverted by some mean bias configurations. Specifically, empirical correlations of $|r_C| < .2$ might misrepresent the unbiased correlation direction if we suspect mean biases up to a range of $-1 \le d \le 1$.

Conclusion
The paper has two main practical implications: (1) Wherever possible, mean bias should be mitigated by applying appropriate ex-post harmonization procedures, such as observed score equating (Singh, 2022) or multiple imputation using a calibration sample (Siddique et al., 2015). (2) Where mean biases cannot be mitigated, correlations based on multi-source data should be interpreted with caution: The direction of small correlations may have been inverted, and comparisons of the relative correlation strength across different instruments might be misleading.

Funding:
The author has no funding to report.

Figure 1
Schematic Overview of a Composite Dataset and the Resulting Simpson's Paradox

Figure 2
Schematic Overview of the Simulation Process for a Specific Combination of $d_P$, $d_Q$, and $r_u$

Figure 4
Correlation Bias as a Function of $d_P$ and $d_Q$ for Selected Values of $r_u$

Figure 5
Reduced or Inflated Absolute Effect Sizes