Multicollinearity

What is Multicollinearity?

Multicollinearity occurs when two or more explanatory variables in a regression model are highly correlated with each other, making it difficult to isolate their individual effects on the dependent variable. It has two types:

  • Perfect Multicollinearity
  • Imperfect Multicollinearity

Perfect multicollinearity

Perfect multicollinearity refers to an exact linear relationship among some or all explanatory variables in the model. Mathematically, there exist constants \lambda_1, \lambda_2, \dots, \lambda_k, not all zero, such that

\lambda_1 X_1 + \lambda_2 X_2 + \cdots + \lambda_k X_k = 0

If \lambda_2 \neq 0, this can be solved for X2:

X_2 = -\frac{\lambda_1}{\lambda_2}X_1 - \frac{\lambda_3}{\lambda_2}X_3 - \cdots - \frac{\lambda_k}{\lambda_2}X_k

That is, X2 is an exact linear function of the other explanatory variables.

Imperfect multicollinearity

Imperfect multicollinearity refers to an inexact (approximate) linear relationship among some or all explanatory variables in the model. Mathematically,

\lambda_1 X_1 + \lambda_2 X_2 + \cdots + \lambda_k X_k + v = 0

where v is a stochastic error term. If \lambda_2 \neq 0, solving for X2 gives

X_2 = -\frac{\lambda_1}{\lambda_2}X_1 - \frac{\lambda_3}{\lambda_2}X_3 - \cdots - \frac{\lambda_k}{\lambda_2}X_k - \frac{1}{\lambda_2}v

Here, X2 is not an exact linear combination of the other X variables because it also depends on the stochastic term v.

Example

X2     X3     X4
10     50     52
15     75     75
18     90     97
24     120    129
30     150    152

It is apparent that X3 = 5X2. Therefore, there is perfect collinearity between X2 and X3, and the coefficient of correlation r23 is unity.

In contrast, X4 = 5X2 + v, where v takes the values 2, 0, 7, 9, 2. There is no longer perfect collinearity between X2 and X4. However, the two variables are still highly correlated: the coefficient of correlation between them is 0.9959.
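
The two correlations quoted in this example can be checked numerically. The following is a minimal sketch using NumPy; the variable names are illustrative and the data are the five observations listed in the example.

```python
import numpy as np

# Data from the example: X3 is an exact multiple of X2, X4 is not.
x2 = np.array([10, 15, 18, 24, 30])
x3 = 5 * x2                               # exact linear function of X2
x4 = np.array([52, 75, 97, 129, 152])     # 5*X2 plus a small disturbance v

r23 = np.corrcoef(x2, x3)[0, 1]           # correlation between X2 and X3
r24 = np.corrcoef(x2, x4)[0, 1]           # correlation between X2 and X4
v = x4 - 5 * x2                           # implied disturbance term

print(round(r23, 4))   # 1.0
print(round(r24, 4))   # 0.9959
print(v)               # [2 0 7 9 2]
```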

Consequences of Multicollinearity

Consequences of Perfect Multicollinearity

  1. The regression coefficients are indeterminate.
  2. The standard errors are infinite.
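
One way to see why the coefficients are indeterminate is to look at the rank of the data matrix. The sketch below, which assumes NumPy and reuses the X2 and X3 = 5X2 values from the example, shows that the matrix has fewer independent columns than coefficients, and that two different coefficient vectors (chosen here purely for illustration) produce identical fitted values.

```python
import numpy as np

x2 = np.array([10.0, 15.0, 18.0, 24.0, 30.0])
x3 = 5 * x2                                       # perfectly collinear with x2
X = np.column_stack([np.ones_like(x2), x2, x3])   # intercept, X2, X3

rank = np.linalg.matrix_rank(X)
print(X.shape[1])   # 3 coefficients to estimate
print(rank)         # 2 linearly independent columns, so X'X is singular

# Two different coefficient vectors that give exactly the same fit:
b_a = np.array([1.0, 5.0, 0.0])   # all of the effect attributed to X2
b_b = np.array([1.0, 0.0, 1.0])   # all of the effect attributed to X3
print(np.allclose(X @ b_a, X @ b_b))   # True
```

Because the data cannot distinguish between such vectors, no unique least-squares solution exists.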

Consequences of Imperfect Multicollinearity

Theoretical Consequences of Imperfect Multicollinearity

  1. OLS estimators remain unbiased in the presence of imperfect multicollinearity. However, unbiasedness is a multi-sample or repeated-sampling property: if we draw several samples and compute the OLS estimates for each of them, the average of the estimates tends to converge to the true population value, so that E(\hat{\beta}_i)=\beta_i. It says nothing about the properties of the OLS estimator in any single sample.
  2. OLS estimators still have minimum variance in the class of all linear unbiased estimators, but this does not mean that the variance of an OLS estimator will be small relative to the value of the estimate in any given sample. When the variance is large, the computed t value is low, so we may fail to reject the null hypothesis that the true coefficient is zero.
  3. With near or imperfect multicollinearity, it is difficult to isolate the partial effect of each independent variable on Y.

Practical Consequences of Imperfect Multicollinearity

  1. Regression coefficients are determinate but cannot be estimated precisely.
  2. Although the OLS estimators are BLUE, they have large variances and covariances.
  3. Confidence intervals tend to be much wider, so the “zero null hypothesis” (i.e., the true population coefficient is zero) is accepted more readily.
  4. The t-ratio of one or more coefficients tends to be statistically insignificant.
  5. Although the t-ratio of one or more coefficients is statistically insignificant, R^2, the overall measure of goodness of fit, can be very high.
  6. The OLS estimators and their standard errors can be sensitive to small changes in the data; that is, they become unstable.
  7. Regression coefficients may take the wrong sign, that is, a sign opposite to the one expected from theory.

Now we discuss these consequences in detail.

Large Variances and Covariances

\mathrm{Var}(\hat{\beta}_1)=\frac{\hat{\sigma}_u^2}{\sum x_1^2 (1-r_{12}^2)}

\mathrm{Var}(\hat{\beta}_2)=\frac{\hat{\sigma}_u^2}{\sum x_2^2 (1-r_{12}^2)}

\mathrm{Cov}(\hat{\beta}_1,\hat{\beta}_2)=\frac{-r_{12}\hat{\sigma}_u^2}{\sqrt{\left(\sum x_1^2\right)\left(\sum x_2^2\right)}(1-r_{12}^2)}

Here r_{12} is the correlation coefficient between X1 and X2, and x_1 and x_2 denote X1 and X2 measured as deviations from their sample means. As collinearity increases, that is, as r_{12}^2 approaches 1, the variances and covariances of the two estimators increase; in the limit r_{12}^2 = 1 they are infinite.

The speed with which the variances and covariances of the estimators increase is measured by the Variance Inflation Factor (VIF). It is calculated as:

\mathrm{VIF}=\frac{1}{1-r_{12}^2}

If the X variables are not correlated at all, then VIF = 1. As the correlation coefficient between X1 and X2 tends to 1, VIF tends to infinity, so under perfect correlation VIF is infinite.

The inverse of VIF is called Tolerance. It is calculated as:

\mathrm{TOL}=\frac{1}{\mathrm{VIF}}

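As VIF increases, TOL decreases: TOL equals 1 when the regressors are uncorrelated and approaches 0 as they become perfectly collinear.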

Relationship between Correlation Coefficient, VIF and Tolerance

r_{12}    VIF       TOL = 1/VIF
0         1.00      1.00
0.5       1.33      0.75
0.7       1.96      0.51
0.8       2.78      0.36
0.9       5.26      0.19
0.95      10.26     0.10
0.97      16.92     0.06
0.99      50.25     0.02
0.995     100.25    0.01
0.999     500.25    0.002
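
The VIF and TOL for any value of r_{12} follow directly from the two formulas given earlier. The short sketch below computes them for a model with two regressors; the function name is illustrative and not taken from any library.

```python
def vif_two_regressors(r12: float) -> float:
    """VIF = 1 / (1 - r12^2) for a model with two regressors."""
    if abs(r12) >= 1:
        return float("inf")      # perfect collinearity: variance is unbounded
    return 1.0 / (1.0 - r12 ** 2)

for r in (0, 0.5, 0.7, 0.8, 0.9, 0.95, 0.97, 0.99, 0.995, 0.999):
    vif = vif_two_regressors(r)
    tol = 1.0 / vif              # tolerance is the inverse of VIF
    print(f"r12={r:<6} VIF={vif:8.2f} TOL={tol:.3f}")
```

Note how slowly VIF grows for moderate correlations and how sharply it rises once r_{12} exceeds about 0.9.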

Wider Confidence Intervals

Multicollinearity leads to wider confidence intervals because it increases the standard errors of the estimated regression coefficients. The formula to calculate the confidence interval is:

95\%\,CI \text{ for } \beta_i = \left[\hat{\beta}_i - t_{\alpha/2}\, \mathrm{se}(\hat{\beta}_i),\; \hat{\beta}_i + t_{\alpha/2}\, \mathrm{se}(\hat{\beta}_i)\right]

Because a wider interval is more likely to include zero, we more often fail to reject the null hypothesis that the true coefficient is zero, and the coefficient appears statistically insignificant.

Insignificant “t-ratios”

Multicollinearity leads to “insignificant” t-ratios because it inflates the standard errors of the estimated coefficients, which reduces the value of the t-statistic. As a result, we fail to reject the null hypothesis and the variable appears statistically insignificant, even though it may actually be important. The t-ratio is calculated as the estimated coefficient divided by its standard error:

\hat{t}_{\beta_i}=\frac{\hat{\beta}_i}{\mathrm{se}(\hat{\beta}_i)}
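
The effect on standard errors, confidence intervals and t-ratios can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the data-generating process (y = 1 + 2*x1 + 2*x2 + u), the sample size, the seed and the function names are all hypothetical, and the interval uses the normal critical value 1.96 as an approximation to t_{\alpha/2}.

```python
import numpy as np

def ols_with_se(X, y):
    """Return OLS coefficients and their standard errors for design matrix X."""
    n, p = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)          # estimate of the error variance
    se = np.sqrt(sigma2 * np.diag(xtx_inv))
    return beta, se

def simulate(rho, n=200, seed=0):
    """Simulate y = 1 + 2*x1 + 2*x2 + u with corr(x1, x2) close to rho."""
    rng = np.random.default_rng(seed)
    x1 = rng.standard_normal(n)
    x2 = rho * x1 + np.sqrt(1 - rho ** 2) * rng.standard_normal(n)
    u = rng.standard_normal(n)
    y = 1 + 2 * x1 + 2 * x2 + u
    X = np.column_stack([np.ones(n), x1, x2])
    return ols_with_se(X, y)

for rho in (0.0, 0.99):
    beta, se = simulate(rho)
    t = beta / se                             # t-ratio for each coefficient
    half_width = 1.96 * se                    # approximate 95% CI half-width
    print(f"rho={rho}: se={se[1:].round(3)}, t={t[1:].round(2)}, "
          f"half-width={half_width[1:].round(3)}")
```

With the same true coefficients and the same error term, the highly correlated design yields much larger standard errors, hence wider intervals and smaller t-ratios.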

Causes/Sources of Multicollinearity

  1. Limited variability in data: When data is collected over a limited range of values, the variables may appear more closely related than they actually are.
  2. Constraints on the model or population being sampled: For example, in the regression of electricity consumption (Y) on income (X1) and house size (X2), there is a physical constraint in the population in that families with higher incomes generally have larger homes.
  3. Adding polynomial or interaction terms: Including polynomial terms (like X, X^2, X^3) or interaction terms (X1*X2) can create high correlation among regressors.
  4. Over-determined model: When the number of independent variables (k) exceeds the number of observations (n), it leads to multicollinearity. This situation is common in medical research.
  5. Common trends: Time series variables like GDP, population, and investment often have similar upward or downward trends over time, leading to multicollinearity.
  6. Dummy variable trap: When a model with an intercept includes as many dummies as there are categories, without excluding a base/reference category, the result is perfect multicollinearity.
  7. Same variable in different units or scales: Including the same variable measured in different units (e.g., price in PKR and in USD) may cause collinearity, because one measure is (nearly) proportional to the other.
  8. Inclusion of overlapping or closely related variables: For example, including both years of education and education level (primary, secondary, etc.) can cause multicollinearity.
  9. Including lagged terms: In time series data, using lagged values of variables (e.g., Y_{t-1},\; Y_{t-2}) may lead to multicollinearity if they are highly correlated with each other.
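
The dummy variable trap from the list above is easy to reproduce. The following sketch uses a hypothetical categorical variable with three categories and assumes NumPy; it compares the rank of the design matrix with and without a reference category.

```python
import numpy as np

# Hypothetical categorical variable with three categories (0, 1, 2).
category = np.array([0, 1, 2, 0, 1, 2, 0, 1])
dummies = np.eye(3)[category]                  # one dummy column per category
intercept = np.ones((len(category), 1))

X_trap = np.hstack([intercept, dummies])         # intercept + all three dummies
X_ok = np.hstack([intercept, dummies[:, 1:]])    # first category dropped as base

print(X_trap.shape[1], np.linalg.matrix_rank(X_trap))   # 4 3
print(X_ok.shape[1], np.linalg.matrix_rank(X_ok))       # 3 3
```

In the first matrix the three dummies sum to the intercept column, so one column is redundant; dropping a reference category restores full column rank.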

Detection of Multicollinearity

1. High R^2 but few significant t-ratios

This is the “classic” symptom of multicollinearity. If R^2 is high, say more than 0.8, and the F test suggests that the overall model is significant, but none or only a few of the individual t-ratios of the regression coefficients are statistically significant, then multicollinearity may be present.

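2. High Pair-Wise Correlations among Regressors

If the pairwise or zero-order correlation coefficient between two regressors is high, say more than 0.8, then multicollinearity may be a serious problem. High pairwise correlations are a sufficient but not a necessary condition: in models with more than two regressors, multicollinearity can exist even when every pairwise correlation is comparatively low.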
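
A pairwise screen of this kind can be sketched as follows. The helper name, the 0.8 default threshold (taken from the rule of thumb above) and the small data set are all illustrative assumptions, not part of any library.

```python
import numpy as np

def high_correlation_pairs(X, names, threshold=0.8):
    """Return (name_i, name_j, r) for regressor pairs with |r| above threshold."""
    corr = np.corrcoef(X, rowvar=False)        # columns of X are variables
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) > threshold:
                pairs.append((names[i], names[j], round(float(corr[i, j]), 4)))
    return pairs

# Hypothetical data: x1 and x2 move together, x3 does not.
X = np.array([
    [1.0, 2.1, 5.0],
    [2.0, 3.9, 1.0],
    [3.0, 6.2, 4.0],
    [4.0, 8.1, 2.0],
    [5.0, 9.8, 3.0],
])
pairs = high_correlation_pairs(X, ["x1", "x2", "x3"])
print(pairs)
```

Only the (x1, x2) pair is flagged here. A clean result from such a screen does not rule out collinearity that involves three or more regressors jointly.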

3. Examination of Partial Correlations

Farrar and Glauber have suggested examining partial correlation coefficients. Thus, in the regression of Y = \beta_0 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + u, a finding that overall R^2_{Y.234} is very high but r^2_{Y2.34},\; r^2_{Y3.24},\; r^2_{Y4.23} are comparatively low may suggest that the variables X_2, X_3, X_4 are highly intercorrelated.

4. Auxiliary Regression

Another way to detect multicollinearity is to run auxiliary regressions, regressing each explanatory variable X_i on the remaining explanatory variables, and to compute the R^2 from each regression, which we call R_i^2. From each R_i^2 we can compute the corresponding F value from the following equation:

F_i=\frac{\left(R^2_{x_i\cdot x_2 x_3 \dots x_k}\right)/(k-2)}{\left(1-R^2_{x_i\cdot x_2 x_3 \dots x_k}\right)/(n-k+1)}

where n is the sample size and k is the number of explanatory variables including the intercept, so that F_i follows the F distribution with (k-2) and (n-k+1) degrees of freedom.

If the computed F exceeds the critical F at the chosen level of significance, we conclude that the particular X_i is collinear with the other X’s; if it does not exceed the critical F, we say that it is not collinear with the other X’s.
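
The auxiliary-regression procedure can be sketched in a few lines. This is a minimal illustration, assuming NumPy, simulated data with a fixed seed, and a hypothetical helper name. It takes k to be the number of regressors plus the intercept, so the auxiliary regression has (k - 2) slope terms and (n - k + 1) residual degrees of freedom.

```python
import numpy as np

def auxiliary_f(X, i):
    """R^2 and F statistic from regressing column i of X on the other columns.

    X holds the regressors only (no intercept column); an intercept is added
    to the auxiliary regression. k counts the regressors plus the intercept.
    """
    n, m = X.shape
    k = m + 1
    target = X[:, i]
    others = np.delete(X, i, axis=1)
    Z = np.column_stack([np.ones(n), others])
    coef, *_ = np.linalg.lstsq(Z, target, rcond=None)
    resid = target - Z @ coef
    tss = np.sum((target - target.mean()) ** 2)
    r2 = 1.0 - (resid @ resid) / tss
    f_stat = (r2 / (k - 2)) / ((1.0 - r2) / (n - k + 1))
    return r2, f_stat

rng = np.random.default_rng(0)                  # hypothetical data, fixed seed
n = 50
x1 = rng.standard_normal(n)
x2 = 0.9 * x1 + 0.1 * rng.standard_normal(n)    # nearly collinear with x1
x3 = rng.standard_normal(n)                     # unrelated regressor
X = np.column_stack([x1, x2, x3])

for i, name in enumerate(["x1", "x2", "x3"]):
    r2, f_stat = auxiliary_f(X, i)
    print(f"{name}: R^2 = {r2:.3f}, F = {f_stat:.1f}")
```

Each F value would then be compared with the critical value at the chosen significance level, which could be obtained, for example, from `scipy.stats.f.ppf`.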

5. Klein's Rule of Thumb

Klein's rule of thumb states that multicollinearity may be a troublesome problem if the R^2 obtained from an auxiliary regression is greater than the overall R^2, that is, the R^2 obtained from the regression of Y on all the regressors.

6. Tolerance or Variance Inflation Factor

The variance inflation factor shows how much the variance of an estimator is inflated by the presence of multicollinearity:

\mathrm{VIF}_j=\frac{1}{1-R_j^2}

where R_j^2 is the R^2 from the auxiliary regression of X_j on the remaining regressors. As a rule of thumb, if VIF_j exceeds 10 (that is, R_j^2 exceeds 0.90), the variable is considered highly collinear and may need corrective action. Tolerance is the inverse of VIF:

\mathrm{TOL}_j=\frac{1}{\mathrm{VIF}_j}

The closer TOL_j is to zero, the greater the degree of collinearity of that variable with the other regressors.
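
As an illustration, the VIF of every regressor can be obtained in one step from the correlation matrix of the regressors, because the diagonal elements of its inverse equal 1/(1 - R_j^2). This is an equivalent shortcut to running each auxiliary regression, not a different measure. The sketch assumes NumPy, uses a hypothetical function name, and applies it to the X2 and X4 values from the earlier example.

```python
import numpy as np

def vif_from_correlation(X):
    """VIF_j for each column of X (regressors only, no intercept column)."""
    corr = np.corrcoef(X, rowvar=False)     # correlation matrix of the regressors
    return np.diag(np.linalg.inv(corr))     # j-th diagonal element = 1/(1 - R_j^2)

x2 = np.array([10.0, 15.0, 18.0, 24.0, 30.0])
x4 = np.array([52.0, 75.0, 97.0, 129.0, 152.0])
X = np.column_stack([x2, x4])

vif = vif_from_correlation(X)
tol = 1.0 / vif                             # tolerance is the inverse of VIF
print(vif.round(2), tol.round(4))
```

With only two regressors, both VIFs equal 1/(1 - r^2) for their pairwise correlation r, and here they lie well above the rule-of-thumb value of 10.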

7. Scatterplot

It is good practice to inspect scatterplots of the explanatory variables against one another. If the scatterplot between two X variables is close to a straight line, it suggests that these two X variables are highly collinear.

8. Eigenvalues and Condition Index

Most statistical software can report the eigenvalues of the X'X matrix and the associated condition indices to diagnose multicollinearity. From the eigenvalues, we can derive the condition number, k, and the condition index, CI, using the following formulas:

k=\frac{\lambda_{\max}}{\lambda_{\min}}

\mathrm{CI}=\sqrt{k}

  • \lambda_{\max} = maximum eigenvalue
  • \lambda_{\min} = minimum eigenvalue

If k exceeds 1000 or CI exceeds 30, multicollinearity is severe.
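
The calculation can be sketched as follows. Two assumptions are made here that the text does not state: the eigenvalues are those of X'X, and each column of X (including the intercept) is first scaled to unit length so that the result does not depend on the units of measurement. The function name is hypothetical, and the data are the X2 and X4 values from the earlier example.

```python
import numpy as np

def condition_diagnostics(X):
    """Condition number k and condition index CI from the eigenvalues of X'X.

    Assumption: each column of X is scaled to unit length before the
    eigenvalues are computed.
    """
    Xs = X / np.linalg.norm(X, axis=0)              # unit-length columns
    eigenvalues = np.linalg.eigvalsh(Xs.T @ Xs)     # symmetric matrix, real eigenvalues
    k = eigenvalues.max() / eigenvalues.min()
    return k, np.sqrt(k)

x2 = np.array([10.0, 15.0, 18.0, 24.0, 30.0])
x4 = np.array([52.0, 75.0, 97.0, 129.0, 152.0])
X = np.column_stack([np.ones_like(x2), x2, x4])     # intercept, X2, X4

k, ci = condition_diagnostics(X)
print(f"condition number k = {k:.1f}, condition index CI = {ci:.1f}")
```

The resulting k and CI can then be compared with the thresholds of 1000 and 30 given above.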

Remedial Measures for Multicollinearity

  1. Do nothing: Multicollinearity is a feature of the sample data rather than a fault of the OLS method, so in some cases no corrective measure is needed. This is particularly true if the goal is prediction rather than inference.
  2. Use panel data to increase variability: Another approach used to reduce multicollinearity is to combine cross-sectional and time series data (panel data), which increases the variation in the dataset.
  3. Remove highly collinear variables: Dropping one or more variables with high Variance Inflation Factors (VIF > 10) can mitigate multicollinearity. However, this approach may lead to omitted variable bias or misspecification if important variables are removed.
  4. Transform the variables: In time series data, applying transformations such as logarithms, first differences, or growth rates can help eliminate trends and reduce multicollinearity.
  5. Increase the sample size: Adding more observations increases the variability in the data and can help reduce multicollinearity.
  6. Use dimensionality reduction techniques: Multivariate techniques like Factor Analysis and Principal Components Analysis (PCA) can be used to combine collinear variables.
  7. Apply regularisation methods: Techniques such as Ridge Regression and Lasso Regression introduce a penalty term to the regression estimates, shrinking the coefficients and helping to stabilise the model when multicollinearity is present.
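
To illustrate the last remedy, the sketch below applies the closed-form ridge estimator, (X'X + \lambda I)^{-1} X'y, to the perfectly collinear pair X2 and X3 = 5X2 from the earlier example. The dependent variable y and the penalty value \lambda = 1 are hypothetical choices made only for illustration, and the helper function is not taken from any library.

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge estimate (X'X + lam*I)^(-1) X'y on centred data.

    Centring X and y removes the intercept, so it is not penalised.
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    p = Xc.shape[1]
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)

x2 = np.array([10.0, 15.0, 18.0, 24.0, 30.0])
x3 = 5 * x2                                  # perfectly collinear with x2
y = np.array([3.0, 4.5, 5.0, 7.0, 8.5])      # hypothetical dependent variable
X = np.column_stack([x2, x3])

rank = np.linalg.matrix_rank(X - X.mean(axis=0))
b = ridge(X, y, lam=1.0)
print(rank)   # 1: OLS has no unique solution for two slopes
print(b)      # ridge still returns finite, unique coefficients
```

The penalty makes the matrix being inverted non-singular, which is why a solution exists even under perfect collinearity. In practice the regressors are usually standardised first, because the penalty depends on the scale of each variable, and \lambda is chosen by a method such as cross-validation.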