Multicollinearity: Types, Consequences, Causes, Detection & Remedies

One of the assumptions of the Classical Linear Regression Model (CLRM) is that there is no perfect linear relationship among the explanatory variables (the X's). In other words, there should be no perfect multicollinearity. For a detailed treatment of the CLRM assumptions, read Assumptions of Classical Linear Regression Model (CLRM).

Now that we have introduced the concept of multicollinearity, several important questions arise. What exactly is multicollinearity? What are its different types? What causes multicollinearity in regression models, and what consequences does it have for statistical inference? How can researchers detect the presence of multicollinearity, and what remedies are available to address this problem?

These questions are the main focus of this blog.

What is Multicollinearity?

Multicollinearity occurs when two or more explanatory variables in the regression model are highly correlated with each other, making it difficult to isolate their individual effects on the dependent variable. Multicollinearity has two types:

Perfect Multicollinearity
Imperfect Multicollinearity.

Perfect multicollinearity

Perfect multicollinearity refers to an exact linear relationship among some or all explanatory variables in the model. Mathematically, there exist constants λ₁, λ₂, …, λₖ, not all zero, such that

$\lambda_1 X_1 + \lambda_2 X_2 + \cdots + \lambda_k X_k = 0$

Assuming λ₂ ≠ 0, this can be solved for X₂:

$X_2 = -\frac{\lambda_1}{\lambda_2}X_1 - \frac{\lambda_3}{\lambda_2}X_3 - \cdots - \frac{\lambda_k}{\lambda_2}X_k$

That is, X₂ is exactly linearly related to the other variables.

Imperfect multicollinearity

Imperfect multicollinearity refers to an inexact linear relationship among some or all explanatory variables in the model. Mathematically, with v denoting a stochastic error term,

$\lambda_1 X_1 + \lambda_2 X_2 + \cdots + \lambda_k X_k + v = 0$

$X_2 = -\frac{\lambda_1}{\lambda_2}X_1 - \frac{\lambda_3}{\lambda_2}X_3 - \cdots - \frac{\lambda_k}{\lambda_2}X_k - \frac{1}{\lambda_2}v$

Here, X₂ is not an exact linear combination of other X variables because it is also determined by a stochastic term v.

Example

X₂	X₃	X₄
10	50	52
15	75	75
18	90	97
24	120	129
30	150	152

It is apparent that X₃ = 5X₂. Therefore, there is perfect collinearity between X₂ and X₃, and the coefficient of correlation r₂₃ is unity.

The variable X₄ was created as X₄ = 5X₂ + v, where v takes the values 2, 0, 7, 9 and 2. Now there is no longer perfect collinearity between X₂ and X₄. However, the two variables are highly correlated, since the coefficient of correlation between them is 0.9959.
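The two correlation coefficients quoted above can be checked with a few lines of Python. This is only a minimal sketch using the standard library; the helper name `pearson_r` is our own and is not part of any package.

```python
import math

def pearson_r(x, y):
    """Pearson correlation computed from raw sums of the data."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    num = n * sxy - sx * sy
    den = math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
    return num / den

# Data copied from the example table
x2 = [10, 15, 18, 24, 30]
x3 = [50, 75, 90, 120, 150]
x4 = [52, 75, 97, 129, 152]

r23 = pearson_r(x2, x3)
r24 = pearson_r(x2, x4)
v = [b - 5 * a for a, b in zip(x2, x4)]  # implied disturbance v = X4 - 5*X2

print(round(r23, 4))  # 1.0
print(round(r24, 4))  # 0.9959
print(v)              # [2, 0, 7, 9, 2]
```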

Consequences of Multicollinearity

Consequences of Perfect Multicollinearity

The regression coefficients are indeterminate.
The standard errors are infinite.
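
To see why the coefficients are indeterminate, consider a sketch for a model with two regressors X₂ and X₃ written in deviation form, where lowercase letters denote deviations from sample means. The proportionality constant λ used here is an assumption of this illustration. The OLS estimator of β₂ is

$\hat{\beta}_2=\frac{\left(\sum y x_2\right)\left(\sum x_3^2\right)-\left(\sum y x_3\right)\left(\sum x_2 x_3\right)}{\left(\sum x_2^2\right)\left(\sum x_3^2\right)-\left(\sum x_2 x_3\right)^2}$

If $x_3=\lambda x_2$ exactly, then $\sum x_3^2=\lambda^2\sum x_2^2$, $\sum y x_3=\lambda\sum y x_2$ and $\sum x_2 x_3=\lambda\sum x_2^2$, so that

$\hat{\beta}_2=\frac{\left(\sum y x_2\right)\left(\lambda^2\sum x_2^2\right)-\left(\lambda\sum y x_2\right)\left(\lambda\sum x_2^2\right)}{\left(\sum x_2^2\right)\left(\lambda^2\sum x_2^2\right)-\lambda^2\left(\sum x_2^2\right)^2}=\frac{0}{0}$

The expression is undefined, and the same happens for $\hat{\beta}_3$: the data cannot separate the effect of X₂ from that of X₃.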

Consequences of Imperfect Multicollinearity

Theoretical Consequences of Imperfect Multicollinearity

OLS estimators remain unbiased in the presence of imperfect multicollinearity, but unbiasedness is a multi-sample (repeated sampling) property. That is, if we obtain several samples and compute the OLS estimates for each of them, the average of the estimates will tend to converge to the true population value, so that $E(\hat{\beta}_i)=\beta_i$. It says nothing about the properties of the OLS estimator in any single sample.
OLS estimators still have minimum variance in the class of all linear unbiased estimators, but this does not mean that the variance of an OLS estimator will necessarily be small in relation to the value of the estimate in any given sample. When the variance is large, the computed t value is low, so we may fail to reject the null hypothesis that the true coefficient is zero even when it is not.
In the case of near or imperfect multicollinearity, it is difficult to isolate the partial effect of each independent variable on Y.

Practical Consequences of Imperfect Multicollinearity

Regression coefficients are determinate but cannot be estimated precisely.
Although the OLS estimators are BLUE, they have large variances and covariances.
Confidence intervals tend to be much wider, making it more likely that we fail to reject the “zero null hypothesis” (i.e., that the true population coefficient is zero).
The t-ratio of one or more coefficients tends to be statistically insignificant.
Although the t-ratio of one or more coefficients is statistically insignificant, R², the overall measure of goodness of fit, can be very high.
The OLS estimators and their standard errors can be sensitive to small changes in the data, that is, they become unstable.
The signs of the regression coefficients may reverse.

Now we discuss these consequences in detail.

Large Variances and Covariances

$\mathrm{Var}(\hat{\beta}_1)=\frac{\hat{\sigma}_u^2}{\sum x_1^2 (1-r_{12}^2)}$

$\mathrm{Var}(\hat{\beta}_2)=\frac{\hat{\sigma}_u^2}{\sum x_2^2 (1-r_{12}^2)}$

$\mathrm{Cov}(\hat{\beta}_1,\hat{\beta}_2)=\frac{-r_{12}\hat{\sigma}_u^2}{\sqrt{\left(\sum x_1^2\right)\left(\sum x_2^2\right)}(1-r_{12}^2)}$

Here $r_{12}^2$ is the squared correlation coefficient between X₁ and X₂, and lowercase x₁ and x₂ denote deviations of X₁ and X₂ from their sample means. As collinearity increases, the variances and covariance of the two estimators also increase, and in the limit where $r_{12}^2$ equals 1 they are infinite.
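
A small numerical sketch makes the formulas above concrete. The values chosen for the error variance and the sums of squares are hypothetical, and the function names are our own.

```python
def var_beta(sigma2, sum_x_sq, r12):
    """Variance of a slope estimator in the two-regressor model."""
    return sigma2 / (sum_x_sq * (1.0 - r12 ** 2))

def cov_betas(sigma2, sum_x1_sq, sum_x2_sq, r12):
    """Covariance between the two slope estimators."""
    return -r12 * sigma2 / ((1.0 - r12 ** 2) * (sum_x1_sq * sum_x2_sq) ** 0.5)

# Hypothetical inputs: error variance 1, sums of squared deviations 100 each
sigma2, sum_x1_sq, sum_x2_sq = 1.0, 100.0, 100.0

results = {}
for r12 in (0.0, 0.5, 0.9, 0.99):
    results[r12] = (
        var_beta(sigma2, sum_x1_sq, r12),
        cov_betas(sigma2, sum_x1_sq, sum_x2_sq, r12),
    )
    print(r12, round(results[r12][0], 4), round(results[r12][1], 4))
```

With everything else held fixed, the variance at r₁₂ = 0.99 is roughly fifty times the variance at r₁₂ = 0.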

The speed with which the variances and covariances of the estimators increase is measured by the Variance Inflation Factor (VIF). It is calculated as:

$\mathrm{VIF}=\frac{1}{1-r_{12}^2}$

If the X-variables are not correlated at all, then VIF = 1. As the correlation coefficient between X₁ and X₂ tends to 1, VIF tends to infinity, so under perfect correlation VIF is infinite.

The inverse of VIF is called Tolerance. It is calculated as:

$\mathrm{TOL}=\frac{1}{\mathrm{VIF}}$

As VIF increases, TOL decreases: TOL equals 1 when the regressors are uncorrelated and approaches 0 as the correlation approaches 1.

Relationship between Correlation Coefficient, VIF and Tolerance

$r_{12}$	VIF	TOL=1/VIF
0	1.00	1.000
0.5	1.33	0.750
0.7	1.96	0.510
0.8	2.78	0.360
0.9	5.26	0.190
0.95	10.26	0.098
0.97	16.92	0.059
0.99	50.25	0.020
0.995	100.25	0.010
0.999	500.25	0.002
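
The relationship between the correlation coefficient, VIF and TOL can be recomputed directly from the two formulas. The sketch below is only illustrative, and the function names `vif` and `tolerance` are our own.

```python
def vif(r12):
    """Variance inflation factor for the two-regressor case."""
    return 1.0 / (1.0 - r12 ** 2)

def tolerance(r12):
    """Tolerance, the inverse of VIF."""
    return 1.0 - r12 ** 2

rows = []
for r12 in (0, 0.5, 0.7, 0.8, 0.9, 0.95, 0.97, 0.99, 0.995, 0.999):
    rows.append((r12, vif(r12), tolerance(r12)))
    print(f"{r12:<6} {vif(r12):>8.2f} {tolerance(r12):>7.3f}")
```

Note how slowly VIF rises for moderate correlations and how sharply it rises once r₁₂ goes beyond about 0.9.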

Wider Confidence Intervals

Multicollinearity leads to wider confidence intervals because it increases the standard errors of the estimated regression coefficients. The formula to calculate the confidence interval is:

$95\%\,CI \text{ for } \beta_i = \left[\hat{\beta}_i - t_{\alpha/2}\, \mathrm{se}(\hat{\beta}_i),\; \hat{\beta}_i + t_{\alpha/2}\, \mathrm{se}(\hat{\beta}_i)\right]$

Wider confidence intervals mean that we fail to reject the null hypothesis more often, so coefficients appear statistically insignificant.
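
As a rough illustration for the two-regressor case, the standard error is proportional to the square root of VIF, so the width of the interval scales in the same way. Taking r₁₂ = 0.95 as an assumed value:

$\mathrm{se}(\hat{\beta}_i)\propto\sqrt{\mathrm{VIF}}=\sqrt{\frac{1}{1-0.95^2}}=\sqrt{10.26}\approx 3.2$

Holding the error variance and the variation in X fixed, the confidence interval is therefore about 3.2 times as wide as it would be with uncorrelated regressors.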

Insignificant “t-ratios”

Multicollinearity leads to “insignificant” t-ratios because it inflates the standard errors of the estimated coefficients. The t-ratio is the estimated coefficient divided by its standard error, so a larger standard error reduces the value of the t-statistic. The result is a failure to reject the null hypothesis, making the variable appear statistically insignificant even though it may actually be important.

$\hat{t}_{\beta_i}=\frac{\hat{\beta}_i}{\mathrm{SE}(\hat{\beta}_i)}$
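
The effect on the t-ratio can be sketched numerically. The example below holds the estimated coefficient fixed, which is a simplification, and inflates the standard error by the square root of VIF. The coefficient, the baseline standard error and the critical value of 2.0 are all hypothetical.

```python
import math

def t_ratio(beta_hat, se_uncorrelated, r12):
    """t-ratio when the standard error is inflated by sqrt(VIF)."""
    vif = 1.0 / (1.0 - r12 ** 2)
    return beta_hat / (se_uncorrelated * math.sqrt(vif))

beta_hat = 0.8     # hypothetical estimate
se0 = 0.2          # hypothetical standard error when r12 = 0
critical_t = 2.0   # rough two-sided 5% critical value, an assumption

t_values = {}
for r12 in (0.0, 0.5, 0.9, 0.99):
    t = t_ratio(beta_hat, se0, r12)
    t_values[r12] = t
    verdict = "significant" if abs(t) > critical_t else "not significant"
    print(r12, round(t, 2), verdict)
```

The same coefficient that is clearly significant with uncorrelated regressors falls below the critical value once r₁₂ reaches 0.9.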

Causes/Sources of Multicollinearity

Limited variability in data: When data is collected over a limited range of values, the variables may appear more closely related than they actually are.
Constraints on the model or population being sampled: For example, in the regression of electricity consumption (Y) on income (X₁) and house size (X₂), there is a physical constraint in the population in that families with higher incomes generally have larger homes.
Adding polynomial terms: Including polynomial terms (like X, X², X³) or interaction terms (X₁*X₂) can create high correlation among regressors.
Over-determined model: When the number of independent variables (k) exceeds the number of observations (n), the regressors are necessarily linearly dependent, which leads to multicollinearity. This situation is common in medical research, where information on a large number of variables may be collected for a small number of patients.
Common trends: Time series variables like GDP, population, and investment often have similar upward or downward trends over time, leading to multicollinearity.
Dummy variable trap: When we introduce as many dummies as there are categories, without excluding a base/reference category, in a model that also contains an intercept, it leads to perfect multicollinearity.
Different units or scales of measurement: Including the same variable measured in different units (e.g., price in PKR and in USD) may cause collinearity, since one series is close to a rescaled version of the other.
Inclusion of overlapping or closely related variables: For example, including both years of education and education level (primary, secondary, etc.) can cause multicollinearity.
Including lagged terms: In time series data, using lagged values of variables (e.g., $Y_{t-1},\; Y_{t-2}$ ) may lead to multicollinearity if they are highly correlated with each other.
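
Among the sources listed above, the dummy variable trap is the easiest to verify by hand, because the linear relationship is exact. The sketch below uses a made-up sample of six observations in three categories; the category labels are hypothetical.

```python
# Hypothetical sample: six observations falling into three categories
categories = ["A", "B", "C", "A", "B", "C"]
levels = sorted(set(categories))
n = len(categories)

intercept = [1] * n
dummies = {lvl: [1 if c == lvl else 0 for c in categories] for lvl in levels}

# With one dummy per category, the dummies add up to the intercept column,
# which is an exact linear relationship among the regressors.
dummy_sum = [sum(dummies[lvl][i] for lvl in levels) for i in range(n)]
print(dummy_sum == intercept)  # True

# Dropping the base category "A" removes that exact relationship.
kept = {lvl: col for lvl, col in dummies.items() if lvl != levels[0]}
kept_sum = [sum(col[i] for col in kept.values()) for i in range(n)]
print(kept_sum == intercept)   # False
```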

Detection of Multicollinearity

1. High R² but few significant t-ratios

This is the “classic” symptom of multicollinearity. If R² is high, say more than 0.8, and the F test suggests that the overall model is significant, but none or only a few of the individual t-ratios of the regression coefficients are statistically significant, then there might be multicollinearity.

2. High Pair-Wise Correlations among Regressors.

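If the pairwise or zero-order correlation coefficient between two regressors is high, say, more than 0.8, then multicollinearity may be a serious problem. High pairwise correlations are a sufficient but not a necessary condition: with more than two regressors, multicollinearity can exist even when every pairwise correlation is comparatively low.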

3. Examination of Partial Correlations

Farrar and Glauber have suggested examining partial correlation coefficients. Thus, in the regression of $Y = \beta_0 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + u$ , a finding that overall $R^2_{Y.234}$ is very high but $r^2_{Y2.34},\; r^2_{Y3.24},\; r^2_{Y4.23}$ are comparatively low may suggest that the variables $X_2, X_3, X_4$ are highly intercorrelated.

4. Auxiliary Regression.

Another way to detect multicollinearity is to run an auxiliary regression of each explanatory variable on the remaining explanatory variables and compute R² from each regression, which we will call $R_i^2$. From each $R_i^2$ we can compute the F value corresponding to each auxiliary regression from the following equation.

$F_i=\frac{\left(R^2_{x_i\cdot x_2 x_3 \dots x_k}\right)/(k-2)}{\left(1-R^2_{x_i\cdot x_2 x_3 \dots x_k}\right)/(n-k+1)}$

Here n is the sample size and k is the number of explanatory variables including the intercept term, so each auxiliary regression has k − 2 slope coefficients and n − k + 1 residual degrees of freedom.

If the computed F exceeds the critical F at the chosen level of significance, we conclude that the particular Xᵢ is collinear with the other X’s; if it does not exceed the critical F, we say that it is not collinear with the other X’s.
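
The F value for an auxiliary regression is simple to compute once its R² is known. The sketch below assumes that k counts the explanatory variables including the intercept, so the auxiliary regression has k − 2 slope coefficients and n − k + 1 residual degrees of freedom. The input numbers are hypothetical, and the function name is our own.

```python
def auxiliary_f(r2_aux, k, n):
    """F statistic for an auxiliary regression.

    r2_aux: R-squared of X_i regressed on the remaining regressors
    k:      number of explanatory variables including the intercept (assumption)
    n:      sample size
    """
    return (r2_aux / (k - 2)) / ((1.0 - r2_aux) / (n - k + 1))

# Hypothetical example: auxiliary R-squared of 0.9, k = 4, n = 30
f_value = auxiliary_f(r2_aux=0.9, k=4, n=30)
print(round(f_value, 1))  # 121.5
```

The computed value is then compared with the tabulated critical F for (k − 2, n − k + 1) degrees of freedom.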

5. Klein Rule of Thumb.

Klein's rule of thumb states that multicollinearity may be a troublesome problem if the R² obtained from an auxiliary regression is greater than the overall R², that is, the R² from the regression of Y on all the regressors.

6. Tolerance or Variance Inflation Factor.

The variance inflation factor shows how the variance of an estimator is inflated by the presence of multicollinearity. As a rule of thumb, if the VIF of a variable exceeds 10, which happens when $R_j^2$ exceeds 0.90, the variable is said to be highly collinear. The closer TOLⱼ is to zero, the greater the degree of collinearity of that variable with the other regressors. Here $R_j^2$ is the R² from the auxiliary regression of Xⱼ on the other regressors.

$\mathrm{VIF}_j=\frac{1}{1-R_j^2}$

$\mathrm{TOL}_j=\frac{1}{\mathrm{VIF}_j}=1-R_j^2$
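
The sketch below computes VIF and TOL for each regressor by running the auxiliary regressions explicitly. It is a minimal standard-library illustration rather than a production routine: the helpers `solve`, `r_squared` and `vif_and_tol` are our own, and the data are X₂ and X₄ from the earlier example. X₃ is left out because its perfect collinearity with X₂ would make 1 − R² equal to zero.

```python
def solve(a, b):
    """Solve a linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coef = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * coef[j] for j in range(i + 1, n))
        coef[i] = (m[i][n] - s) / m[i][i]
    return coef

def r_squared(y, regressors):
    """R-squared from regressing y on an intercept plus the given columns."""
    n = len(y)
    cols = [[1.0] * n] + [[float(v) for v in c] for c in regressors]
    xtx = [[sum(p * q for p, q in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(p * q for p, q in zip(ci, y)) for ci in cols]
    beta = solve(xtx, xty)
    fitted = [sum(beta[j] * cols[j][i] for j in range(len(cols))) for i in range(n)]
    y_bar = sum(y) / n
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    tss = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - rss / tss

def vif_and_tol(columns):
    """Return {name: (VIF_j, TOL_j)} using one auxiliary regression per regressor."""
    out = {}
    for name, col in columns.items():
        others = [c for other, c in columns.items() if other != name]
        r2_j = r_squared(col, others)
        out[name] = (1.0 / (1.0 - r2_j), 1.0 - r2_j)
    return out

x2 = [10, 15, 18, 24, 30]
x4 = [52, 75, 97, 129, 152]
diagnostics = vif_and_tol({"X2": x2, "X4": x4})
for name, (vif_j, tol_j) in diagnostics.items():
    print(name, round(vif_j, 2), round(tol_j, 4))
```

With only two regressors both variables share the same VIF, and here it is far above the threshold of 10.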

7. Scatterplot

A scatterplot is a natural way to examine the relationship between explanatory variables. If the scatterplot between two X variables is close to a straight line, it suggests that these two X variables are highly collinear.

8. Eigenvalues and Condition Index.

Most statistical software can report the eigenvalues of the X′X matrix and the condition indices used to diagnose multicollinearity. From the eigenvalues, we can derive the condition number, k, and the condition index, CI, by the following formulas.

$k=\frac{\lambda_{\max}}{\lambda_{\min}}$

$\mathrm{CI}=\sqrt{k}$

$\lambda_{\max}$ = maximum eigenvalue
$\lambda_{\min}$ = minimum eigenvalue

If k is between 100 and 1000 (CI between 10 and 30), multicollinearity is moderate to strong. If k exceeds 1000 or CI exceeds 30, multicollinearity is considered severe.
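
For intuition, the two-regressor case can be worked out without any software. The sketch below assumes the regressors have been standardized, so that the cross-product matrix is proportional to the correlation matrix [[1, r₁₂], [r₁₂, 1]], whose eigenvalues are 1 + |r₁₂| and 1 − |r₁₂|. That scaling is an assumption of this illustration, and the function name is our own.

```python
import math

def condition_diagnostics(r12):
    """Condition number and condition index for two standardized regressors."""
    lam_max = 1.0 + abs(r12)  # larger eigenvalue of the correlation matrix
    lam_min = 1.0 - abs(r12)  # smaller eigenvalue of the correlation matrix
    k = lam_max / lam_min
    return k, math.sqrt(k)

summary = {}
for r12 in (0.5, 0.9, 0.99, 0.999):
    k, ci = condition_diagnostics(r12)
    summary[r12] = (k, ci)
    print(r12, round(k, 1), round(ci, 1))
```

Under these assumptions a correlation of 0.99 gives a condition number of about 199, and a correlation of 0.999 pushes the condition index above 30.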

Remedial Measures for Multicollinearity

Do nothing: Multicollinearity is essentially a feature of the sample data, and the OLS estimators remain BLUE, so corrective measures are not always necessary. This is particularly true if the goal is prediction rather than inference.
Use panel data to increase variability: Another approach used to reduce multicollinearity is to combine cross-sectional and time series data (panel data), which increases the variation in the dataset.
Remove highly collinear variables: Dropping one or more variables with high Variance Inflation Factors (VIF > 10) can mitigate multicollinearity. However, this approach may lead to omitted variable bias or misspecification if important variables are removed.
Transform the variables: In time series data, applying transformations such as logarithms, first differences, or growth rates can help eliminate trends and reduce multicollinearity.
Increase the sample size: Adding more observations increases the variability in the data and can help reduce multicollinearity.
Use dimensionality reduction techniques: Multivariate techniques like Factor Analysis and Principal Components Analysis (PCA) can be used to combine collinear variables.
Apply regularisation methods: Techniques such as Ridge Regression and Lasso Regression introduce a penalty term to the regression estimates, shrinking the coefficients and helping to stabilise the model when multicollinearity is present.
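
To illustrate the last remedy, the sketch below compares OLS with ridge regression for two centered regressors. It is only a minimal illustration: the regressors are X₂ and X₄ from the earlier example, the dependent variable y is made up, the penalty value of 10 is arbitrary, and the regressors are not standardized as they usually would be in practice. The function names are our own.

```python
def centered(values):
    """Return deviations from the sample mean."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def ridge_two_regressors(y, x1, x2, penalty):
    """Solve (X'X + penalty * I) b = X'y for two centered regressors.

    A penalty of 0 reproduces the OLS slope estimates.
    """
    y, x1, x2 = centered(y), centered(x1), centered(x2)
    s11 = sum(a * a for a in x1) + penalty
    s22 = sum(a * a for a in x2) + penalty
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * b for a, b in zip(x1, y))
    s2y = sum(a * b for a, b in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    return b1, b2

x2 = [10, 15, 18, 24, 30]
x4 = [52, 75, 97, 129, 152]
y = [30, 41, 52, 66, 80]  # hypothetical dependent variable

ols = ridge_two_regressors(y, x2, x4, penalty=0.0)
ridge = ridge_two_regressors(y, x2, x4, penalty=10.0)
print("OLS:  ", [round(b, 3) for b in ols])
print("Ridge:", [round(b, 3) for b in ridge])
```

The penalty adds a constant to the diagonal of X′X, which moves the determinant away from zero and shrinks the overall size of the coefficient vector, at the cost of introducing some bias.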

Muhammad Minhaj Akhtar

Muhammad Minhaj Akhtar is a Lecturer in Economics at Government Graduate College Jauharabad, Pakistan. He holds an M.Phil. in Economics from Quaid-i-Azam University, Islamabad, and an MSc in Economics from the University of Sargodha, where he earned a Silver Medal. His academic passion lies in Econometrics, with a strong focus on applying empirical methods to real-world economic issues. Through MinhajMetrixHub, he shares learning resources, research guidance, and practical econometric insights for students and researchers.