## Correlation and Collinearity explained in layman's terms

**Correlation** tells you how two numerical variables relate to each other. It tells you whether data points with a higher-than-average value for one variable also tend to have a higher-than-average value for the other variable (positive correlation), a lower-than-average value (negative correlation), or whether there is no such relationship (correlation close to zero).

Some examples:

- Height of a person and weight of a person have a high correlation: tall people tend to be heavier than shorter people. The correlation will be positive.
- Height of a person and the number formed by the last 4 digits of their phone number are uncorrelated (the correlation will be close to 0) because they are independent of each other.
- The traffic density (number of cars driving at a given time) is negatively correlated with the average speed (if there is more traffic, there will be longer queues at traffic lights, and more people taking turns or moving in and out of traffic, all of which slows everyone down).
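The three cases above can be sketched with simulated data. This is only an illustration: the numbers (average height, slope, noise levels) are made up for the example, and `pearson_r` is a small helper defined here, not something from the original post.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the run is reproducible
n = 1000

# Positive correlation: weight rises with height, plus noise (made-up numbers).
height_cm = rng.normal(170, 10, n)
weight_kg = 70 + 0.9 * (height_cm - 170) + rng.normal(0, 8, n)

# No correlation: the phone digits are generated independently of height.
phone_last4 = rng.integers(0, 10_000, n)

# Negative correlation: average speed falls as traffic density rises.
cars = rng.uniform(0, 100, n)
speed_kmh = 60 - 0.4 * cars + rng.normal(0, 5, n)


def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    return np.corrcoef(x, y)[0, 1]


r_height_weight = pearson_r(height_cm, weight_kg)
r_height_phone = pearson_r(height_cm, phone_last4)
r_cars_speed = pearson_r(cars, speed_kmh)

print(f"height vs weight:       {r_height_weight:+.2f}")
print(f"height vs phone digits: {r_height_phone:+.2f}")
print(f"cars vs average speed:  {r_cars_speed:+.2f}")
```

The first value should come out clearly positive, the second close to zero, and the third clearly negative.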

**What is the difference between collinearity and correlation?**

Correlation means two variables vary together: when one changes, the other tends to change too. Correlation does not tell you how much one variable changes per unit of the other (the slope of the relationship), only the direction of the relationship and how noisy it is.

Correlation is an **operator**: it is something we compute for a pair of variables, so we can talk about the correlation between height and weight. The correlation can be positive, negative, or 0.

Collinearity is a **phenomenon** related to regression, in which some of the predictor variables are **highly correlated among themselves**. This makes one or more of these variables redundant in our analysis. For example, if you regress “Household expenditure” on “Household income” and “Tax paid in the last year”, income and tax paid will be highly correlated, so there will be collinearity in this setup. It would be better to regress “Household expenditure” on either “Household income” or “Tax paid in the last year”, but not both.

In multiple regression analysis, if one of the predictors is linearly dependent on another predictor (or on a combination of the other predictors), the issue is known as collinearity.

For example, let’s consider the linear model

$$Y = \alpha + \beta_1 x_1 + \beta_2 x_2 \qquad (1)$$

If predictor x1 can be expressed as a linear combination of x2, say x1 = 3*x2, then this is known as collinearity among the predictors. Note that the predictors are then perfectly (or very highly) correlated, which conflicts with the linear regression assumption that no predictor is a linear combination of the others (often stated loosely as "the predictors are independent").

Essentially, it means that one of the independent variables is not really necessary to the model because its effect on the model is already captured by some of the other variables. This variable contributes nothing extra to the predictions and can be removed. If we have true collinearity (perfect correlation, as in the example above), some software, such as R, automatically drops one of the predictors, while other software shows an error or warning.
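Here is a small sketch of the perfect-collinearity case x1 = 3*x2, using NumPy. The data and the two coefficient vectors are invented for illustration. The point is that the design matrix loses a rank, so different coefficient choices produce exactly the same predictions and the model cannot tell them apart.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n = 50

x2 = rng.normal(size=n)
x1 = 3 * x2  # x1 is an exact linear function of x2

# Design matrix with an intercept column, x1 and x2.
X = np.column_stack([np.ones(n), x1, x2])

# Three columns, but only two of them carry independent information.
rank = np.linalg.matrix_rank(X)
print("columns:", X.shape[1], "rank:", rank)

# Two different coefficient vectors (intercept, beta1, beta2)...
beta_a = np.array([1.0, 1.0, 0.0])  # put all the weight on x1
beta_b = np.array([1.0, 0.0, 3.0])  # put all the weight on x2

# ...give the same fitted values, so the data cannot separate them.
same_predictions = np.allclose(X @ beta_a, X @ beta_b)
print("same predictions:", same_predictions)
```

This is why one of the two predictors can be dropped without losing anything.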

The effects of collinearity are seen in the variances of the parameter estimates, not in the parameter estimates themselves.
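A quick simulation can make this visible. The sketch below assumes a made-up model y = 1 + 2*x1 + 2*x2 + noise and refits it many times, once with uncorrelated predictors and once with predictors correlated at 0.99. The function name `slope_estimates` and all the numbers are illustrative choices, not part of the original post.

```python
import numpy as np


def slope_estimates(rho, n=100, trials=500, seed=0):
    """Refit y = 1 + 2*x1 + 2*x2 + noise many times and
    return the estimated coefficient of x1 from each fit."""
    rng = np.random.default_rng(seed)
    cov = [[1.0, rho], [rho, 1.0]]  # correlation rho between x1 and x2
    estimates = np.empty(trials)
    for t in range(trials):
        x = rng.multivariate_normal([0.0, 0.0], cov, size=n)
        y = 1 + 2 * x[:, 0] + 2 * x[:, 1] + rng.normal(size=n)
        X = np.column_stack([np.ones(n), x])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # ordinary least squares
        estimates[t] = beta[1]
    return estimates


low = slope_estimates(rho=0.0)    # predictors uncorrelated
high = slope_estimates(rho=0.99)  # predictors almost collinear

print(f"rho = 0.00: mean {low.mean():.2f}, std {low.std():.2f}")
print(f"rho = 0.99: mean {high.mean():.2f}, std {high.std():.2f}")
```

In both settings the estimates should average out near the true value of 2, but with the highly correlated predictors they scatter much more widely from one fit to the next.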

**How could I test whether a calculated correlation coefficient between two variables is meaningful or not?**

The correlation coefficient R lies between -1 and +1.

As a rule of thumb, if |R| >= 0.75, we say that the variables are highly correlated. Similarly, we speak of poor correlation for |R| <= 0.25 and moderate correlation for 0.25 < |R| < 0.75.
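These cut-offs are easy to turn into a small helper. The function name `correlation_strength` is just an illustrative choice, and the thresholds are the rule-of-thumb values from the paragraph above, not a universal standard.

```python
def correlation_strength(r):
    """Label a correlation coefficient using the rule-of-thumb cut-offs."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie between -1 and +1")
    size = abs(r)  # the sign only gives the direction, not the strength
    if size >= 0.75:
        return "high"
    if size <= 0.25:
        return "poor"
    return "moderate"


for value in (0.8, -0.9, 0.5, -0.1):
    print(value, correlation_strength(value))
```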

The coefficient of determination, R squared, measures the ratio of explained variation (the variation in one variable associated with changes in the other) to total variation.

For example, if R = 0.8 (high correlation), then R squared = 0.64.

Hence only 64% of the variation in one variable is explained by the other variable. The rest of the variation (36%) is caused by other factors.

So it is a good idea to interpret your result after calculating R squared.
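To see the link between R and R squared in practice, here is a minimal sketch on simulated data. The data is constructed (as an assumption for this example) so that the true correlation is 0.8. It computes R directly, then fits a straight line and computes R squared as explained variation over total variation.

```python
import numpy as np

rng = np.random.default_rng(seed=2)
n = 500

# Built so that the true correlation between x and y is 0.8.
x = rng.normal(size=n)
y = 0.8 * x + 0.6 * rng.normal(size=n)

# Correlation coefficient R.
r = np.corrcoef(x, y)[0, 1]

# Fit y = a + b*x by least squares (polyfit returns the slope first).
b, a = np.polyfit(x, y, deg=1)
residuals = y - (a + b * x)

# R squared = 1 - unexplained variation / total variation.
r_squared = 1 - np.sum(residuals**2) / np.sum((y - y.mean()) ** 2)

print(f"R           = {r:.3f}")
print(f"R * R       = {r * r:.3f}")
print(f"R squared   = {r_squared:.3f}")
```

For a straight-line fit with one predictor, the last two printed values agree, which is why squaring R gives the share of variation explained.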

Also, great care should be taken (using ROL, expert opinion, or judgement based on common sense) when making decisions on the basis of these two measures, because we may sometimes get a high value of R and R squared between two variables just by chance. An example is the correlation between the amount of rain in a particular city over the last year and the number of deaths due to cancer in that city.

For more details, see nonsense correlation / spurious correlation.
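A rough simulation shows how this can happen. The sketch below assumes twelve data points per variable (think of twelve monthly values) and generates many pairs of completely unrelated random series, then counts how often the correlation still looks "high" by the 0.75 rule of thumb. All the sizes here are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=3)
n_points = 12      # e.g. twelve monthly values
n_pairs = 10_000   # number of unrelated pairs to try

# Each row of a and the same row of b form one pair of independent series.
a = rng.normal(size=(n_pairs, n_points))
b = rng.normal(size=(n_pairs, n_points))

# Row-wise Pearson correlation, computed for all pairs at once.
a_c = a - a.mean(axis=1, keepdims=True)
b_c = b - b.mean(axis=1, keepdims=True)
r = (a_c * b_c).sum(axis=1) / np.sqrt(
    (a_c**2).sum(axis=1) * (b_c**2).sum(axis=1)
)

# Share of unrelated pairs that still look highly correlated.
share_high = np.mean(np.abs(r) >= 0.75)
print(f"pairs with |R| >= 0.75: {share_high:.2%}")
```

The share is small, but it is not zero: with few data points, some unrelated pairs will look strongly correlated purely by chance, so a high R on its own is not proof of a real relationship.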

Keep Learning 🙂

The Data Monk
