Correlation and Collinearity explained in layman's terms
Correlation tells you how two numerical variables relate to each other. It tells you whether data points with a higher-than-average value for one variable are also likely to have a higher-than-average value for the other (positive correlation), a lower-than-average value (negative correlation), or whether there is no such relationship (correlation close to zero).
Some examples:
- Height of a person and weight of a person have a high positive correlation: tall people tend to be heavier than short people.
- Height of a person and the number formed by the last 4 digits of their phone number are uncorrelated (the correlation will be close to 0) because they are independent of each other.
- Traffic density (the number of cars on the road at a given time) is negatively correlated with average speed: when there is more traffic, queues at traffic lights grow longer and more cars are merging in and out of the flow, so everyone moves more slowly.
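These relationships are easy to check numerically. A minimal sketch using NumPy's `corrcoef`; the height/weight numbers below are made up for illustration, not real measurements:

```python
import numpy as np

# Hypothetical heights (cm) and weights (kg) of ten people
height = np.array([150, 155, 160, 165, 170, 175, 180, 185, 190, 195])
weight = np.array([50, 54, 57, 62, 66, 71, 75, 80, 84, 90])

# Pearson correlation: the off-diagonal entry of the 2x2 correlation matrix
r = np.corrcoef(height, weight)[0, 1]
print(f"correlation(height, weight) = {r:.3f}")  # strongly positive, close to +1

# Last 4 digits of a phone number are independent of height,
# so their correlation with height carries no information
rng = np.random.default_rng(0)
phone = rng.integers(0, 10_000, size=10)
r_phone = np.corrcoef(height, phone)[0, 1]
print(f"correlation(height, phone digits) = {r_phone:.3f}")
```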
What is the difference between collinearity and correlation?
Correlation means two variables vary together: if one changes, so does the other, on average. The correlation coefficient does not tell you the size of the effect (the slope); it tells you the direction of the relationship and how tightly the points follow it (how noisy it is).
Correlation is defined between a pair of variables, so we can talk about the correlation between height and weight. It can be positive, negative, or zero.
Collinearity is a phenomenon in regression in which some of the predictor variables are highly correlated with one another. This makes one or more of these variables redundant in the analysis. For example, if you wish to regress “Household expenditure” on “Household income” and “Tax paid in the last year”, income and tax paid will be highly correlated (i.e., there is collinearity in this setup). It would be better to regress “Expenditure” on either “Income” or “Tax paid”, not both.
If, in a multiple regression analysis, one of the predictors is linearly dependent on another predictor, the issue is known as collinearity.
For example, let’s consider the linear model
Y = α + β1·x1 + β2·x2 + ε … (1)
If predictor x1 can be expressed as a linear combination of x2, say x1 = 3·x2, then we have collinearity among the predictors. Note that there will be perfect (or very high) correlation between the predictors, which violates the assumption of the linear regression model that the predictors are independent of one another.
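The perfect-collinearity case (x1 = 3·x2) can be verified numerically: the design matrix loses a rank, so the OLS normal equations have no unique solution. A minimal sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
x2 = rng.standard_normal(50)
x1 = 3 * x2                              # perfect collinearity: x1 = 3 * x2
X = np.column_stack([np.ones(50), x1, x2])  # intercept + two predictors

# The three columns span only two directions, so the rank is 2, not 3,
# and X'X is singular (its condition number is astronomically large)
print(np.linalg.matrix_rank(X))
print(np.linalg.cond(X.T @ X))
```

This is exactly the situation where statistical software either drops one of the predictors automatically or raises a warning.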
Essentially it means that one of the independent variables is not really necessary to the model, because its effect on the response is already captured by the other variables. This variable contributes nothing extra to the predictions and can be removed. With true collinearity (perfect correlation, as in the example above), one of the predictors is automatically dropped by some software, such as R; other packages show an error or warning instead.
The effects of collinearity are seen in the variances of the parameter estimates, not in the parameter estimates themselves.
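One way to see this effect on the variances numerically (a sketch, assuming two standardized predictors with a chosen correlation r) is to compare the diagonal of (X'X)⁻¹, which is proportional to the variance of the OLS coefficient estimates; it grows roughly like the variance inflation factor 1/(1 − r²) as the predictors' correlation increases:

```python
import numpy as np

def coef_variance_factors(r, n=1000, seed=1):
    """Diagonal of (X'X)^(-1): proportional to Var(beta_hat) under OLS."""
    rng = np.random.default_rng(seed)
    x1 = rng.standard_normal(n)
    # Mix x1 with fresh noise so corr(x1, x2) is approximately r
    x2 = r * x1 + np.sqrt(1 - r**2) * rng.standard_normal(n)
    X = np.column_stack([x1, x2])
    return np.diag(np.linalg.inv(X.T @ X))

low  = coef_variance_factors(r=0.1)   # nearly independent predictors
high = coef_variance_factors(r=0.95)  # highly collinear predictors
print(low, high)  # coefficient variances blow up in the collinear case
```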
How could I test whether a calculated correlation coefficient between two variables is meaningful or not?
The correlation coefficient R lies between -1 and +1.
As a rough rule of thumb, if |R| >= 0.75 we say the variables are highly correlated; |R| <= 0.25 indicates poor correlation, and 0.25 < |R| < 0.75 indicates moderate correlation.
The coefficient of determination, R squared, measures the ratio of explained variation (the variation in one variable accounted for by the other) to total variation.
For example, if R = 0.8 (high correlation), then R squared = 0.64.
Hence only 64% of the variation in one variable is accounted for by the other variable; the remaining 36% is due to other factors.
So it is advisable to interpret your result after calculating R squared as well.
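The arithmetic from the example above can be checked directly (the 0.8 is the hypothetical correlation from the text):

```python
r = 0.8                        # hypothetical correlation coefficient
r_squared = r ** 2             # coefficient of determination
explained = r_squared * 100    # percent of variation explained
unexplained = 100 - explained  # percent left to other factors
print(f"{explained:.0f}% explained, {unexplained:.0f}% unexplained")
```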
Also, great care should be taken (using a review of the literature, expert opinion, or common-sense judgement) when making decisions on the basis of the two measures above. Sometimes we may get a high value of R and R squared between two variables purely by chance, for example, a correlation between the amount of rain in a particular city over the last year and the number of deaths due to cancer in that city.
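How easily a "high" correlation can arise by chance, especially with few data points, can be simulated: draw many pairs of small, completely independent samples and count how often |R| crosses the 0.75 rule-of-thumb threshold. A sketch, assuming 5-point samples:

```python
import numpy as np

# Two *independent* variables can still show a large |r| purely by chance
# when the sample is small. Simulate many 5-point pairs and count how often
# |r| >= 0.75 ("high" by the rule of thumb above).
rng = np.random.default_rng(42)
trials, n = 10_000, 5
count = 0
for _ in range(trials):
    a = rng.standard_normal(n)
    b = rng.standard_normal(n)  # drawn independently of a
    if abs(np.corrcoef(a, b)[0, 1]) >= 0.75:
        count += 1
print(f"{count / trials:.1%} of independent 5-point samples look 'highly correlated'")
```

A sizeable fraction of these purely random pairs cross the threshold, which is exactly why small-sample correlations need extra scrutiny.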
For more details, see nonsense correlation (also called spurious correlation).
Keep Learning 🙂
The Data Monk