Data Science Terms That Are Often Confused

Data Science = Maths + Code + Business Understanding

Many a time you come across terminology that sounds confusing. We will try to make these terms easier for you to understand and remember.

1. Data Scientist vs Data Analyst

A Data Scientist helps you understand the “what ifs” associated with a problem, whereas a Data Analyst gets you insights, builds reports, and presents them to the client.

Data Scientists mostly work on solving long-term problems like building an image processor, optimizing sales routes, forecasting, etc.
Data Analysts, on the other hand, are mostly occupied with urgent requests, ad hoc asks, etc. They do have the liberty to work on ML/AI, but it will not be the major chunk of their work.

A Data Scientist tries to find answers to their own questions, whereas a Data Analyst answers the questions that are asked.

And then there are Decision Scientists 😛

2. Linear Regression vs Logistic Regression

- Linear Regression is used when the dependent variable is continuous and the regression line is linear in nature, whereas Logistic Regression is used when the dependent variable is binary in nature.
Example – Forecasting McDonald’s sales is a Linear Regression problem. Predicting whether a person is depressed is a Logistic Regression problem.

- Linear Regression gives you a value, whereas Logistic Regression gives you the probability of success or failure.

- A prerequisite or assumption of Linear Regression is a linear relationship between the dependent and independent variables. Logistic Regression does not need this (strictly speaking, it assumes a linear relationship between the independent variables and the log-odds instead).

- In Linear Regression the independent variables may be somewhat correlated with each other. On the contrary, in Logistic Regression the variables must not be correlated with each other (strong multicollinearity distorts the coefficients of both models, but it is especially problematic here).

- Logistic Regression is used for binary classification.
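A minimal sketch of both in Python with scikit-learn; every number below is invented purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical data: daily footfall (X) vs sales (continuous Y)
footfall = np.array([[120], [150], [180], [210], [260], [300]])
sales = np.array([1400, 1700, 2050, 2400, 2900, 3350])

lin = LinearRegression().fit(footfall, sales)
print(lin.predict([[240]]))        # a continuous value

# Hypothetical data: hours of sleep (X) vs depressed yes/no (binary Y)
sleep_hours = np.array([[4.0], [5.0], [5.5], [7.0], [8.0], [9.0]])
depressed = np.array([1, 1, 1, 0, 0, 0])

log = LogisticRegression().fit(sleep_hours, depressed)
print(log.predict_proba([[6.0]]))  # probabilities of the two classes
```

Note how the linear model returns a number while the logistic model returns class probabilities, which is exactly the second bullet above.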

3. Dependent vs Independent variable

The dependent variable is your “Y” and the independent variable is your “X” in the equation
Y = mX + constant

An independent variable is a variable that is changed in a scientific experiment to test the effects on the dependent variable.

Example – If I want to forecast the temperature of Bangalore on 15th August using variables like the previous week’s temperature, humidity, wind speed, etc., then Y is the temperature you want to forecast, and the X variables are humidity, wind speed, and so on.
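As a quick sketch of that equation in Python (the weather numbers are invented, and a single X is used to keep it readable):

```python
import numpy as np

# Hypothetical X: humidity (%) on previous days; Y: temperature (deg C)
humidity = np.array([60, 65, 70, 75, 80])
temperature = np.array([30.0, 29.0, 27.5, 26.0, 25.0])

# Fit Y = mX + constant: np.polyfit returns the slope m and the constant
m, constant = np.polyfit(humidity, temperature, 1)
print(f"Y = {m:.2f} * X + {constant:.2f}")
```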

4. Forecasting vs Prediction

These two look fairly similar, but there is a striking difference between them.
- Forecasting is a way of estimating future values by looking at historical data, whereas prediction is a broader attempt at finding answers about the future.
- Forecasting is scientific, whereas prediction is subjective (and vague).

Example – You can forecast the number of biryanis sold at Mani’s Dum Biryani on a weekend by looking at the historical data.
An astrologer can predict your future by looking at your ever-changing palm lines.
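A toy version of the biryani forecast, assuming we simply average the recent weekends (a real model would also capture trend and seasonality):

```python
import numpy as np

# Hypothetical biryanis sold on the last six weekends
past_weekends = np.array([210, 195, 225, 240, 205, 230])

# Naive forecast: the mean of recent history
forecast = past_weekends.mean()
print(f"Expected biryanis next weekend: {forecast:.0f}")
```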

5. K-NN vs K means

– kNN stands for K Nearest Neighbours, whereas K-means is, well, K-means.
– kNN is a supervised learning algorithm used for classification and regression problems, whereas K-means is an unsupervised learning algorithm used for clustering problems.
– Basically, in kNN we set the value of K, which tells the algorithm how many neighbours a new data point has to consider before it is classified into one bucket.

Example – Suppose we want to classify images into two groups, i.e. Cat and Dog. Since it’s a supervised learning technique, there will be some coordinates for the already classified images.

Picture a scatter plot where each red circle is a cat, each blue circle is a dog, and the black rectangle is the new data point. The number of lines drawn to neighbours is equal to the value of K.

Suppose we have a really messy coordinate system and have set K = 6 for this dataset, i.e. 6 nearest neighbours. Whenever a new data point arrives, the algorithm draws connections to its 6 nearest values. If 4 of those neighbours are blue (dog) and 2 are red (cat), the new data point is classified as blue, i.e. a dog.
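A minimal kNN sketch of that picture; the 2-D coordinates and labels below are made up to mirror the example:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical coordinates for already classified images
X = np.array([[1, 2], [2, 1], [2, 3], [1, 1], [3, 2],
              [6, 5], [7, 7], [6, 6], [7, 5], [8, 6]])
y = np.array(["cat", "cat", "cat", "cat", "cat",
              "dog", "dog", "dog", "dog", "dog"])

# K = 6: a new point is labelled by a majority vote of its 6 nearest neighbours
knn = KNeighborsClassifier(n_neighbors=6).fit(X, y)
print(knn.predict([[6, 4]]))   # lands among the dog points -> 'dog'
```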

Since K-means is an unsupervised learning method, we don’t have a training dataset with the correct outputs. If you are unsure about the difference between supervised and unsupervised learning methods, go through the next point first.
K-means belongs to the family of moving-centroid algorithms, i.e. at every iteration the centroids of the clusters move slightly to minimize the objective function.

Basically, you start by placing K centroids on the coordinate plane, assign each data point to its nearest centroid, and then move each centroid to the mean of the points assigned to it. This assign-and-adjust loop repeats until the centroids stop moving.
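A minimal K-means sketch on invented points; scikit-learn runs the assign-and-adjust loop internally:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical unlabelled points: two visible groups, but no labels supplied
X = np.array([[1, 1], [1.5, 2], [2, 1.5], [8, 8], [8.5, 9], [9, 8]])

# K = 2: place 2 centroids, assign points, move centroids to the cluster means, repeat
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)            # cluster id for each point, e.g. [0 0 0 1 1 1]
print(km.cluster_centers_)   # final centroid positions
```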

6. Supervised Learning vs Unsupervised Learning

Supervised means something you can monitor. Supervised learning includes all the algorithms where you know the output for some data. You train your model on these data points, assuming they are correct, and then build a model on top of them.

Example – We want to know the number of customers who will come to my restaurant in November. I have the number of customers who visited my restaurant over the last 3 years. Since we have these data points from the past, we can build a forecasting model on them and then predict the customers visiting in the coming November.

Anything for which we know the output for at least a few data points will fall under supervised learning.

A supervised learning algorithm needs some known outputs to build a model. An unsupervised learning algorithm needs no labels at all: it builds a model on your training dataset by finding connections between different values, and it keeps iterating until all the data points are considered. An example will help you understand better:

Example – You have objects with different geometric shapes: some are circular, some oval, square, rectangular, etc. You need to bucket these into 4 groups, as sketched in the code below. The algorithm you use knows nothing about the buckets themselves; it only knows that you need 4 of them. It will most probably take the first 4 items and place them on a coordinate plane. Each incoming object is then allocated near whichever of the four buckets it best matches, and the algorithm keeps iterating until all the items are placed. By the end of the run, you will have 4 buckets. This is unsupervised learning.
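To make the contrast concrete, here is a hedged sketch using invented width/height measurements: the same points are first classified with known labels (supervised), then bucketed with no labels at all (unsupervised):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical features per object: [width, height]
X = np.array([[1, 1], [1.1, 0.9], [2, 1], [2.2, 1.1],
              [1, 2], [0.9, 2.1], [3, 3], [3.1, 2.9]])

# Supervised: we also know the correct shape label for each object
labels = np.array(["square", "square", "rect", "rect",
                   "oval", "oval", "circle", "circle"])
clf = KNeighborsClassifier(n_neighbors=3).fit(X, labels)
print(clf.predict([[2.1, 1.0]]))   # uses the known outputs -> 'rect'

# Unsupervised: same X, no labels; we only say we want 4 buckets
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
print(km.labels_)                  # bucket ids the algorithm invented itself
```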

7. Training vs Test Dataset

Suppose you have 1,000 rows of data and you want to create a forecasting model. You start with the LSTM algorithm.

Since forecasting is supervised learning and you have 1,000 rows of historical data, you will split your dataset 80:20 or 70:30, or any breakdown depending on the business need.

Now you would like to build your model, i.e. you want to train it to behave the same way when it sees new data. The model will create a lot of “ifs and buts”; the dataset on which it builds this set of rules is called the training dataset. This dataset gives the gist of the rules governing the model.

Now you have trained your model on 80% of the data. You would not like to test the model directly on real-time data, right?
You would not want your model to break down on new data points. So you test it on the set for which you already know the output, i.e. the 20% of the dataset whose output you know but which was not used to train the model. A minimal split looks like the sketch below.
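A minimal sketch of that split, assuming scikit-learn; the 1,000 rows are randomly generated here, and a plain linear model stands in for the LSTM:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical dataset: 1000 rows, one feature, noisy linear target
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(1000, 1))
y = 3 * X.ravel() + rng.normal(0, 1, size=1000)

# 80% to learn the rules, 20% held back to check them on "unseen" data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
print("train R^2:", model.score(X_train, y_train))
print("test  R^2:", model.score(X_test, y_test))
```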

This is not that difficult to understand once you build any statistical model.

8. Random Forest vs Decision Tree

If you have participated in hackathons, you must know these algorithms. These two, along with XGBoost, are one of the best combinations for getting a good rank on the leaderboard.

– A decision tree is built on the entire dataset, using all the features/variables of interest, whereas a random forest randomly selects observations/rows and specific features/variables to build multiple decision trees from, and then averages the results.

– A Random Forest is made of many Decision “Trees”. So, a Decision Tree is more like a flow chart that finally gets you to a decision. When you combine these trees, you get a forest.

– The reason why a Random Forest is almost 99 to 100% accurate on the training dataset is that it takes all the possible combinations to boil down to the already provided output. But it might fail to give the same result on the testing dataset whenever there is a new combination of attributes. This is called overfitting.
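A hedged sketch of that overfitting gap on synthetic data: both models score near 100% on the training set, and the test scores show how well each one actually generalizes:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data, purely for illustration
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Near-perfect training scores; the drop on the test set is the overfitting
for name, model in [("tree", tree), ("forest", forest)]:
    print(name, model.score(X_train, y_train), model.score(X_test, y_test))
```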

We will keep on updating the article.

Keep Learning 🙂

The Data Monk