Supply Chain Analytics – Using PuLP in Python

Three types of mathematical programming are commonly used for Supply Chain problems:

a. Linear Programming – The model is built on continuous variables
b. Integer Programming – The model is built only on discrete (integer) variables
c. Mixed Integer Programming – The model mixes continuous and discrete variables

What is PuLP?
PuLP is a Python Library that enables users to describe mathematical programs. PuLP works entirely within the syntax and natural idioms of the Python language by providing Python objects that represent optimization problems and decision variables, and allowing constraints to be expressed in a way that is very similar to the original mathematical expression.

PuLP has focused on supporting linear and mixed-integer models.

We will be using PuLP to solve some Supply Chain Problems

Introduction to PuLP in Supply Chain Analytics

PuLP, as you know, is an Integer Programming/Linear Programming modeler. There are three parts to creating a model in PuLP:-
a. Decision Variables – These are the variables which impact the Supply Chain and whose values you can choose. For example, the number of pressure cookers is a decision variable when cooking rice: more pressure cookers will help you cook more rice
b. Objective Function – This is the mathematical expression, written in terms of the decision variables, that you want to optimize, i.e. either maximize or minimize. For example, you have 3 pressure cookers and 2 stoves, and you have to cook rice and lentils. You need an objective function that maximizes the rice and lentils cooked in 1 hour using these utensils
c. Constraints – These are the limits within which the optimized solution must stay, for example the number of stoves available.

Let’s take a case study of Supply Chain optimization.

There is a Restaurant which serves Mega Pizza (40”). It has 1 oven, 3 bakers, and 2 packers. The table below shows the time each resource needs for one pizza of each type, the number of days each resource works, and the profit per pizza

Resource | Number    | Pizza A | Pizza B | Pizza C | Working Days
Oven     | 1 Oven    | 1 Day   | 0.5 Day | 1 Day   | 30 Days
Baker    | 3 Bakers  | 1 Day   | 2 Days  | 2 Days  | 30 Days
Packer   | 2 Packers | 1 Day   | 1 Day   | 1 Day   | 20 Days
Profit   |           | $30     | $40     | $50     |

Now you have to maximize the profit using the PuLP library. Use decision variables, an objective function, and constraints.

How many pizzas of each type should we make in 30 days?

First let’s look into the coding part in Python

from pulp import *

model = LpProblem("Maximize_Pizza_Profit", LpMaximize)

# Declare decision variables: number of pizzas of each type (non-negative integers)
A = LpVariable('A', lowBound=0, upBound=None, cat='Integer')
B = LpVariable('B', lowBound=0, upBound=None, cat='Integer')
C = LpVariable('C', lowBound=0, upBound=None, cat='Integer')

# Define objective function: total profit ($30, $40, $50 per pizza)
model += 30*A + 40*B + 50*C

# Define constraints
# For Oven: 1 oven x 30 days
model += 1*A + 0.5*B + 1*C <= 30
# For Baker: 3 bakers x 30 days
model += 1*A + 2*B + 2*C <= 90
# For Packer: 2 packers x 20 days
model += 1*A + 1*B + 1*C <= 40

# Solve Model
model.solve()
print("Produce {} Pizza A".format(A.varValue))
print("Produce {} Pizza B".format(B.varValue))
print("Produce {} Pizza C".format(C.varValue))

Now let’s understand the code

from pulp import *
Here you are importing the complete package

model = LpProblem("Maximize_Pizza_Profit", LpMaximize)
Here you are defining the model using LpProblem. The name is written with underscores because PuLP does not allow spaces in problem names. LpMaximize tells the solver to maximize the objective, i.e. the profit. If you want the minimum value from the model, use LpMinimize instead, for example when the objective is to reduce wastage.

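A = LpVariable('A', lowBound=0, upBound=None, cat='Integer')
Here we define each variable using LpVariable. lowBound is the lowest possible value of the variable. The number of pizzas cannot be negative, so we have given the value 0.
upBound is the maximum value of the variable. None means that there is no upper limit.
cat is the category of the variable. It can be 'Continuous', 'Integer', or 'Binary'.

model += 30*A + 40*B + 50*C
This is the objective function, i.e. the total profit: $30, $40, and $50 for each Pizza A, B, and C respectively. When an expression without a comparison (<=, >=, ==) is added to the model, PuLP treats it as the objective.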

model += 1*A + 0.5*B + 1*C <= 30
This is the constraint for the Oven. A requires 1 day, B requires 0.5 day, and C requires 1 day. The limit is 30 because there is one oven which works for 30 days.

model += 1*A+2*B+2*C <=90
Similarly, a Baker needs 1, 2, and 2 days for A, B, and C respectively. There are 3 Bakers who work 30 days each, so the limit is 30*3 = 90.

model += 1*A+1*B+1*C <= 40
A Packer takes 1, 1, and 1 day for A, B, and C respectively. There are 2 Packers who work 20 days each, so the limit is 20*2 = 40.

model.solve()
This runs the solver. Once it finishes, A.varValue, B.varValue, and C.varValue hold the number of pizzas of each type to produce.
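
Because this case study is tiny, you can cross-check the model without any solver. The snippet below is only a sanity-check sketch, not part of PuLP: it tries every whole-number combination of pizzas allowed by the table (profits of $30, $40, $50 and limits of 30, 90, and 40 days) and keeps the most profitable one. The function name best_plan is made up for this illustration.

```python
# Brute-force check of the pizza case study using only the standard library.
# Figures come from the table: profits 30/40/50, oven 1 x 30 days,
# bakers 3 x 30 = 90 days, packers 2 x 20 = 40 days.

def best_plan():
    best = (0, (0, 0, 0))  # (profit, (A, B, C))
    for a in range(0, 31):          # the oven alone caps A at 30
        for b in range(0, 41):      # the packers alone cap B at 40
            for c in range(0, 31):  # the oven alone caps C at 30
                # Oven constraint A + 0.5B + C <= 30, doubled to stay in integers
                oven_ok = 2 * a + b + 2 * c <= 60
                baker_ok = a + 2 * b + 2 * c <= 90
                packer_ok = a + b + c <= 40
                if oven_ok and baker_ok and packer_ok:
                    profit = 30 * a + 40 * b + 50 * c
                    if profit > best[0]:
                        best = (profit, (a, b, c))
    return best

profit, (a, b, c) = best_plan()
print(f"Pizza A: {a}, Pizza B: {b}, Pizza C: {c}, profit: ${profit}")
# prints: Pizza A: 0, Pizza B: 20, Pizza C: 20, profit: $1800
```

Enumeration like this only works for toy problems; a real Supply Chain model has far too many combinations, which is exactly why a modeler such as PuLP is used.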

This is a simple example with fixed constraints. We will deal with variable components in the next article 🙂

Keep Learning 🙂

The Data Monk

Supply Chain Analytics

We all have a fair idea about the supply chain. In layman’s terms, supply chain analytics helps in improving operational efficiency and effectiveness by enabling “data-driven” decisions at the operational and strategic level.

There are four types of analytics which one can perform to boost the performance of a supply-chain enabled business. A short story makes them easy to remember:-

Kartik had a mild migraine a few days ago. He ignored it and continued with his daily routine. After a few days, he found that the pain was getting worse with time, so he consulted Dr. Nikhil, who first asked for his medical history/reports, i.e. weight, sugar level, blood pressure, etc.
Then he looked into the reports and tried to diagnose the reason behind this abrupt pain.
The ache had been there all the time, which made the Doctor believe that it was bound to recur in the future. So, after looking at all the major points, Nikhil prescribed some medicine to Kartik.

What is what?

Reports ~ KPIs of the industry and business i.e. Descriptive Analytics
Diagnosis ~ Looking for reasons for the numbers present in the report i.e. Diagnostic analytics
Prediction of future pain to Kartik ~ Predictive analytics
Prescribing medicine ~ Looking at all the past behavior and KPIs, we do Prescriptive analytics

1. Descriptive analytics – You have some historic data and you need to find the performance of KPIs; this type of analysis is called descriptive analysis. It helps us find answers to questions like how many products were sold, how products performed in different stores, how stores performed in terms of revenue, etc.

Basically, it gives you a gist of the current state of a company

2. Diagnostic analytics – While descriptive analysis tells you about the KPIs of the company, diagnostic analytics tells you about the underlying issue. If the descriptive analysis tells you that Product A is not performing well in the Whitefield Store of Walmart, then the diagnostic analysis will aim at finding the underlying reasons for the same.

3. Predictive analytics –

“Do you understand the difference between Forecasting and prediction?”
Forecasting is the use of historic data which holds some pattern to give a number for the future, i.e. you are basically extrapolating the past pattern. Prediction, on the other hand, is a vaguer term which also takes expected future changes into account.

When I go through the last 40 months of data to estimate the number of joints rolled by Aman in the next month, it is a case of forecasting. But if I read the palm of Ramesh and tell him his future by considering the present and future behavior of the stars, then it’s a prediction.

Predictive analytics uses statistical techniques to estimate the likelihood of future events such as stock-outs or movements in your product’s demand curve. It provides the foresight for focused decision making that avoids likely problems in the future.

4. Prescriptive Analytics – Now it’s time to take an overview of all the analytics components and provide solutions which can improve the performance of the business. This is done by prescriptive analytics.

Descriptive talks about the KPIs, diagnostic tries to find out the reason behind these numbers, predictive estimates the future performance of the business from historic data and expected changes, and prescriptive provides the final prescriptions!

Components of Supply Chain Analytics:-

Overall, supply chain analytics can be divided into 5 parts:-

1. Inventory Management – This part looks after the “store-house” of a company. The major parts of analytics here are

a. Inventory Cost Optimization
b. Optimal Stocking
c. Stock-out Prediction

2. Sourcing – How to fulfil the demand

a. Optimized Order Allocation
b. Arrival time optimization
c. Sourcing cost analysis

3. Vendor Management – How to optimize vendors for your company

a. Fault Rate Analysis
b. Profitability Analysis
c. Vendor Scorecard

4. Returns Management – What to do if a product is returned?

a. Returns tracking
b. Salvaging optimization
c. Cost Recovery Analysis

5. Network Planning – How to optimize the transport network to maximize profit?

a. Trailer Utilization
b. Freight Cost Optimization
c. Vehicle Routing

What are the five stages of Supply Chain?

You can divide the whole Supply Chain process into 5 stages
a. Plan – Every company needs a strategy for managing its resources in order to meet customer demand for its products and services
b. Source – To create their products, companies need to choose carefully the suppliers who deliver the goods and services they need
c. Make – In manufacturing, the supply chain manager schedules the activities needed for production, testing, packaging, and preparation for delivery.
d. Deliver – This part is usually referred to as logistics. Here companies coordinate the receipt of orders, pick carriers to get products to customers, and develop a network of warehouses.
e. Return – In many companies this is where the supply chain runs into problems. The planners should create a flexible and responsive network for receiving defective and excess products sent back by customers.

Common Terminologies in Supply Chain

1. Back Ordering – When a customer orders a product which you don’t currently have in your inventory, you accept the order and place an order with your supplier to fulfil it once the stock arrives. This is called Back-Ordering

2. Blanket Order – It is a large purchase order placed by the buyer, which the supplier fulfils over a period of time through deliveries whose dates are not fixed in advance. It’s just like saying “I need 5000 Light candles before October 31st”. A large order like this aims at a good discount before a festive or high-demand season

3. Consignment –  This term has more than one meaning. Most often it means the act of placing your goods in the care of a third-party warehouse owner (known as the consignee) who maintains them for a fee. In addition to storing the goods, the consignee may sell or ship them to customers on your behalf for a fee. As a separate meaning, consignments can also refer to individual shipments made to the consignee.

4. Drop Shipment – You create a website and list a few things which are available in a nearby store. As soon as an order is placed on your website, you pass the order to the nearby store, which delivers it to the customer’s place. Your profit is the difference between the price paid by the customer and the store’s product and delivery cost. Here you do not need an inventory; in fact, you do not need any storehouse or large capital investment to start an e-commerce business

5. Groupage –  This is a method of grouping multiple shipments from different sellers (each with its own bill of lading) inside a single container. This is done when individual shipments are less than the container load or in other words are not big enough by themselves to fill up an entire container. This way, the freight cost is split between these sellers.

6. JIT – Just-in-time is an inventory optimization method where every batch of items arrives “just in time” to fulfil the needs of the next stage, which could be either a shipment or a production cycle.

7. Landed Cost – The total cost of ownership of an item. This includes the cost price, shipping charges, customs duties, taxes and any other charges that were borne by the buyer.

8. Waybill – A document prepared by the seller, on behalf of the carrier, that specifies the shipment’s point of origin, the details of the transacting parties (the buyer and seller), the route, and the destination address.

You can look for more definitions and KPIs related to Supply chain. But, this is a decent way to start the exploration.

We will deal with implementing a simple Supply Chain problem using PuLP in Python in our next article.

Keep Learning 🙂

The Data Monk

Data Science Terms which are often confused

Data Science = Maths+Code+Business Understanding

Many a time you come across terminologies which sound confusing. We will try to make them easier for you to understand and remember

1. Data Scientist vs Data Analyst

A Data Scientist helps you understand the “what-ifs” associated with a problem, whereas a Data Analyst gets you insights, builds reports, and presents them to the client.

Data Scientists mostly work on long-term problems like building an image processor, optimizing sales routes, forecasting, etc.
Data Analysts, on the other hand, are mostly occupied with urgent requests, ad hoc analyses, etc. They do have the liberty to work on ML/AI, but it will not be the major chunk of their work.

A Data Scientist tries to find answers to their own questions, whereas a Data Analyst answers the questions asked.

And then there are Decision Scientists 😛

2. Linear Regression vs Logistic Regression

-Linear regression is used when the dependent variable is continuous and the nature of the regression line is linear, whereas Logistic Regression is used when the dependent variable is binary in nature.
Example – Forecasting the sales of McD is a Linear Regression problem. Predicting whether a person is depressed is a Logistic Regression problem.

-Linear Regression gives you a value whereas Logistic Regression gives you the probability of success or failure

-The pre-requisite or assumption of Linear Regression is a “linear relationship between the dependent and independent variable”. Logistic Regression does not assume this; it instead assumes a linear relationship between the independent variables and the log-odds of the outcome

-In both Linear and Logistic Regression, the independent variables should not be highly correlated with each other, since multicollinearity makes the coefficients unreliable

-Logistic Regression is used for Binary classification
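
To see the "value vs probability" difference in numbers, here is a minimal sketch. The slopes and intercepts are invented purely for illustration (they are not fitted to any data); the point is only that the linear output can be any number while the logistic output always lies between 0 and 1.

```python
import math

def linear_output(x, slope, intercept):
    # Linear regression returns the value of the straight line itself.
    return slope * x + intercept

def logistic_output(x, slope, intercept):
    # Logistic regression passes the same straight line through the
    # sigmoid function, which squeezes it into a probability.
    z = slope * x + intercept
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients, chosen only for this illustration.
print(linear_output(10, slope=2.0, intercept=5.0))     # a value: 25.0
print(logistic_output(10, slope=0.3, intercept=-2.0))  # a probability between 0 and 1
```

For binary classification you would then pick a cut-off (0.5 is a common choice) and call everything above it a "success".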

3. Dependent vs Independent variable

The dependent variable is your “Y” and Independent variable is your “X” in the equation
Y = mX+constant

An independent variable is a variable that is changed in a scientific experiment to test the effects on the dependent variable.

Example – If I want to forecast the temperature of Bangalore on 15th August using variables like the temperature of the previous week, humidity, wind speed, etc., then Y is the temperature which I want to forecast and the X variables are the previous week’s temperature, humidity, wind speed, etc.

4. Forecasting vs Prediction

These two look fairly similar but there is a striking difference between the two.
-Forecasting is a way of finding future values by looking at the historical data, whereas prediction is a broader judgment about the future.
-Forecasting is scientific whereas prediction is subjective (and vague)

Example – You can forecast the number of Biryanis sold at Mani’s Dum Biryani on a weekend by looking at the historical data.
An astrologer can predict your future by looking at your ever-changing palm lines.
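
Forecasting in its simplest form is extending the historical trend. The sketch below fits a straight line through made-up weekend sales figures (the numbers are invented, and fit_trend is a hypothetical helper name) and extends it one step into the future. Real forecasting models also handle seasonality and noise, so treat this only as an illustration of the idea.

```python
def fit_trend(values):
    """Least-squares straight line through (0, values[0]), (1, values[1]), ..."""
    n = len(values)
    xs = list(range(n))
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up history: Biryanis sold on the last five weekends.
weekend_sales = [100, 110, 120, 130, 140]

slope, intercept = fit_trend(weekend_sales)
next_weekend = intercept + slope * len(weekend_sales)  # extend the line by one step
print(next_weekend)  # 150.0
```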

5. K-NN vs K means

– kNN stands for K Nearest Neighbors, whereas K-means is named after the K cluster means (centroids) it computes
– kNN is a supervised learning algorithm used for classification and regression problems, whereas K-means is an unsupervised learning algorithm used for clustering problems
– Basically, in kNN, we set the value of K, which tells the algorithm how many neighbors of a new data point to consider before classifying it into one bucket.

Example – Suppose we want to classify images into two groups, i.e. Cat and Dog. Since it’s a Supervised Learning technique, there will be some co-ordinates for the already classified images

Here, the red circles are Cats, the blue ones are Dogs, and the black rectangle is the new data point. The number of lines is equal to the value of K

Above we have a really messy co-ordinate system. You have already specified the value of K for this dataset, i.e. the number of Nearest Neighbors, as 6. Now whenever a new data point arrives, it draws 6 connections to the nearest points. Here you can see that 4 of these are blue circles and 2 are red. So, the new data point will be classified as blue, i.e. Dog.
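
The same vote can be written in a few lines of Python. This is a minimal sketch with invented co-ordinates (they are not the points from the picture): with K = 6, the new point gets 4 Dog neighbors and 2 Cat neighbors, so it is classified as Dog. The function name knn_classify is made up for this example.

```python
import math
from collections import Counter

def knn_classify(labelled_points, new_point, k):
    """Return the majority label among the k labelled points closest to new_point."""
    by_distance = sorted(labelled_points,
                         key=lambda item: math.dist(item[0], new_point))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Invented, already-classified images placed on a 2-D co-ordinate system.
labelled_points = [
    ((5, 6), "Dog"), ((6, 5), "Dog"), ((4, 5), "Dog"), ((5, 4), "Dog"),
    ((6, 6), "Cat"), ((4, 4), "Cat"),
    ((0, 0), "Cat"), ((0, 10), "Cat"), ((10, 10), "Cat"), ((10, 0), "Dog"),
]
new_point = (5, 5)

print(knn_classify(labelled_points, new_point, k=6))  # Dog
```

An odd K is often preferred for two-class problems because it avoids tied votes.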

Since K-means is an unsupervised learning method, we don’t have a training dataset with the correct output. If you are unsure about the difference between supervised and unsupervised learning methods then go through the next point first.
K-means belongs to the family of moving centroid algorithms, i.e. at every iteration the centroid of each cluster moves slightly to minimize the objective function (the total squared distance between the data points and the centroid of their cluster).

Basically, you start by choosing K initial centroids on the co-ordinates. Each data point is then assigned to its nearest centroid, and each centroid is moved to the mean of the points assigned to it. These two steps are repeated until the centroids stop moving.
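
Here is a minimal sketch of the moving-centroid idea, assuming the simplest possible start (the first K points are used as the initial centroids; real implementations usually choose them more carefully). The points and the function name k_means are invented for this illustration.

```python
import math

def k_means(points, k, iterations=10):
    centroids = list(points[:k])  # simplistic start: the first k points
    clusters = []
    for _ in range(iterations):
        # Step 1: assign every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 2: move every centroid to the mean of its assigned points.
        for i, cluster in enumerate(clusters):
            if cluster:  # an empty cluster keeps its old centroid
                centroids[i] = tuple(sum(coord) / len(cluster)
                                     for coord in zip(*cluster))
    return centroids, clusters

# Two obvious groups: one near (1, 1) and one near (8, 8).
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, k=2)
print(centroids)
print(clusters)
```

Even though both starting centroids sit in the same group, the second centroid drifts towards the far group within a couple of iterations, which is the "moving" part of the algorithm.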

6. Supervised Learning vs Unsupervised Learning

Supervised means something which you can monitor. Supervised learning includes all the algorithms where you know the output for some of the data. You train your model on this data, assuming that these data points are correct, and then use the trained model on new data.

Example – We want to know the number of customers who will come to our restaurant in November. We have the number of customers who visited the restaurant over the last 3 years. So, we have some data points from the past; we can build a forecasting model using these data points and then predict the customers visiting in the coming November.

Anything for which we know the output for a few data points will fall under supervised learning

A supervised learning algorithm needs some known output to build a model. An unsupervised learning algorithm needs no known output. It builds a model on your dataset by finding connections between different values, and it keeps iterating until all the data points are considered. An example will help you understand better:-

Example – You have things with different geometric shapes: some are circular, some are oval, square, rectangular, etc. You need to bucket these into 4 parts. The algorithm which you will use does not know anything about the buckets; it only knows that you need 4 of them. It will most probably take the first 4 items and place them on a co-ordinate system. Each object coming in will then be allocated to the nearest of the four buckets. The algorithm will keep iterating till you are done with all the items. By the end of the run, you will have 4 buckets. This is unsupervised learning

7. Training vs Test Dataset

Suppose you have 1000 rows of data and you want to create a forecasting model. You start with the LSTM algorithm.

Since forecasting is a supervised learning problem and you have 1000 rows of historical data, you will split your dataset 80:20, 70:30, or in any other ratio depending on the business need.

Now you would like to build your model, i.e. you want to train your model so that it acts in a similar manner when it gets a new dataset. The model will create a lot of “ifs and buts”; the dataset on which it builds its set of rules is called the training dataset. This dataset gives a gist of the rules governing the model.

Now you have trained your model on 80% of the data. You would not like to test the model on real-time data, right?
Because you would not like your model to break down on new data points. So, you have to test the model on data for which you already know the output, i.e. the 20% of the dataset which was not included in training the model. This is the test dataset.

This is not that difficult to understand once you build any statistical model.
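
The 80:20 split itself is a one-liner. The sketch below assumes the 1000 rows are already sorted by date, so the oldest 80% is used for training and the most recent 20% for testing, which is a common choice for forecasting problems. The list of numbers is just a stand-in for real rows, and split_train_test is a made-up helper name.

```python
def split_train_test(rows, train_fraction=0.8):
    """Keep the order of the rows: first part for training, the rest for testing."""
    cut = round(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

history = list(range(1000))  # stand-in for 1000 rows of historical data
train, test = split_train_test(history, train_fraction=0.8)

print(len(train), len(test))  # 800 200
```

For problems where the order of rows does not matter, the rows are usually shuffled before splitting.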

8. Random Forest vs Decision Tree

If you have participated in Hackathons, then you must know these algorithms. These two and XGB are among the best combinations to get a good rank on the table.

– A decision tree is built on an entire dataset, using all the features/variables of interest, whereas a random forest randomly selects observations/rows and specific features/variables to build multiple decision trees from and then averages the results.

– Random Forest is made of many Decision “Trees”. So, a Decision Tree is more like a flow chart which finally gets you to a decision. When you combine these trees, you get a forest.

– The reason why a Random Forest is often almost 99 to 100% accurate on the training dataset is that its fully grown trees can memorize the combinations of attributes that lead to the already provided output. But it might fail to give the same accuracy on the testing dataset whenever there is a new combination of attributes. This is called overfitting. Averaging many trees makes a Random Forest less prone to it than a single Decision Tree, but does not remove it.

We will keep on updating the article.

Keep Learning 🙂

The Data Monk

Linear Regression Part 3 – Evaluation of the model

Check out Part 1 and Part 2 of the series before going further

Linear Regression Part 1 – Assumption and Basics of LR
Linear Regression Part 2 – Code and implementation of model

Accuracy is not the only measure to evaluate a Linear Regression model. There are multiple ways in which you can evaluate an LR model; we will discuss four of these:-
1. R Square
2. Adjusted R-Square
3. F-Test
4. RMSE

SST i.e. Sum of Squares Total – How far the data are from the mean
SSE i.e. Sum of Squares Error – How far the data are from the model’s predicted values

R Squared = (SST-SSE)/SST
It indicates the goodness of fit of the model.
R-squared has the useful property that its scale is intuitive: it ranges from zero to one, with zero indicating that the proposed model does not improve prediction over the mean model, and one indicating perfect prediction. Improvement in the regression model results in proportional increases in R-squared.

The more predictors, the better the R-Squared value. Is this statement true? If yes, then how do we counter this?
This is true: R-squared never decreases when a predictor is added, even if the predictor is useless. That’s why we do not use R-Squared as a success metric for models with a lot of predictor variables. The adjusted R-squared is a modified version of R-squared that has been adjusted for the number of predictors in the model. The adjusted R-squared increases only if the new term improves the model more than would be expected by chance. It decreases when a predictor improves the model by less than expected by chance. The adjusted R-squared can be negative, but it’s usually not. It is never higher than the R-squared.

Hence, if you are building a Linear Regression on multiple variables, it is always suggested that you use Adjusted R-squared to judge the goodness of the model.

R-squared measures the proportion of the variation in your dependent variable (Y) explained by your independent variables (X) for a linear regression model. Adjusted R-squared adjusts the statistic based on the number of independent variables in the model.

Adjusted R-squared will decrease as predictors are added if the increase in model fit does not make up for the loss of degrees of freedom.
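
A small worked example may help. The sketch below computes R-squared exactly as defined above, (SST-SSE)/SST, and then the usual adjusted version, 1 - (1 - R²)(n - 1)/(n - k - 1), where n is the number of rows and k the number of predictors. The actual and predicted values are invented for illustration, and the "2 predictors" is an assumption about the imaginary model that produced them.

```python
def r_squared(actual, predicted):
    mean_actual = sum(actual) / len(actual)
    sst = sum((y - mean_actual) ** 2 for y in actual)           # distance from the mean
    sse = sum((y - p) ** 2 for y, p in zip(actual, predicted))  # distance from the model
    return (sst - sse) / sst

def adjusted_r_squared(r2, n_rows, n_predictors):
    # Penalizes R-squared for every extra predictor in the model.
    return 1 - (1 - r2) * (n_rows - 1) / (n_rows - n_predictors - 1)

# Made-up values, assumed to come from a model with 2 predictors.
actual = [10, 20, 30, 40, 50]
predicted = [12, 18, 33, 37, 50]

r2 = r_squared(actual, predicted)                   # SST = 1000, SSE = 26
adj_r2 = adjusted_r_squared(r2, n_rows=5, n_predictors=2)
print(round(r2, 3), round(adj_r2, 3))               # 0.974 0.948
```

Notice that the adjusted value is lower than the plain R-squared, and the gap grows as you add predictors without adding rows.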

3. F-Test
The F-test evaluates the null hypothesis that all regression coefficients are equal to zero versus the alternative that at least one is not. An equivalent null hypothesis is that R-squared equals zero
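
Because of that equivalence, the overall F statistic can be computed directly from R-squared. A minimal sketch, assuming a model with an intercept, n rows, and k predictors, where F = (R²/k) / ((1 - R²)/(n - k - 1)). The numbers passed in are invented; turning F into a p-value needs the F distribution, which is left out here.

```python
def f_statistic(r_squared, n_rows, n_predictors):
    explained = r_squared / n_predictors                         # per predictor
    unexplained = (1 - r_squared) / (n_rows - n_predictors - 1)  # per residual degree of freedom
    return explained / unexplained

# Hypothetical model: R-squared of 0.75 from 23 rows and 2 predictors.
print(f_statistic(0.75, n_rows=23, n_predictors=2))  # about 30
```

The larger the F statistic, the stronger the evidence against the null hypothesis that all the coefficients are zero.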


4. RMSE
Root Mean Square Error takes the difference between each predicted and actual value, squares it, averages the squares over the total number of terms, and then takes the square root of that average.

As you can observe, RMSE penalizes large differences in prediction quite heavily because the differences are squared.
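
To see the penalty in action, the sketch below compares RMSE with the plain average of absolute differences on invented numbers: three perfect predictions and one miss of 4. The helper mean_absolute_error is added only for comparison and is not one of the four measures discussed in this article.

```python
import math

def rmse(actual, predicted):
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

actual = [2, 4, 6, 8]
predicted = [2, 4, 6, 12]  # three perfect predictions and one miss of 4

print(rmse(actual, predicted))                 # 2.0
print(mean_absolute_error(actual, predicted))  # 1.0
```

The single large miss pushes the RMSE to twice the average absolute difference.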

There are other methods also to determine the performance of the Linear Model. These three articles will definitely help you to kick-start your “modeling career”

Keep Learning 🙂

The Data Monk

Linear Regression Part 2 – Implementation of LR

We already know the assumptions of Linear Regression. We will quickly go through the implementation part of Linear Regression. We will be using R for this article (Read – Sometimes, I am more comfortable in R 😛)

Remember – The easiest part in any modeling project is to implement the model. The major pain point is cleaning the data, understanding important variables, and checking the performance of the model.

We will quickly go through the packages which you need, the functions and terminologies you need to understand in order to run a Linear Regression Model.

Packages in R
You don’t need any specific package to run the lm() function in R which is used to create a Linear Regression model.

Step 1 – Get your train and test datasets into variables.
Remember – The names of the columns and the number of columns in both datasets should be the same.

pizza <- read.csv("C:/Users/User/Desktop/TDM Book/PaulPizza.csv")
pizzaTest <- read.csv("C:/Users/User/Desktop/TDM Book/PaulPizzaTest.csv")

pizza contains the training dataset and pizzaTest contains the testing dataset

Top rows of the training dataset
Top rows of the testing dataset

You can also do a multi-fold (cross) validation to randomly decide the training and test datasets. But we will try to keep it straight and simple, so we have manually chosen the training and testing datasets. We do not encourage you to do it like this. Anyway, we now have the training and testing datasets with us.

LinearModel <- lm(NumberOfPizzas ~ Spring+WorkingDays,
data = pizza)

You are creating an LR model by the name of LinearModel. The function lm() takes a formula with the dependent variable (NumberOfPizzas) and two independent variables, i.e. Spring and WorkingDays. The dataset is pizza, i.e. the training dataset of the problem statement.

Let’s look into the summary to analyze the model


What are the residuals?
In regression analysis, the difference between the observed value of the dependent variable (y) and the predicted value (ŷ) is called the residual (e). Each data point has one residual. For a model with an intercept, both the sum and the mean of the residuals are equal to zero.

What is coefficient-estimate?
The first row (the intercept) is the expected number of pizzas sold in a month when all the independent variables are zero. Since the accuracy of the model is quite poor, the number is quite off: it shows that on average 156691 pizzas are predicted to be sold. The remaining rows show the impact of each variable on this estimate; these are called the slope terms. For example, the slope of Summer is 169.29, which is the estimated effect that Summer has on the number of pizzas sold.

What is coefficient-standard error?
It suggests the average amount by which the coefficient estimate varies from the actual value. The standard error can be used to compute an estimate of the expected difference.

What is coefficient-t value?
The t-value tells you how many standard errors the coefficient estimate is away from 0. The farther away it is, the easier it is to reject the null hypothesis, i.e. we can declare a relationship between Y and x1, x2, etc. In the above case only the Working Days coefficient has a t-value relatively far away from 0, which suggests a significant relationship between the number of pizzas sold and Working Days. This also helps us in understanding the p-value of that variable.
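
To make the Estimate, Std. Error, and t value columns less mysterious, here is a minimal Python sketch that computes them by hand for a one-variable regression. The four data points are invented (they are not from PaulPizza.csv), and the formulas are the textbook ones for a single predictor; lm() does the equivalent for several predictors at once.

```python
import math

# Made-up monthly data: working days vs. pizzas sold.
working_days = [20, 22, 24, 26]
pizzas_sold = [200, 230, 235, 270]

n = len(working_days)
mean_x = sum(working_days) / n
mean_y = sum(pizzas_sold) / n

# Least-squares slope and intercept (the "Estimate" column).
sxx = sum((x - mean_x) ** 2 for x in working_days)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(working_days, pizzas_sold))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residual = observed value minus predicted value; together they sum to zero.
residuals = [y - (intercept + slope * x) for x, y in zip(working_days, pizzas_sold)]

# Standard error of the slope, and its t-value (estimate / standard error).
residual_variance = sum(e ** 2 for e in residuals) / (n - 2)
slope_std_error = math.sqrt(residual_variance / sxx)
t_value = slope / slope_std_error

print("slope:", slope, "intercept:", intercept)
print("sum of residuals:", round(sum(residuals), 9))
print("std. error:", round(slope_std_error, 4), "t value:", round(t_value, 4))
```

A slope that is many standard errors away from 0 gives a large t-value, which is what makes a variable look "good" in the summary.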

What should be the value of a good variable which we should include in our model?
A good variable for the model will have the following attributes:-
i.   High Coefficient – t value
ii.  Very low Pr(>|t|) value

What according to you is the most important variable in the model summary given above?
Looking at the t value and Pr value, Working Days seems to be the most relevant variable for the model

Now you need to use this model to predict the values for the testing dataset, which is stored in pizzaTest


pred_model <- predict(LinearModel,pizzaTest)

You can check the accuracy of the values by comparing it with the actual data for these three months.

There are more ways to check the performance of a Linear Regression model which we will discuss in the next article.

Linear Regression Part 1 – Assumptions and basics of LR
Linear Regression Part 3 – Evaluation of the model

Keep Learning 🙂

The Data Monk

Linear Regression Part 1 – Assumptions of LR

Here we will talk about concepts which will let you understand Linear Regression and will help you tackle questions in your interview. I personally know a handful of candidates who have boasted of being masters of Linear Regression but have failed to answer basic questions (Read – Me :P)

That’s why we have tried to consolidate all the learnings from our peer group.

To understand a Data Science algorithm you need to cover at least these three things:-
1. Assumptions
2. Code and Mathematical Formulas
3. Evaluating Performance of your model

Regression is a method of modeling a dependent value based on independent variables. This model is used extensively to solve a basic forecasting problem.

Do you want to know what the next question asked by the panel was (Lowe’s Technical Interview Round)?

What is the difference between forecasting and prediction?
These two do look the same, but there is a striking difference between them. Forecasting is based on historical data, and it’s mostly about projecting a line of future values by extending the historical trend.

Prediction is a judgment. It takes into account changes which are expected to take place in the future.

When you say that it will rain tomorrow by looking at the historical data, that’s forecasting, but reading a palm and telling the future is an example of prediction. So, be ready for all types of questions.

Let’s get back to Linear regression. In a layman term, we can say that Regression is a method to predict the outcome of a variable in the best possible way given the past data and its outcomes.

Example – Can you forecast the salary of the age group 40-45 using the given data?


You can guess that the Salary would be somewhere in the range of $4,000 – $4,500 looking at the already given outcomes. This is how a basic Linear Regression works.

The core of Linear Regression is to understand how the outcome variable is dependent on the independent variables.

What are the assumptions of Linear Regression?

This is one of the most asked questions in an interview which revolves around Linear Regression. There are five major assumptions of a Linear Regression:-
1. Linear Relationship
2. Multivariate Normality
3. Low or No Multicollinearity
4. No autocorrelation
5. Homoscedasticity

Don’t just memorize these points; you need to understand each of them before diving into the implementation part.

1. Linear Relationship – A linear regression assumes that there is a straight-line relationship between X and Y in the equation given below

Y = B0 + B1X + Epsilon

Y = B0 + B1X is nothing but the equation of a straight line

To do any statistical inference, we need to make some assumptions about the error term, which is represented as Epsilon. We assume three things about this random error term:-

a. Mean of the error is 0
b. Error is normally distributed
c. Error has a constant variance

Read the following line at least thrice to understand it

“Error Term ~ N(0, Variance) = This shows that every error term is normally distributed with a mean of 0 and a constant variance”

Remember the equation Y = B0 + B1X + Epsilon?

Now we can re-write the equation as Y ~ N(B0 + B1X, Variance)

This means that Y is normally distributed with a mean of B0 + B1X and a constant variance.

So the first assumption goes like this: “The dependent variable is a linear combination of the independent variable and the error term”

You can check the relationship between the dependent and independent variable by a scatter plot

A low or little linearity present in the dataset

2. Multivariate Normality

The second assumption states that every Linear combination of Y with the independent variables needs to have a univariate Normal distribution. Multivariate normality is not a fancy term, it is just a generalization of the one-dimensional normal distribution to higher dimensions

3. Low or No Multicollinearity

This one is easy to understand. Suppose you want to find out the selling price of a car and your independent variables are age of the car, Kilometers, health of engine, etc.

Now, we know that the number of Kilometers and the age of the car will (generally) have a high correlation. The number of kilometers traveled by the car will increase with the age of the car.

Using two highly correlated variables in a Linear Regression makes the coefficients unstable and hard to interpret, while adding little new information. So you need to chuck one of these. We will talk about two ways in which you can decide which of age and kilometers to remove

1. Variance Inflation Factor (VIF)
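VIF measures how much the variance of a coefficient is inflated because that variable is correlated with the other independent variables. A VIF above 10 (some people use a stricter cut-off of 5) is commonly treated as an indication of high multicollinearity. In layman's terms, remove the variable with the high VIF value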

2. Correlation Matrix – Plot a correlation matrix to understand the strength of correlation between two variables. Drop one of the two correlated variables at a time and check the performance of the model
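Both checks can be sketched in a few lines of plain Python. The sketch below assumes a model with only two predictors, in which case VIF reduces to 1 / (1 - r^2), where r is the correlation between the two. The car data and the function names are hypothetical. With more predictors, r^2 is replaced by the R-squared of regressing one predictor on all the others.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def vif_two_predictors(x1, x2):
    """VIF of either predictor when the model contains only x1 and x2."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)

# Hypothetical used-car data
age_years = [1, 2, 3, 4, 5, 6]
kilometers = [12000, 19000, 33000, 41000, 48000, 62000]
engine_health = [7, 9, 6, 8, 7, 9]   # score out of 10

vif_age_km = vif_two_predictors(age_years, kilometers)
vif_age_engine = vif_two_predictors(age_years, engine_health)

print("age vs kilometers   :", round(vif_age_km, 1))
print("age vs engine health:", round(vif_age_engine, 2))
```

Age and kilometers move together, so their VIF is large, while age and engine health give a VIF close to 1.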

4. Low or No Correlation between the data

In the last point we talked about the multicollinearity between the independent variables. Now, we want to check if there is a correlation between the data itself.

In simple language, if the value of f(x+1) is dependent on f(x) then the data is auto-correlated. A classic example is the share price where the price is dependent on the previous value. You can check the correlation of the data by either a scatter plot or a Durbin-Watson test. The null hypothesis of the Durbin-Watson test is that “the residuals are not linearly correlated”

The test statistic d always lies between 0 and 4, and a value around 2 means no autocorrelation. As a rule of thumb, if 1.5 < d < 2.5 then the values are not auto-correlated.

See, if the data itself is correlated then it would be hard to know the impact of other variables on Y. So it is assumed that there is no or little correlation between the data
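The Durbin-Watson statistic is simple enough to compute by hand: it is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. Here is a minimal sketch, assuming you already have the residuals of a fitted model in time order. The three residual series are invented for illustration.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic d for a list of residuals in time order."""
    numerator = sum(
        (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Hypothetical residual series
trending = [1.0, 0.8, 0.6, 0.3, -0.1, -0.4, -0.7, -0.9]      # each value close to the previous one
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]   # each value flips sign
mixed = [0.5, -0.3, -0.4, 0.6, 0.2, -0.7, 0.1, 0.4]          # no obvious pattern

d_trending = durbin_watson(trending)        # well below 2: positive autocorrelation
d_alternating = durbin_watson(alternating)  # well above 2: negative autocorrelation
d_mixed = durbin_watson(mixed)              # near 2: little autocorrelation

print(round(d_trending, 2), round(d_alternating, 2), round(d_mixed, 2))
```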

5. Homoscedasticity

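Homoscedasticity means that the residuals have the same variance across the regression line. If the spread of the residuals changes along the line (for example, it fans out as X grows), the data is heteroscedastic and this assumption is violated.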
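One rough way to check whether the spread of the residuals stays constant is to order them by X, split them into a lower and an upper half, and compare the variance of the two halves. This is only a quick sanity check, not a formal test, and the residual values and the name spread_ratio are hypothetical.

```python
import statistics

def spread_ratio(residuals_ordered_by_x):
    """Variance of the upper half of the residuals divided by that of the lower half."""
    half = len(residuals_ordered_by_x) // 2
    lower = residuals_ordered_by_x[:half]
    upper = residuals_ordered_by_x[-half:]
    return statistics.pvariance(upper) / statistics.pvariance(lower)

# Hypothetical residuals, already ordered by the value of X
constant_spread = [0.5, -0.4, 0.3, -0.5, 0.4, -0.3, 0.5, -0.4]
funnel_shaped = [0.1, -0.1, 0.2, -0.2, 0.8, -0.9, 1.5, -1.6]

ratio_constant = spread_ratio(constant_spread)  # close to 1: similar spread in both halves
ratio_funnel = spread_ratio(funnel_shaped)      # far from 1: spread grows with X

print(round(ratio_constant, 2), round(ratio_funnel, 2))
```

A ratio far from 1 in either direction is a hint that the variance is not constant and that a residual plot deserves a closer look.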

We will look into the implementation part in the next article which will be followed by the evaluation of performance.

Linear Regression Part 2 – Implementation of LR
Linear Regression Part 3 – Evaluation of the model

Keep Learning 🙂

The Data Monk

Guesstimate – Price of one Kilogram Potato in India ?

Let’s try to guesstimate the price of one kg potato in India. There could be multiple ways to do it, but your aim should be to keep the “scope of error” to the minimum.

To estimate the price of one kg potato, you can take up any food item which has potato in it.

You can estimate the price by using french fries, Potato burger, etc.

I will choose the classic Indian snack, the Samosa, to guesstimate the price of a kg of potato

Why Samosa and why not french fries?

Advantage of choosing Samosa –
– We know the lowest price of Samosa i.e. Rs. 10 from any roadside vendor
– You can easily estimate the operational cost and profit of a roadside vendor

Disadvantages of French fries:-
– The price range of French fries is huge, which ranges from Rs. 50 to Rs. 200
– If you want to estimate the price of potato using French Fries from McDonalds then you need to estimate the operational cost.

We start from a few simple assumptions to keep the error percentage low and arrive at a close number.

– One samosa weighs 150 gms
– Price of one samosa is Rs. 10

We will divide the price of a samosa into 4 parts
– Profit
– Cost Price of Potato
– Price of Flour, oil, spice and gas
– Operational cost (labor and rent, if any)

We can safely assume that flour will weigh around 30 gm in each samosa and the rest will be potato, so we have 120 gms of potato

Profit – 30% (You can take it anywhere between 20% and 40% because it seems logical to expect such a high percentage of profit considering the fact that the Selling Price is on the lower side)

Profit – Rs. 3

Flour, oil, spices, and Gas – Flour and oil used by the roadside vendors are generally of cheap quality, so we can assume that 20% of the selling price goes into these items, i.e. Rs. 2

Operational Cost – There will be some operational cost like salary to one or two permanent staff and rent of the place. Let’s take it as 20% i.e. Rs. 2

Now we are left with only Potato, which is the main ingredient of Samosa

That leaves Rs. 10 - 3 - 2 - 2 = Rs. 3 for 120 gm of Potato. Use the unitary method to solve it further

120 gm = Rs. 3
1000 gm = Rs((3/120)*1000) = Rs. 25 per Kg

Roadside vendors mostly buy potatoes at the wholesale rate. So, you can also assume that the normal market rate of one kilogram of potato is about 1.2 times the wholesale rate

Rs.(25*1.2) = Rs. 30 per Kg
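The whole guesstimate fits in a few lines of Python, which makes it easy to change an assumption and see how the answer moves. Every number below is one of the assumptions made above; only the variable names are mine.

```python
# Assumptions taken from the guesstimate
samosa_price = 10.0        # Rs., roadside vendor
samosa_weight_g = 150
flour_weight_g = 30
profit_share = 0.30
ingredients_share = 0.20   # flour, oil, spices and gas
operations_share = 0.20    # staff salary and rent
retail_markup = 1.2        # market rate relative to the wholesale rate

# Whatever is left of the price after the other parts is spent on potato
potato_weight_g = samosa_weight_g - flour_weight_g
potato_cost = samosa_price * (1 - profit_share - ingredients_share - operations_share)

wholesale_per_kg = potato_cost / potato_weight_g * 1000
retail_per_kg = wholesale_per_kg * retail_markup

print(f"Wholesale: Rs. {wholesale_per_kg:.0f} per kg")
print(f"Market   : Rs. {retail_per_kg:.0f} per kg")
```

Try changing profit_share to 0.20 or 0.40 to see how sensitive the final number is to the profit assumption.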

Enjoy your samosa /_\

Happy Learning

The Data Monk

The measure of Spread in layman terms

Data Science is a combination of Statistics and Technology. In this article, we will try to understand some basic terminologies in Layman’s language.

Suppose I run a chain of Pizza outlets across Bangalore and have around 500 delivery boys. We have assured “less than 30 minutes delivery time” to our customers, but while going through the feedback forms, we get the feeling that the delivery executives are taking more than the promised time.

NULL hypothesis – The delivery time is less than 30 minutes. It is represented by Ho

Alternate Hypothesis – The delivery time is not less than 30 minutes or it is more than 30 minutes. It is represented as Ha.

We mostly try to test the Null hypothesis and see whether it’s true.

Population – Your total population is 500, which is the number of delivery boys

Sample – It’s not feasible to test the delivery time of each delivery boy, so we randomly take a small fragment of the population which is termed as Sample

You must have heard the term that a ‘p-value of 0.05 is good’, but what does that actually mean?

The p-value helps you determine the significance of your result in a hypothesis test. It is the probability of seeing a result at least as extreme as the one in your sample, assuming that the Null Hypothesis is true. So, when the p-value of the test is less than 0.05, you are saying “There is strong evidence against the Null Hypothesis and you can reject it”

Similarly, when the p-value is more than 0.05, the evidence against the Null Hypothesis is weak and we fail to reject it.

In a layman’s term, if the hypothesis testing results in p-value less than 0.05 for the case mentioned above then we will be rejecting the null hypothesis by saying that the average amount of time to deliver a pizza is more than 30 minutes.
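To see a p-value come out of actual numbers, here is a minimal sketch of a one-sided z-test on the delivery times. It assumes the normal approximation is acceptable for the sample (for small samples a t-test is the usual choice), and the delivery times and the function name are made up.

```python
import math
import statistics
from statistics import NormalDist

def one_sided_p_value(sample, claimed_mean):
    """p-value for Ha: true mean > claimed_mean, using a normal approximation."""
    n = len(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    z = (statistics.mean(sample) - claimed_mean) / standard_error
    return 1.0 - NormalDist().cdf(z)

# Hypothetical delivery times (minutes) for a sample of 30 deliveries
late_sample = [34, 31, 36, 29, 33, 35, 32, 37, 30, 34,
               33, 36, 31, 35, 32, 34, 38, 30, 33, 35,
               31, 36, 34, 32, 33, 37, 29, 35, 34, 32]

# A second hypothetical sample whose average is exactly 30 minutes
on_time_sample = [29, 31, 30, 28, 32, 30, 29, 31, 30, 30]

p_late = one_sided_p_value(late_sample, 30)
p_on_time = one_sided_p_value(on_time_sample, 30)

print("late sample   : reject Ho" if p_late < 0.05 else "late sample   : fail to reject Ho")
print("on-time sample: reject Ho" if p_on_time < 0.05 else "on-time sample: fail to reject Ho")
```

The first sample averages well above 30 minutes, so its p-value is tiny and Ho is rejected. The second sample gives a p-value of 0.5, which is no evidence against Ho at all.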

You must have got a fair bit of idea about population, sample, null hypothesis, alternate hypothesis, and p-value.

Let’s get back to sampling. There are four methods to get a segment out of a population i.e. sampling of a population:-

a. Random Sampling – Completely random selection
b. Systematic Sampling – A systematic way to sample a population, like taking the kth record from the population
c. Stratified Sampling – Suppose the complete population is divided into multiple groups, so stratified sampling will take a sample from each group. This reduces the bias of the sample.
If we have a data set of people of different age groups then a random sample might be biased towards a particular group. But, stratified sampling takes care of this
d. Cluster – When a population is divided into different clusters, we randomly pick a few whole clusters and take our sample only from the chosen clusters
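The four methods can be sketched with nothing more than Python's random module. The population below is made up: 500 delivery executives spread evenly over four hypothetical zones, with a target sample size of 40. Cluster sampling is shown in its usual textbook form, where whole clusters are picked at random.

```python
import random

random.seed(42)  # fixed seed so the sketch gives the same result on every run

# Hypothetical population: 500 delivery executives, each tagged with a zone
zones = ["North", "South", "East", "West"]
population = [{"id": i, "zone": zones[i % 4]} for i in range(500)]

# a. Random sampling: every record has the same chance of being picked
random_sample = random.sample(population, 40)

# b. Systematic sampling: every k-th record, starting from a random offset
k = len(population) // 40
start = random.randrange(k)
systematic_sample = population[start::k][:40]

# c. Stratified sampling: the same number of records from every zone
stratified_sample = []
for zone in zones:
    members = [p for p in population if p["zone"] == zone]
    stratified_sample.extend(random.sample(members, 10))

# d. Cluster sampling: pick 2 whole zones at random and keep everyone in them
chosen_zones = random.sample(zones, 2)
cluster_sample = [p for p in population if p["zone"] in chosen_zones]

print(len(random_sample), len(systematic_sample),
      len(stratified_sample), len(cluster_sample))
```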

We have the data set i.e. the population and we have taken a sample from it.
Now, we need to know the spread of the sample or the population.

I assume that you already know about mean, median, mode, etc.

The measure of spread describes how similar or varied a set of observed values are for a variable. The measure of spread includes:-

a. Standard Deviation
b. Variance
c. Range
d. Inter Quartile Range
e. Quartile

You can easily find a copy-book definition on the internet. Let’s try to understand it in simple terms.

Mean gives you an idea of the average of the data.
Suppose the average salary of the 5000 employees at Walmart is $100,000.

The variance will give you an idea about the spread of the salary i.e. how far your data points are from the mean. We calculate Variance on either the complete population or the sample.

Population Variance = Sum of (x - population mean)^2 / N
Sample Variance = Sum of (x - sample mean)^2 / (N-1)

Both the formulas are almost the same, the only difference is the denominator. If you just want to memorize the formulas then also it’s fine. But, to understand the denominator, you need to go through the concept of degree of freedom. Let’s try

Degree of freedom is nothing but the number of observations in the data that are free to vary when estimating a parameter. In simple words, if you know the mean or average of 5 numbers, then all you need to know is 4 numbers and you can easily get the 5th number. Here the degree of freedom is n-1 i.e. 4. This is an example of the degree of freedom.

Now, why does Population variance have N and Sample variance have N-1 as the denominator?

When we have a population size of 1000 and we have to calculate Population variance then we already know the mean of the Population, thus we divide by N.

Read this out loud – When we only know the mean of the sample, then we divide the value by N-1 to “compensate for the fact that we don’t have concrete information about the population, thus we try to keep the overall value larger by dividing it by N-1”
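The difference between the two denominators is easy to see in code. Python's statistics module has pvariance (divides by N) and variance (divides by N-1). The salary figures below are made up and kept small so that you can verify the numbers by hand.

```python
import statistics

# Hypothetical salaries in thousand dollars
salaries = [90, 95, 100, 105, 110]

n = len(salaries)
mean = sum(salaries) / n
squared_deviations = sum((s - mean) ** 2 for s in salaries)

# Same numerator, different denominators
population_variance = squared_deviations / n        # treat the 5 values as the whole population
sample_variance = squared_deviations / (n - 1)      # treat the 5 values as a sample

# The standard library gives the same answers
assert population_variance == statistics.pvariance(salaries)
assert sample_variance == statistics.variance(salaries)

print(population_variance, sample_variance)
```

The sample variance always comes out larger than the population variance on the same numbers, and the gap shrinks as N grows.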

Quartile is a number and not a group of values. It’s more like a cut-off value. You have 3 quartiles in statistics

Quartile 1 – Also called 25th percentile. This is the middle number between the smallest number and the median of the data set.

Quartile 2 – Also called the median and the 50th percentile

Quartile 3 – Also called the 75th percentile and upper quartile. This is the middle value between the median and the highest value of the data set

To find the quartiles of a data set, use the following steps:

  1. Order the data from least to greatest.
  2. Find the median of the data set and divide the data set into two halves.
  3. Find the median of the two halves.

Interquartile Range = Q3-Q1
It is the midspread, i.e. the range covered by the middle 50% of the dataset. It’s also a single value and not a group of numbers

Bonus knowledge

How to identify an outlier?

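A basic rule is that whenever a data point is more than 1.5 times the Interquartile Range above the third quartile (greater than Q3 + 1.5*IQR) or more than 1.5 times the Interquartile Range below the first quartile (less than Q1 - 1.5*IQR), then it’s termed as an outlier.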

Look at one example and it will make sense

Numbers – 2,4,5,7,9,11,13

Median = 4th term as we have 7 terms and the numbers are arranged in ascending order. Thus median(Q2) is 7

Quartile 1(Q1) = Median of the dataset containing the lower half of the data i.e. calculate the median of 2,4,5. Thus Q1 will be the 2nd term i.e. 4

Quartile 3(Q3) = Median of the upper half of the data i.e. median of 9,11,13. Thus Q3 is 11

(2,4,5),7,(9,11,13) ~ (Q1),Q2,(Q3)

Inter Quartile Range = Q3-Q1 = 11-4 = 7
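The same steps can be written as a short Python sketch. It uses the "median of each half" method from the example, and flags outliers with the commonly used fences of 1.5 times the IQR beyond Q1 and Q3. The function names are mine, and library routines that interpolate between values may give slightly different quartiles.

```python
import statistics

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method. Needs at least 2 values."""
    data = sorted(values)
    half = len(data) // 2
    lower_half = data[:half]    # for an odd count the median itself is left out
    upper_half = data[-half:]
    return (statistics.median(lower_half),
            statistics.median(data),
            statistics.median(upper_half))

def find_outliers(values):
    """Values lying more than 1.5 * IQR below Q1 or above Q3."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [v for v in values if v < low_fence or v > high_fence]

numbers = [2, 4, 5, 7, 9, 11, 13]
q1, q2, q3 = quartiles(numbers)
iqr = q3 - q1

print(q1, q2, q3, iqr)               # 4 7 11 7
print(find_outliers(numbers))        # []
print(find_outliers(numbers + [40])) # [40]
```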

Keep Learning

The Data Monk