
# Category Archives: Statistics gyan

## Data set and type of data sets for modeling

Q.) What is a data set?

A.) A data set is the complete body of data you use for your project. It often combines data from multiple databases and tables. For modeling, the data set is usually divided into two parts: a train dataset and a test dataset.

Q.) What is a train dataset?

A.) When you build a model, you use part of the dataset to train it. The train dataset gives the model examples to learn from, so that it behaves consistently on new data.

For example, if you have restaurant data for the last 13 months, that is your complete data set. You can divide it in roughly an 80:20 ratio and take the first 11 months of data for training. You build the model on this training portion only.

Q.) What is a test dataset?

A.) The remaining 20% or so of the data is used to test your model before it goes into production. Continuing the example above, the last 2 of the 13 months are not used to train the model. Instead, the model's predictions for those 2 months are compared against the already known values to check how effective the algorithm is.
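The 13-month example can be sketched in a few lines of Python. This is only an illustration: the sales figures and the helper name `chronological_split` are made up, and the split is done by time order (oldest months for training, newest for testing), as the restaurant example describes.

```python
# Hypothetical monthly sales for 13 months (made-up numbers for illustration).
monthly_sales = [120, 135, 128, 150, 160, 155, 170, 165, 180, 175, 190, 200, 210]


def chronological_split(data, n_test):
    """Keep the last n_test observations for testing; train on everything before them."""
    if not 0 < n_test < len(data):
        raise ValueError("n_test must be between 1 and len(data) - 1")
    return data[:-n_test], data[-n_test:]


# 11 months for training, the last 2 months held back for testing.
train, test = chronological_split(monthly_sales, n_test=2)
print(len(train), "training months;", len(test), "test months")
```

Because the data is a time series, the split is not shuffled: the model should never see months that come after the ones it is asked to predict.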

Q.) What comes under implementation of models?

A.) Implementation mainly means understanding the requirements of the stakeholders and molding the model to meet the business need. For example, you can build a forecasting algorithm in R, but you might then have to surface it in Power BI (Microsoft's business intelligence tool) to make it easier to consume, or develop an app around it.

Q.) Why does data cleaning play an important role?

A.) We are back to cleaning data. I once participated in a Kaggle competition that required applying different text analytics algorithms to find the sentiment of a text. I had done a similar project in the past on clean data and had the code ready. Even so, it took me almost a couple of days to clean the data and only a couple of hours to run the model.

Cleaning is important because you won't get a good result on a dirty dataset. You might even reject a particular algorithm because it does not give the expected result, when the algorithm was fine and the unclean data was ruining the outcome.


## Statistics gyan

Q.) What is a confusion matrix?

A.) For a yes/no prediction problem, a confusion matrix is a 2×2 matrix whose cells are labeled with True/False and Positive/Negative. It is typically used in the prediction world to understand the effectiveness of an algorithm. In each label, the second word (Positive or Negative) is what the model predicted, and the first word (True or False) says whether that prediction matched the actual value. So in True-Positive, Positive means the model predicted yes and True means the prediction was correct.

| | Predicted positive | Predicted negative |
|---|---|---|
| Actual positive | True Positive | False Negative |
| Actual negative | False Positive | True Negative |

Q.) What is True-Positive?

A.) This means the actual value is positive and the predicted value is also positive. For example, suppose we use a model to predict whether a disease is present in a patient. The first block of the matrix counts the cases for which we predicted yes and the patients actually were suffering from the disease.

For a predictive model or a classifier, this value should be high.

Q.) What is True-Negative?

A.) This means the actual value is negative and the predicted value is also negative. For example, we predicted that a patient is not suffering from a disease, and the patient is indeed found not to be suffering from it.

For a predictive model or a classifier, this value should also be high.

Q.) What is False-Positive?

A.) This is also known as a Type I error. Here we predicted yes, but the patient does not actually have the disease.

This indicates an error in your algorithm, and since we almost always deal with sensitive data, this value should be as low as possible. Suppose we predicted that a patient is suffering from diabetes, the doctor prescribed treatment based on our algorithm, and it was later found that the patient did not have diabetes. That would raise serious concern.

Q.) What is False-Negative?

A.) This is also known as a Type II error. Here we predicted no, but the patient actually has the disease.

This is the major concern for an algorithm. No matter how accurate the model is overall, if the number of false negatives is high, the model should not be introduced.

Suppose we predicted that a patient is not suffering from cancer and later found out that the patient was. Then the algorithm is of little use.
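To make the four cells concrete, here is a minimal sketch that counts them from two lists. The patient labels are invented for illustration and `confusion_counts` is a hypothetical helper, not a library function.

```python
def confusion_counts(actual, predicted):
    """Count TP, TN, FP and FN for boolean labels (True = has the disease)."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if p and a:
            counts["TP"] += 1  # predicted yes, actually yes
        elif p and not a:
            counts["FP"] += 1  # predicted yes, actually no (Type I error)
        elif not p and a:
            counts["FN"] += 1  # predicted no, actually yes (Type II error)
        else:
            counts["TN"] += 1  # predicted no, actually no
    return counts


# Made-up labels for 8 patients.
actual = [True, True, True, False, False, False, False, True]
predicted = [True, False, True, False, True, False, False, True]

counts = confusion_counts(actual, predicted)
print(counts)  # {'TP': 3, 'TN': 3, 'FP': 1, 'FN': 1}
```

Here the model is right for 6 of 8 patients, yet it still misses one patient who has the disease, which is exactly the false-negative case to worry about.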

Let me know if you need more examples to understand this.

## Basic statistics terms definitions in layman language

A **population** is any specific collection of objects of interest.

A **sample** is any subset or subcollection of the population, including the case that the sample consists of the whole population, in which case it is termed a census.

A **measurement** is a number or attribute computed for each member of a population or of a sample. The measurements of sample elements are collectively called the sample data.

**“N”** is usually used to indicate the number of subjects in a study. Example: If you have 76 participants in a study, N=76.

A **parameter** is a number that summarizes some aspect of the population as a whole. A **statistic** is a number computed from the sample data.

**Quantitative data** are numerical measurements that arise from a natural numerical scale.

**Statistics** is a collection of methods for collecting, displaying, analyzing, and drawing conclusions from data.

**Correlation –** It is the degree to which two factors appear to be related. Correlation should not be confused with causation. Just because two factors are reported as being correlated, you cannot say that one factor causes the other. For example, you might find a correlation between going to the library at least 40 times per semester and getting high scores on tests. However, you cannot say from these findings what about going to the library, or what about people who go to libraries often, is responsible for higher test scores.

**Median** – The score that divides the sorted results in half: the middle value. With an even number of results, it is the average of the two middle values.

**Descriptive statistics** is the branch of statistics that involves organizing, displaying, and describing data.

**Inferential statistics** is the branch of statistics that involves drawing conclusions about a population based on information contained in a sample taken from that population.

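**r-value** is the way in which correlation is reported statistically (a number between -1 and +1). As a rough rule of thumb, the absolute value of r should be above about 0.3 before a correlation is reported as meaningful. Whether it is statistically significant also depends on the sample size.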

**Qualitative data** are measurements for which there is no natural numerical scale, but which consist of attributes, labels, or other nonnumerical characteristics.
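Two of the terms above, the median and the r-value, are easy to compute by hand. The sketch below uses Python's standard `statistics` module for the median and a hand-written `pearson_r` helper (a hypothetical name) for the correlation. The library-visit and test-score numbers are invented to echo the library example.

```python
import math
import statistics


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 values each")
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return covariation / math.sqrt(spread_x * spread_y)


# Made-up sample data: library visits per semester and test scores.
library_visits = [5, 12, 20, 33, 41, 48]
test_scores = [52, 60, 58, 71, 80, 85]

median_score = statistics.median(test_scores)  # average of the two middle values
r = pearson_r(library_visits, test_scores)
print("median score:", median_score)
print("r-value:", round(r, 2))
```

A high r here would only say that visits and scores move together in this sample. It would not say that library visits cause higher scores.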


## What is the difference between data, information, knowledge and wisdom?

Hi Monk,

The aim of The Data Monk is to help people see that statistics is not rocket science and that you don't need an economics or statistics degree to make sense of data.

**Remember – Even if your instinct says that next month's sales will be double last month's and you are 110% sure, you still need to back up your claim with maths to gain the trust of your manager, your client and, most importantly, yourself.**

If you are looking to make a career in data science, then don't lose touch with statistics.

**What is data?**

It's really important to understand the definition rather than just memorize it. Data is any number or character.

For example: 51, greatest, batsman.

**What is information?**

Information is something which you infer from data.

Example – Sachin Tendulkar has scored 51 Test centuries and is one of the greatest batsmen in the world.

Here we took some data points and combined them into a meaningful sentence.

**What is knowledge?**

Knowledge is the mixture of data and information. It helps you understand information better.

Example – Sachin is one of the best batsmen of all time. This is your knowledge, drawn from the fact that he has scored 51 Test centuries.

**What is wisdom?**

Wisdom is your synthesis: what you understand from data, information and your knowledge.

Example – When India is chasing a large target, ask Sachin to open the innings, and when India is batting first on a fast wicket, ask him to come in at number 4. This is wisdom.

**Knowledge** is the accumulation of facts and information. **Wisdom** is the synthesis of **knowledge** and experiences into insights that deepen one’s understanding of relationships and the meaning of life. In other words, **knowledge** is a tool, and **wisdom** is the craft in which the tool is used.

If you have any questions along these lines, do let us know. And remember, Sachin is one of the greatest cricketers of all time.

Thanks,

The Data Monk