If you are into Data Science, then you must have heard about the p-value. I could have started with a very superficial definition strolling around probability, significance, the null hypothesis, etc. But that’s already there on multiple blogs. We want to simplify this term so that you “understand” it rather than memorize it. We will start with the null hypothesis. What is a null hypothesis?
So, Nitin was the monitor of Class VIII B. He had one job, i.e. to write down the names of those classmates who made noise in the absence of the teacher.
One day he wrote Tahseen’s name on the blackboard. The teacher asked Tahseen whether he had been making noise. As usual, Tahseen denied it. Now, the teacher had to believe either the monitor or Tahseen.
He assumed that Tahseen did not make the noise. Why? Because it’s easier to disprove this.
See, it’s always easier to disprove something with an example than to prove it. For example, if the teacher catches Tahseen making noise, then the null hypothesis, i.e. Tahseen did not make the noise, will be disproved.
But if we take the null hypothesis as “Tahseen made noise” and you did not catch him making noise on one occasion, that does not mean that the null hypothesis is proved.
Coming back to the question, the teacher had these two hypotheses:
Null hypothesis – Tahseen did not make the noise
Alternate hypothesis – Tahseen made noise
The next day Nitin again complained that Tahseen was making noise, which Tahseen again denied.
On the next three days too, his name was written on the blackboard. Now the teacher has reached a threshold where he can say with confidence, “Dude, you were making noise because you have reached a benchmark of complaints and it is statistically significant to prove that my null hypothesis was wrong. Thank you Nitin :)”
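The benchmark, set before starting the experiment, is called the significance level. The p-value is the probability of seeing evidence at least this strong (here, this many complaints) if the null hypothesis were true. When the p-value falls below the benchmark, the result is statistically significant.
In general, a p-value < 0.05 is treated as statistically significant. It means that if the null hypothesis were true, evidence this strong would show up less than 5% of the time, so we reject the null hypothesis.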
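As a rough sketch of the classroom story in numbers, assume (purely for illustration, this figure is not part of the story) that an innocent student's name ends up on the board by mistake with probability 0.3 on any given day, independently of other days. The p-value is then the chance of being named on all five days if Tahseen really was innocent. The function name `binomial_p_value` is made up for this example.

```python
from math import comb

def binomial_p_value(successes, trials, p_null):
    """Chance of at least `successes` hits in `trials` days if the null is true."""
    return sum(
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )

# Assumed, illustrative numbers: 5 school days, named on all 5,
# and a 0.3 daily chance that an innocent student gets named by mistake.
ALPHA = 0.05  # significance level, fixed before the experiment starts
p_value = binomial_p_value(successes=5, trials=5, p_null=0.3)

print(f"p-value = {p_value:.5f}")  # 0.3 ** 5, i.e. about 0.00243
print("reject null" if p_value < ALPHA else "fail to reject null")
```

With these assumed numbers the p-value is well below 0.05, so the teacher rejects "Tahseen did not make the noise". With only two complaints the same calculation would not cross the benchmark.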
I have appeared for a ton of interviews and it’s very hard to dodge this question.
What is a Confusion Matrix? A confusion matrix is a technique for measuring the performance of an ML classification model.
Why do we need a Confusion Matrix? Is measuring accuracy not enough? A confusion matrix shows how your model is actually performing. For example, suppose I want to classify whether a person is suffering from a very rare disease (1 in 100,000). Even if I build a very bad model that labels everyone as ‘not suffering’ from the disease, the accuracy of the model will still be above 99.99%. But that model is of no use, because it never finds a single patient and so does not solve the classification problem. Here comes the confusion matrix, which is a 2×2 matrix of predicted and actual values.
Here the columns denote the predicted values and the rows denote the actual values.
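To see how accuracy can mislead for a rare disease, here is a minimal sketch with a made-up population of 100,000 people, exactly one of whom is sick, and a lazy model that labels everyone as healthy. The data is invented for illustration.

```python
# Hypothetical population: 100,000 people, exactly 1 of whom has the disease.
actual = [1] + [0] * 99_999   # 1 = has the disease, 0 = healthy
predicted = [0] * 100_000     # lazy model: everyone is labelled "healthy"

correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)

# Recall: the share of actual patients that the model managed to find.
true_positives = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
recall = true_positives / sum(actual)

print(f"accuracy = {accuracy:.5f}")  # 0.99999
print(f"recall   = {recall:.1f}")    # 0.0
```

The accuracy looks excellent, yet the model misses the only patient, which is exactly what the accuracy number hides.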
Take the example of a fire alarm:
True Positive – The model predicts a fire and there actually is a fire in the building. That’s fine.
False Negative – There is an ‘actual’ fire in the building but the model says there is none, so the alarm does not ring. This is catastrophic. The same goes for the disease example, i.e. the person is infected but the model is unable to identify it. This is a Type II error.
False Positive – The building is not on fire but the model suggests that it is. This is still acceptable; we can live with it. Example: the person is not infected by the virus but your model suggests that they are. You will go for a few check-ups and will confirm that you are safe :). This is a Type I error.
True Negative – No fire, no alarm – All chill.
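The four cases can be counted directly from paired actual/predicted labels. Below is a small sketch using invented fire-alarm data for eight days; the function name `confusion_counts` and the data are made up for this example.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Count TP, FN, FP, TN for binary labels (True = fire, False = no fire)."""
    names = {
        (True, True): "TP",
        (True, False): "FN",
        (False, True): "FP",
        (False, False): "TN",
    }
    counts = Counter(names[(a, p)] for a, p in zip(actual, predicted))
    return {name: counts[name] for name in ("TP", "FN", "FP", "TN")}

# Made-up data for eight days: was there a fire, and did the alarm ring?
fire  = [True, True, False, False, False, True, False, False]
alarm = [True, False, True, False, False, True, False, False]

cells = confusion_counts(fire, alarm)

# Rows are actual values, columns are predicted values.
print("           alarm  no alarm")
print(f"fire       {cells['TP']:>5}  {cells['FN']:>8}")
print(f"no fire    {cells['FP']:>5}  {cells['TN']:>8}")
```

For this data there are 2 true positives, 1 false negative (a missed fire), 1 false positive (a false alarm) and 4 true negatives.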
Accuracy = (TP+TN)/(TP+TN+FP+FN)
Precision = TP/(TP+FP), i.e. the accuracy of the positive predictions
Recall (Sensitivity) = TP/(TP+FN), i.e. coverage of actual positive results
Specificity = TN/(TN+FP), i.e. coverage of actual negative results
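Here is a short sketch of the standard definitions of these four metrics, computed from the cells of a 2×2 confusion matrix. The counts are invented, the function name `classification_metrics` is made up, and the sketch does not guard against a zero denominator.

```python
def classification_metrics(tp, fn, fp, tn):
    """Return accuracy, precision, recall and specificity from the four cells."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),      # how many predicted positives are real
        "recall": tp / (tp + fn),         # how many real positives were found
        "specificity": tn / (tn + fp),    # how many real negatives were found
    }

# Made-up counts for 1,000 cases.
metrics = classification_metrics(tp=40, fn=10, fp=20, tn=930)

for name, value in metrics.items():
    print(f"{name:<12} {value:.3f}")
```

With these counts the accuracy is 0.97 while the precision is only about 0.667, which shows why a single accuracy number is not enough.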
When is precision more important than recall? Suppose there is a zombie apocalypse and you are letting people into a safe zone. You want to let in as many normal people as possible, but even a single infected person inside is dangerous. Here a positive prediction means “normal, let them in”, so you look for high precision, i.e. as few false positive cases as possible.
We often come across a few terms which sound alike but are poles apart. The same goes for Data Science, Big Data, Data Analytics, and Business Analyst. So if you are confused about the role which an employer is offering you, then this article is for you.
Data Science vs Big Data vs Data Analytics vs Business Analyst
Data Science deals with a lot of mathematics. This domain makes sure that you are sound in statistics and model implementation.
Requirement – Good at mathematics, completely hands-on with Python/R, expertise in at least a couple of algorithms (Predictive modelling, NLP, Clustering, Neural Networks, etc.). A degree in Mathematics/Statistics definitely helps. One of the best Data Scientists of India, Rohan Rao, did his master's in Statistics at IIT Bombay.
Things to do to become a Data Scientist – Concentrate on algorithms and hackathons. Make your own winning combination and don’t forget to use XGBoost 😛
Big Data Specialist – Big data is a humongous amount of data stored in one place. A big data specialist knows which technologies would collapse in the future as the data grows. He/She builds scalable infrastructure to cater to high volumes of data.
Requirement – A Big Data Specialist should have a good amount of experience in handling multi-TB-per-day data. This definitely comes with experience and you can’t learn it in a classroom course. A Big Data Specialist should have sound knowledge of building data pipelines and deploying the algorithms/solutions built by Data Scientists, and should make the life of the Data Analyst easier 😛
Data Analyst – A Data Analyst works on providing valuable insights to the business. SQL is the bread and butter of a Data Analyst. He is responsible for writing optimised and efficient code to cater to business requests.
A Data Analyst should have a decent knowledge of Data Science algorithms, which helps in understanding the data and providing meaningful insights. A little mathematics never hurts.
Business Analyst – The job of a business analyst is to consume the solutions provided by the DA, DS, and BDS. He should have a decent knowledge of SQL and MS Excel in order to crunch the numbers. Above all, he should be able to consume the insights and take decisions based on the data.
Requirement – Knack for solving complex business problems, SQL, MS Excel, and good communication skills
Salary wise (Person with 3 years of experience)
Data Science > Big Data Specialist > Data Analyst ~ Business Analyst (This is too subjective and highly debatable)
Do look into the Job Description and Profile offered before going for an interview 🙂