Supervised Learning Questions. This is the fourth part of the Supervised Learning Questions series. Do go through the first three parts:
Part 1 – https://thedatamonk.com/supervised-learning-interview-questions/
Part 2 – https://thedatamonk.com/supervised-learning-interview-question/
Part 3 – https://thedatamonk.com/supervised-learning-data-science-questions/
1. We will try to build a basic KNN model with 5 neighbors. Write the code for the same.
from sklearn.neighbors import KNeighborsClassifier
# X and y are assumed to be the training features and labels; X_test is the held-out set
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X, y)
y_pred = classifier.predict(X_test)
2. Let’s create a confusion matrix to see how the model performed.
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
As you can see, the accuracy of the model is 100%. This is because the training and test datasets are very small. Only once you try your hand at a larger dataset will you be able to check the performance of other models.
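The two snippets above can be combined into one runnable sketch. The iris dataset, the 70/30 split and random_state=42 are assumptions made for illustration; they are not the data used in this series.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: iris stands in for the dataset used in this series
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train)      # learn only from the training split
y_pred = classifier.predict(X_test)   # predict on data the model has not seen

cm = confusion_matrix(y_test, y_pred)  # rows: actual class, columns: predicted class
print(cm)
print(classification_report(y_test, y_pred))
```

Keeping the test rows out of fit() is what makes the reported scores an honest estimate.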
3. Don’t get confused between KNN and K-means algorithm. What is the difference between the two?
K-Nearest Neighbors is a supervised classification algorithm, while k-means clustering is an unsupervised clustering algorithm. While the mechanisms may seem similar at first, what this really means is that for K-Nearest Neighbors to work, you need labeled data against which to classify an unlabeled point (thus the nearest neighbor part). K-means clustering requires only a set of unlabeled points, the number of clusters k and a convergence threshold: the algorithm repeatedly assigns each point to its nearest cluster center and then moves each center to the mean of the points assigned to it, gradually learning how to group the points.
The critical difference here is that KNN needs labeled points and is thus supervised learning, while k-means doesn’t and is thus unsupervised learning.
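A minimal sketch of that difference, using made-up 2-D points (the data, k=3 and n_clusters=2 are assumptions for illustration): KNN is fitted on points and labels, k-means on points alone.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

# Two well-separated blobs of made-up 2-D points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])
y = np.array([0] * 20 + [1] * 20)   # labels, needed only by KNN

# Supervised: KNN is fitted on points AND labels
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
knn_pred = knn.predict([[0.2, -0.1], [4.8, 5.3]])

# Unsupervised: k-means sees only the points and the number of clusters
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
cluster_ids = kmeans.labels_   # cluster ids are arbitrary, they are not class labels

print(knn_pred)
print(cluster_ids)
```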
After going through the above algorithms, you must have got a good idea about implementing these algorithms on different datasets provided in various Hackathons. Go through Kaggle or Analytics Vidhya to participate in Live and Past Hackathons.
Apart from the implementation of these 5 algorithms, we also need to know the evaluation metrics, some definitions used in supervised learning, etc. All of these are covered below.
4. What is ROC curve?
The ROC curve is a plot of the true positive rate against the false positive rate at various classification thresholds. It’s often used as a proxy for the trade-off between the sensitivity of the model (true positives) vs the fall-out or the probability it will trigger a false alarm (false positives).
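As a small sketch, scikit-learn's roc_curve returns the false positive rate and true positive rate at each threshold, and roc_auc_score summarises the curve as one number. The four labels and scores below are made up for illustration.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up true labels and predicted scores for four samples
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# One (fpr, tpr) point per threshold; plotting tpr against fpr gives the ROC curve
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)   # area under that curve

for f, t in zip(fpr, tpr):
    print(f"FPR={f:.2f}  TPR={t:.2f}")
print("AUC:", auc)
```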
5. Define Precision and Recall.
Recall is also known as the true positive rate: the number of positives your model correctly identifies compared to the actual number of positives there are throughout the data. Precision is also known as the positive predictive value: the number of correct positives your model claims compared to the total number of positives it claims. It can be easier to think of recall and precision in the context of a case where you’ve predicted that there were 10 apples and 5 oranges in a case of 10 apples. You’d have perfect recall (there are actually 10 apples, and you predicted there would be 10) but 66.7% precision because out of the 15 events you predicted, only 10 (the apples) are correct.
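The apple example can be checked with scikit-learn. The 0/1 encoding below is one hypothetical way to write it down: 15 positive predictions, of which 10 are real apples.

```python
from sklearn.metrics import precision_score, recall_score

# 1 = apple, 0 = not an apple (hypothetical encoding of the example above)
y_true = [1] * 10 + [0] * 5   # 10 real apples, 5 predictions with no apple behind them
y_pred = [1] * 15             # the model claims 15 positives in total

precision = precision_score(y_true, y_pred)   # correct positives / claimed positives
recall = recall_score(y_true, y_pred)         # correct positives / actual positives
print(round(precision, 3), recall)
```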
6. What is the difference between L1 and L2 Regularization?
L2 regularization tends to spread error among all the terms, shrinking every weight a little without eliminating any, while L1 is more binary/sparse: many weights are driven to exactly 0, so the corresponding variables drop out of the model. L1 corresponds to setting a Laplace prior on the terms, while L2 corresponds to a Gaussian prior.
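A rough way to see the sparsity difference is to fit Lasso (L1) and Ridge (L2) on the same data and count the weights that are exactly zero. The synthetic dataset and alpha=1.0 are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: only 3 of the 10 features actually influence the target
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=1.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty

n_zero_l1 = int(np.sum(lasso.coef_ == 0))
n_zero_l2 = int(np.sum(ridge.coef_ == 0))
print("weights set exactly to zero with L1:", n_zero_l1)
print("weights set exactly to zero with L2:", n_zero_l2)
```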
7. What’s the difference between Type I and Type II error?
Type I error is a false positive, while Type II error is a false negative. Briefly stated, Type I error means claiming something has happened when it hasn’t, while Type II error means that you claim nothing is happening when in fact something is. A clever way to think about this is to think of Type I error as telling a man he is pregnant, while Type II error means you tell a pregnant woman she isn’t carrying a baby.
8. What’s the trade-off between bias and variance?
Bias is an error due to erroneous or overly simplistic assumptions in the learning algorithm you’re using. This can lead to the model underfitting your data, making it hard for it to have high predictive accuracy and for you to generalize your knowledge from the training set to the test set.
Variance is an error due to too much complexity in the learning algorithm you’re using. This leads to the algorithm being highly sensitive to high degrees of variation in your training data, which can lead your model to overfit the data. You’ll be carrying too much noise from your training data for your model to be very useful for your test data.
The bias-variance decomposition essentially decomposes the learning error from any algorithm into the sum of the squared bias, the variance and a bit of irreducible error due to noise in the underlying dataset. Essentially, if you make the model more complex and add more variables, you’ll lose bias but gain some variance. To get the optimally reduced amount of error, you’ll have to trade off bias and variance. You don’t want either high bias or high variance in your model.
9. How is a decision tree pruned?
Pruning is what happens in decision trees when branches that have weak predictive power are removed in order to reduce the complexity of the model and increase the predictive accuracy of a decision tree model. Pruning can happen bottom-up and top-down, with approaches such as reduced error pruning and cost complexity pruning.
Reduced error pruning is perhaps the simplest version: replace each node with its most popular class. If that doesn’t decrease predictive accuracy, keep it pruned. While simple, this heuristic actually comes pretty close to an approach that would optimize for maximum accuracy.
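scikit-learn does not ship reduced error pruning, but it does offer the cost complexity pruning mentioned above through the ccp_alpha parameter of its decision trees. The sketch below compares an unpruned and a pruned tree; the breast cancer dataset and ccp_alpha=0.01 are arbitrary choices for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# ccp_alpha > 0 removes the weakest branches; 0.01 is an arbitrary value
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X_train, y_train)

for name, tree in [("full", full_tree), ("pruned", pruned_tree)]:
    print(name, tree.tree_.node_count, "nodes, test accuracy",
          round(tree.score(X_test, y_test), 3))
```

In practice ccp_alpha would be tuned with cross-validation rather than fixed by hand.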
10. Which is more important to you: model accuracy, or model performance?
Model accuracy can be a very misleading measure for judging a model. A model could be useless even with 99% accuracy. Suppose you are building a model to detect whether a patient is infected with a very rare disease. If you tag every patient as “not infected”, the model will have more than 99% accuracy, yet it is not useful at all because it never finds an infected patient. So model performance, measured with metrics suited to the problem, is the best way to judge the working of a model.
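The rare disease case can be reproduced in a few lines. The numbers below (5 infected patients out of 1000) are hypothetical.

```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical screening data: 5 infected patients (1) out of 1000
y_true = [1] * 5 + [0] * 995
y_pred = [0] * 1000   # a "model" that tags every patient as not infected

accuracy = accuracy_score(y_true, y_pred)   # share of correct predictions
recall = recall_score(y_true, y_pred)       # share of infected patients that were found
print("accuracy:", accuracy)
print("recall:", recall)
```

High accuracy and zero recall together show why accuracy alone says little on rare classes.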
11. What’s the F1 score? How would you use it?
The F1 score is a measure of a model’s performance. It is the harmonic mean of the precision and recall of a model, with results tending to 1 being the best, and those tending to 0 being the worst. You would use it in classification tests where true negatives don’t matter much.
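F1 is computed as 2 * precision * recall / (precision + recall). A short sketch on made-up labels, comparing the manual formula with scikit-learn's f1_score:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up labels: 3 true positives, 1 false negative, 1 false positive
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1_manual = 2 * precision * recall / (precision + recall)
f1 = f1_score(y_true, y_pred)
print(precision, recall, f1_manual, f1)
```

Note that true negatives appear nowhere in the formula.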
12. How would you handle an imbalanced dataset?
An imbalanced dataset is when you have, for example, a classification test and 90% of the data is in one class. That leads to problems: an accuracy of 90% can be skewed if you have no predictive power on the other category of data! Here are a few tactics to get over the hump:
1- Collect more data to even the imbalances in the dataset.
2- Resample the dataset to correct for imbalances
3- Try a different algorithm altogether on your dataset
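Tactic 2 could look like the sketch below, which oversamples the minority class with sklearn.utils.resample. The 90/10 data is made up, and oversampling is only one of several resampling options.

```python
import numpy as np
from sklearn.utils import resample

# Made-up imbalanced data: 90 samples of class 0, 10 of class 1
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

X_major, y_major = X[y == 0], y[y == 0]
X_minor, y_minor = X[y == 1], y[y == 1]

# Draw minority rows with replacement until both classes have the same size
X_minor_up, y_minor_up = resample(
    X_minor, y_minor, replace=True, n_samples=len(y_major), random_state=0
)

X_balanced = np.vstack([X_major, X_minor_up])
y_balanced = np.concatenate([y_major, y_minor_up])
print(np.bincount(y_balanced))   # samples per class after resampling
```

Resampling should be applied to the training split only, so that duplicated rows do not leak into the test set.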
What’s important here is that you have a keen sense for what damage an unbalanced dataset can cause, and how to balance that.
13. When should you use classification over regression?
Classification produces discrete values and maps the dataset to strict categories, while regression gives you continuous results that allow you to better distinguish differences between individual points. You would use classification over regression if you wanted your results to reflect the belongingness of data points in your dataset to certain explicit categories (e.g. if you wanted to know whether a name was male or female rather than just how correlated it was with male and female names).
14. Name an example where ensemble techniques might be useful.
Ensemble techniques use a combination of learning algorithms to achieve better predictive performance. They typically reduce overfitting in models and make the model more robust (unlikely to be influenced by small changes in the training data).
You could list some examples of ensemble methods, from bagging to boosting to a “bucket of models” method, and demonstrate how they could increase predictive power.
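As a rough illustration of the ensemble idea in question 14, the sketch below cross-validates a single decision tree and a random forest (a bagging-style ensemble of trees) on the same data. The dataset and settings are assumptions; results will differ on other data.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

single_tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)   # many bagged trees

# Mean accuracy over 5 folds for each model
tree_score = cross_val_score(single_tree, X, y, cv=5).mean()
forest_score = cross_val_score(forest, X, y, cv=5).mean()
print(f"single tree: {tree_score:.3f}, random forest: {forest_score:.3f}")
```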
15. How do you ensure you’re not overfitting with a model?
This is a simple restatement of a fundamental problem in machine learning: the possibility of overfitting training data and carrying the noise of that data through to the test set, thereby providing inaccurate generalizations.
There are three main methods to avoid overfitting:
1- Keep the model simpler: reduce variance by taking into account fewer variables and parameters, thereby removing some of the noise in the training data.
2- Use cross-validation techniques such as k-fold cross-validation.
3- Use regularization techniques such as LASSO that penalize certain model parameters if they’re likely to cause overfitting.
16. What evaluation approaches would you use to gauge the effectiveness of a machine learning model?
You would first split the dataset into training and test sets, or perhaps use cross-validation techniques to further segment the dataset into composite sets of training and test sets within the data.
You could use measures such as the F1 score, the accuracy, and the confusion matrix. What’s important here is to demonstrate that you understand the nuances of how a model is measured and how to choose the right performance measures for the right situations.
17. How do you handle missing or corrupted data in a dataset?
You could find missing/corrupted data in a dataset and either drop those rows or columns, or decide to replace them with another value. In Pandas, there are two very useful methods: isnull() and dropna() that will help you find columns of data with missing or corrupted data and drop those values. If you want to fill the invalid values with a placeholder value (for example, 0), you could use the fillna() method.
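A short sketch of the three pandas methods on a small made-up frame (the column names and placeholder values are hypothetical):

```python
import numpy as np
import pandas as pd

# Small made-up frame with one missing value in each column
df = pd.DataFrame({"age": [25, np.nan, 31, 40],
                   "city": ["Pune", "Delhi", None, "Mumbai"]})

missing_per_column = df.isnull().sum()              # count missing values per column
dropped = df.dropna()                               # drop rows containing any missing value
filled = df.fillna({"age": 0, "city": "unknown"})   # or replace them with placeholders

print(missing_per_column)
print(dropped)
print(filled)
```

Whether to drop or fill depends on how much data is missing and on what the column means.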
18. What happens when you take a large value for K in the KNN algorithm?
A large value of K makes the KNN algorithm computationally more expensive. It also means that each prediction is decided by a large neighborhood of points, which smooths the decision boundary and can make the model underfit.
19. What happens when you take a smaller value for K?
A small value of k means that noise will have a higher influence on the result.
20. How do you select the optimum value of k in KNN?
There are methods that are used to identify the optimal value of k in the KNN algorithm. The methods are:
a. Elbow method
b. Cross Validation method
21. What is a cross-validation method?
Cross-validation can be used to estimate the test error associated with a learning method in order to evaluate its performance, or to select the appropriate level of flexibility.
22. How does the KNN algorithm work?
KNN works by analogy. The idea is that you are what you resemble.
So when we want to classify a point we look at its K-closest (most similar) neighbors and we classify the point as the majority class in those neighbors.
KNN depends on two things: a metric used to compute the distance between two points, and the value of “k”, the number of neighbors to consider.
When “k” is a very small number, KNN can overfit: it will classify based on just the closest neighbors instead of learning a good separating frontier between classes. But if “k” is a very big number, KNN will underfit: in the limit, if k=n, KNN will assign every point to the class that has the most samples.
KNN can also be used for regression: just average the values of the k nearest neighbors of a point to predict the value for a new point.
One nice advantage of KNN is that it can work fine if you only have a few samples for some of the classes.
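The majority-vote idea can be sketched from scratch in a few lines. The helper knn_predict, the Euclidean metric and the toy points are illustrative assumptions; in practice you would use KNeighborsClassifier.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, point, k=3):
    """Classify `point` by majority vote among its k nearest training points."""
    # Euclidean distance from `point` to every training point
    distances = np.linalg.norm(np.asarray(X_train) - np.asarray(point), axis=1)
    nearest = np.argsort(distances)[:k]             # indices of the k closest points
    votes = Counter(y_train[i] for i in nearest)    # count the classes among them
    return votes.most_common(1)[0][0]


X_train = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y_train = ["red", "red", "red", "blue", "blue", "blue"]

print(knn_predict(X_train, y_train, [2, 2], k=3))
print(knn_predict(X_train, y_train, [7, 9], k=3))
```

For regression, the vote would be replaced by the average of the neighbors' values.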
23. What is the difference between binary classification and multi-class classification?
In a binary classification model we need to classify the output into only two classes, like Typhoid or Normal, Male or Female, Survived or Not, etc.
In multi-class classification we need to classify the output into more than two classes, like types of flowers or types of animals.
24. Write a program to do cross validation for KNN.
# X and y are assumed to be the feature matrix and labels loaded earlier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
# candidate values of K for KNN
myList = list(range(1, 50))
# subsetting just the odd ones (kept as a list so it can be reused after the loop)
neighbors = [k for k in myList if k % 2 != 0]
# empty list that will hold cv scores
cv_scores = []
# perform 10-fold cross validation for each K
for k in neighbors:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
    cv_scores.append(scores.mean())
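A self-contained version of the loop above, extended to pick the k with the lowest cross-validated error. The iris dataset is an assumption standing in for X and y.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Assumption: iris stands in for the X and y used above
X, y = load_iris(return_X_y=True)

neighbors = list(range(1, 50, 2))   # odd values of k from 1 to 49
cv_scores = []
for k in neighbors:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=10, scoring="accuracy")
    cv_scores.append(scores.mean())

# Misclassification error = 1 - accuracy; the best k minimises it
errors = [1 - s for s in cv_scores]
best_k = neighbors[errors.index(min(errors))]
print("best k:", best_k)
```

Plotting errors against neighbors gives the curve used by the elbow method.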
25. Identify the type of problem in each case:
a. If a person will purchase a new home or not – Classification
b. Number of bikes rented in a month – Regression
c. Whether a person has diabetes – Classification
26. What is the reshape function?
NumPy provides the reshape() function on the NumPy array object that can be used to reshape the data. The reshape() function takes the new shape of the array as its argument. It is common to need to reshape a one-dimensional array into a two-dimensional array with one column and multiple rows. It gives a new shape to an array without changing its data.
For example, an array of shape (123,) becomes (123, 1) if we use the code y.reshape(-1, 1).
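A quick sketch of that call; the array contents are made up:

```python
import numpy as np

y = np.arange(123)          # one-dimensional array, shape (123,)
y_col = y.reshape(-1, 1)    # -1 lets NumPy infer the number of rows

print(y.shape)
print(y_col.shape)
```

scikit-learn estimators expect a 2-D feature matrix, which is why a single feature is often reshaped this way.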
27. What is correlation? How can you find the correlation of variables on a data frame?
Correlation is a statistical measure that indicates the extent to which two or more variables fluctuate together. A positive correlation indicates the extent to which those variables increase or decrease in parallel; a negative correlation indicates the extent to which one variable increases as the other decreases.
In pandas, df.corr() returns the pairwise correlation between the columns of a data frame df.
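A minimal sketch with a made-up frame (the column names are hypothetical), showing one positive and one negative correlation:

```python
import pandas as pd

df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5],
    "score": [52, 54, 56, 58, 60],      # rises together with hours_studied
    "hours_of_tv": [10, 8, 6, 4, 2],    # falls as hours_studied rises
})

corr = df.corr()   # Pearson correlation by default
print(corr)
```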
28. How to explore a dataset in Python?
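There are multiple commands which can help you in exploring a data set. Following are a few pandas commands, called on a data frame df:
df.info() – column names, data types and non-null counts
df.describe() – summary statistics of the numeric columns
df.head() – the first five rows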
29. What is a loss function?
At its core, a loss function is incredibly simple: it’s a method of evaluating how well your algorithm models your dataset. If your predictions are totally off, your loss function will output a higher number. If they’re pretty good, it’ll output a lower number. As you change pieces of your algorithm to try and improve your model, your loss function will tell you if you’re getting anywhere.
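Mean squared error is one common loss function. The sketch below, with made-up numbers and a hand-written mse helper, shows the loss growing as predictions get worse.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean squared error: the average of the squared differences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


y_true = [3.0, 5.0, 7.0]
good_loss = mse(y_true, [3.0, 5.0, 8.0])    # predictions close to the truth
bad_loss = mse(y_true, [10.0, 0.0, 1.0])    # predictions that are far off
print(good_loss, bad_loss)
```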
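30. Why can’t we use 1000 folds to get the best accuracy?
More folds result in more computational expense: with k folds, the model is trained and evaluated k times. The sample code for cross_val_score is given below:
from sklearn import linear_model
from sklearn.model_selection import cross_val_score
# X and y are assumed to be the features and target loaded earlier
reg = linear_model.LinearRegression()
# 5-fold cross-validation; returns one score per fold
cv_results = cross_val_score(reg, X, y, cv=5)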
Try to solve these questions; other members will evaluate your answers and provide sufficient support.
inMobi- https://thedatamonk.com/inmobi-data-science-interview-question/
HSBC– https://thedatamonk.com/hsbc-business-analyst-interview-questions/
OLA– https://thedatamonk.com/ola-data-analyst-interview/
Big Basket- https://thedatamonk.com/big-basket-data-analyst-interview-questions/
Swiggy – https://thedatamonk.com/swiggy-data-analyst-interview-questions/
Accenture – https://thedatamonk.com/accenture-business-analyst-interview-question/
Deloitte – https://thedatamonk.com/deloitte-data-scientist-interview-questions/
Amazon – https://thedatamonk.com/amazon-data-science-interview-questions/
Myntra – https://thedatamonk.com/myntra-data-science-interview-questions-2/
Flipkart – https://thedatamonk.com/flipkart-business-analyst-interview-questions/
SAP – https://thedatamonk.com/sap-data-science-interview-questions/
BOX8 – https://thedatamonk.com/box8-data-analyst-interview-questions-2/
Zomato – https://thedatamonk.com/zomato-data-science-interview-questions/
Oracle – https://thedatamonk.com/oracle-data-analyst-interview-questions/
Try to cover all four parts of this series.
Keep Learning 🙂
XtraMous