Structured Query Language (SQL) is tested in one of the most important rounds of a Data Scientist interview, so you need to show your expertise in it. In this article, we will start with the basic questions and then look into some of the most frequently asked ones.
P.S. – We will avoid the definitions, as these are not asked much in interviews.
Exponential Smoothing is a technique to make forecasts by using a weighted mean of past values, wherein more recent values are given higher weights.
Now, let’s try to unpack that statement and actually understand it.
Background: We have a ‘time series’ of values, typically taken at equally distant time intervals. As an example, think of the Quarterly Sales numbers for some product for the last 5 years. So we have a total of 20 sales numbers. How to use that to forecast the expected sales for the next (future) quarter?
Before we get to Exponential Smoothing, let’s understand simpler options first.
Simple Mean: We can simply average these 20 sales numbers. That would be a valid forecast. But it will miss trends and seasonal variations in our data.
Next, we could try a (Simple) Moving Average. For this technique, we pick a “window size” k. Say we decide on k=4 time periods. The SMA technique is nothing but the average of the first 4 numbers, then the set of numbers 2-to-5, then 3rd-to-6th values and so on. Doing this has the effect of “smoothing” out our sales time series because the impact of one strong (or weak) quarter gets mitigated by its other neighboring values. Because we are always dividing by 4 (our chosen window size) this is also often called “Equally-weighted” Moving Average.
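The windowing described above can be sketched in a few lines of Python. This is a minimal illustration only; the function name and the sales figures are hypothetical and not part of the original example.

```python
def simple_moving_average(values, k):
    """Return the mean of every window of k consecutive values."""
    if k <= 0 or k > len(values):
        raise ValueError("window size k must be between 1 and len(values)")
    # Window 1 covers values 1..k, window 2 covers values 2..k+1, and so on.
    return [sum(values[i:i + k]) / k for i in range(len(values) - k + 1)]


# Hypothetical quarterly sales figures, for illustration only.
sales = [10, 12, 14, 16, 18, 20]
sma = simple_moving_average(sales, k=4)
print(sma)  # [13.0, 15.0, 17.0]
```

Every value inside a window gets the same weight of 1/k, which is why this is also called an equally-weighted moving average.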
A natural thing to try next is to give the more recent values some “extra weight” because the latest sales figures could be reflecting more important recent trends. This is where Exponential Smoothing comes in. For this technique we have to pick a factor (called the smoothing constant) alpha, a number between 0 and 1. [Alpha closer to 1 means we are giving a lot of importance to the most recent values.]
Smoothed Value(at time t) = Actual value at time t * alpha + Smoothed Value at the previous time t-1 * (1 – alpha)
Notice that the formula above for Exponential Smoothing is recursive. This is neat, because it means that all the past values do end up playing a role in the forecast, albeit with decreasing importance as the values get ‘older’ in time.
With this background, you can read up on exponential smoothing techniques. And if you are using R, try the ses() command on a small series, and try changing the alpha values. That is one good way to develop intuition about ETS forecasts.
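For readers who prefer Python, the recursive formula for the smoothed value can be sketched as below. Two things here are assumptions rather than part of the text: the first smoothed value is seeded with the first observation (one common convention), and the sales numbers are made up for illustration.

```python
def exponential_smoothing(values, alpha):
    """Apply: smoothed[t] = alpha * actual[t] + (1 - alpha) * smoothed[t-1]."""
    if not 0 <= alpha <= 1:
        raise ValueError("alpha must be between 0 and 1")
    # Assumption: seed the recursion with the first observed value.
    smoothed = [float(values[0])]
    for actual in values[1:]:
        smoothed.append(alpha * actual + (1 - alpha) * smoothed[-1])
    return smoothed


# Hypothetical sales series, for illustration only.
sales = [10, 20, 30]
result = exponential_smoothing(sales, alpha=0.5)
print(result)  # [10.0, 15.0, 22.5]
```

With alpha closer to 1 the smoothed series follows the latest values more closely; with alpha closer to 0 it reacts more slowly. The last smoothed value is typically used as the forecast for the next period.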
Web Analytics is a complete world in itself. It makes life easier for the client as well as the developer.
Fortune 500 companies spend a great deal of money to get a handful of insights. We will try to provide case studies and real web analytics problems to give you insight into the job and its impact.
Web Analytics is the study of online/offline patterns and trends. It is a technique that you can use to collect, measure, report, and analyze your website data. It is normally carried out to analyze the performance of a website and optimize its web usage. Web analytics is used to track key metrics and analyze visitors’ activity and traffic flow. It is an approach to collect data and generate reports.
Why web analytics? Web analytics covers a huge spectrum. It gives you direct insight into how your website is working and what the customers are saying about you. Following are the ways in which web analytics helps you grow your business:
a. It evaluates the web content quantitatively
b. It helps you get comparative analysis
c. It lets you create hypotheses, then test and evaluate them
d. It helps stakeholders and content owners
e. Several other uses of web analytics involve customer bounce rate, conversion rate, fall-out reports, etc.
Basically, there are three types of web analytics metrics and every term used falls under one of the three buckets.
1. Count – It contains a number. It could be anything like the number of customers, number of unique visitors, or number of sales/conversions. It could be an integer or a decimal.
2. Ratio – As the name suggests, it’s a simple ratio of either two numbers or two ratios or any combination of the two in numerator and denominator.
Example. Unique visitors per
3. Key Performance Index/Indicator – Also known as KPI. A KPI could be a Count or a Ratio. It is called a Key Performance Indicator because the client/business decides its own KPIs. A KPI is a measurable value that demonstrates how effectively a company is achieving key business objectives. Organizations use KPIs to evaluate their success at reaching targets.
Let’s start with the terminologies:- 1. Visitor – A visitor is anyone who visits a website. If you purchased this book from Amazon then you were a visitor of that website.
2. Visit – A visit is an interaction, by an individual, with a website consisting of one or more requests for an analyst-definable unit of content (i.e. “page view”). If an individual has not taken another action (typically additional page views) on the site within a specified time period, the visit session will terminate.
3. Page – Page is a dimension and it is an analyst-definable unit of content. The web analytics provider asks the client about the features they want to include in order to term a page as a page.
4. Page views – Basically it’s the number of times a page has been viewed. Web server responses returning status codes indicating the requested content was missing (400 to 499) or there was a server error (500 to 599) should not be counted as a page view unless the web server has been configured to return a real page in the same response with the status code.
5. Session – Different analytics tool providers use different methods to calculate the duration of a session. The timeout is usually 30 minutes. Basically, a session ends after this period of inactivity, after which a visitor has to log in again in order to access the website.
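The inactivity rule described above can be sketched in Python as follows. This is only an illustration, assuming a 30-minute timeout and a list of page-view timestamps for a single visitor; the function name and the sample timestamps are hypothetical, and real analytics tools may apply additional rules.

```python
from datetime import datetime, timedelta


def count_visits(timestamps, timeout=timedelta(minutes=30)):
    """Count visits for one visitor; a gap longer than `timeout` starts a new visit."""
    ordered = sorted(timestamps)
    if not ordered:
        return 0
    visits = 1
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous > timeout:
            visits += 1
    return visits


# Hypothetical page-view times for one visitor.
page_views = [
    datetime(2020, 1, 1, 9, 0),
    datetime(2020, 1, 1, 9, 10),
    datetime(2020, 1, 1, 10, 0),  # 50 minutes after the previous page view
    datetime(2020, 1, 1, 10, 5),
]
visits = count_visits(page_views)
print(visits)  # 2
```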
Company Name – Deloitte
Location – Bangalore
Position – Data Scientist
Number of Rounds – 4
Round 1 – Technical Interview (R/Python)
Round 2 – Case Study and Guesstimate
Round 3 – Project Discussion and Technical
Round 4 – HR Round
Round 1 – R/Python Technical Interview
I did my project in R, so I opted for this one only. The questions ranged from beginner to intermediate level. The interview went on for 45 minutes, and the following questions were asked:
a. What are the basic data types in R?
b. What are the Data Structures in R and which one is your favorite?
c. What are the ways in which you can combine multiple sets into one?
d. What is the use of setting stringsAsFactors to FALSE?
e. Write a function to convert Fahrenheit to Kelvin. You can use the internet to find the conversion.
f. Now turn Kelvin to Celsius.
g. Write a function to print squares of numbers in sequence.
h. What is reshaping in R?
i. What is the function of unlist()?
j. Is an Array a Matrix or a Matrix an Array?
k. What is the difference between ‘%%’ and ‘%/%’?
l. What is the difference between the subset() function and the sample() function in R?
m. What is the output of the expression all(NA==NA)?
Round 2 – Case Study and Guesstimate
Guesstimate Topic – Number of laptops sold in Bangalore in one day
Business Case Study – How do you think TVF makes a profit? Was moving to its own website advantageous for TVF?
Round 3 – Project Discussion and Technical
1. So, what was one major project you did in the current account?
2. What all things are you people monitoring?
3. Tell me one important thing which you people are working on right now.
4. What is Cannibalization?
5. What was the approach to test Cannibalization on a digital platform?
6. What are a repeat and a return visitor in web analytics?
7. What were the main metrics which you had in your dashboard?
8. What are organic customers?
9. If a customer logs in both from his mobile and laptop, will it count as two Unique customers?
10. How do you find out if there is a bot surfing the website? Basically, what is bot traffic?
11. What is A/B Testing?
12. Give an example of A/B Testing.
13. What is CTR?
14. How do you measure Click Through Rate?
Round 4 – HR Round
A formal discussion on the culture of the company, my reasons for leaving the current company, salary, etc.
Company Name – Accenture
Location – Bangalore
Position – Business Analyst
Number of Rounds – 3
Round 1 – Technical Round (SQL and R)
Round 2 – Guesstimate Round
Round 3 – Project and HR Round
Round 1 – SQL and R
This was a face-to-face interview where questions were mostly asked around the basics of SQL and R. The interview lasted for 1 hour, and the following questions were asked:
1. What is the use of the NVL function in Oracle?
2. What is a Correlated Subquery?
3. What is the difference between UNION and UNION ALL?
4. Explain the different Joins in SQL using a Venn diagram.
5. What is the result of the following query? select case when null=null then 'Amit' else 'Rahul' end from dual;
6. What is a parser?
7. Can we have another column in a table, other than the primary key column, which will act as a primary key?
8. If an emp table has duplicate emp_id values, can we make emp_id the primary key?
9. How will the triggers execute if there are two or more triggers?
10. What are lapply and sapply?
11. data.table vs. data.frame
12. [a-z A-Z 0-9 -.] What does this regex mean?
Round 2 – There were just two Guesstimate Questions in this round
Guesstimate 1 – How many digital watches are sold per day in India? The solution is in the book.
Guesstimate 2 – What is the size of the market for disposable diapers in India? The solution is in the book.
Round 3 – Project and HR
My project was on using Linear Regression to predict the number of Solar Energy Systems the client needs to deploy to optimally meet the requirement.
The questions asked were mostly on Linear Regression, measuring the accuracy of the model, and its implementation. Some of the questions asked in the interview were:
1. Why did you use Linear Regression?
2. How many independent variables were there?
3. How did you figure out the important variables to consider in the model?
4. What is the use of the p-value?
5. How will you make a layman understand Linear Regression?
6. What is R-Squared error? A sample data set was provided and I was asked to calculate the Standard Deviation, R-squared error, and correlation.
7. How did you implement the results of the model in your database?
These questions were followed by regular HR questions. The complete solution is in the book.
1. We will try to build a basic KNN model with 5 neighbors. Write the code for the same.
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X, y)
y_pred = classifier.predict(X_test)
2. Let’s create a confusion matrix to see how the model performed.
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
As you can see, the accuracy of the model is 100%. This is because the training and testing datasets are very small. Only once you try your hand at a larger dataset will you be able to compare the performance of different models.
3. Don’t get confused between KNN and K-means algorithm. What is the difference between the two? K-Nearest Neighbors is a supervised classification algorithm, while k-means clustering is an unsupervised clustering algorithm. While the mechanisms may seem similar at first, what this really means is that in order for K-Nearest Neighbors to work, you need labeled data that you want to classify an unlabeled point into (thus the nearest neighbor part). K-means clustering requires only a set of unlabeled points and the number of clusters k: the algorithm will take unlabeled points and gradually learn how to cluster them into groups by repeatedly assigning each point to its nearest cluster center and recomputing each center as the mean of the points assigned to it.
The critical difference here is that KNN needs labeled points and is thus supervised learning, while k-means doesn’t — and is thus unsupervised learning.
After going through the above algorithms, you must have got a good idea about implementing these algorithms on different datasets provided in various Hackathons. Go through Kaggle or Analytics Vidhya to participate in Live and Past Hackathons.
Apart from implementation of these 5 algorithms, we also need to know the evaluation metrics, some definitions used in supervised learning, etc. All of these will be covered below
4. What is ROC curve? The ROC curve is a graphical representation of the contrast between true positive rates and the false positive rate at various thresholds. It’s often used as a proxy for the trade-off between the sensitivity of the model (true positives) vs the fall-out or the probability it will trigger a false alarm (false positives).
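To make the idea of "various thresholds" concrete, here is a small pure-Python sketch that computes one (false positive rate, true positive rate) point per threshold. The function name, labels, and scores are made up for illustration; in practice a library routine such as scikit-learn's roc_curve is normally used instead.

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = []
    # Use every distinct score as a threshold, from strictest to most lenient.
    for threshold in sorted(set(scores), reverse=True):
        predicted = [s >= threshold for s in scores]
        tp = sum(1 for p, t in zip(predicted, y_true) if p and t == 1)
        fp = sum(1 for p, t in zip(predicted, y_true) if p and t == 0)
        points.append((fp / negatives, tp / positives))
    return points


# Hypothetical true labels (1 = positive) and model scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
points = roc_points(y_true, scores)
print(points)  # [(0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
```

Plotting these points with the false positive rate on the x-axis and the true positive rate on the y-axis gives the ROC curve.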
5. Define Precision and Recall. Recall is also known as the true positive rate: the amount of positives your model claims compared to the actual number of positives there are throughout the data. Precision is also known as the positive predictive value, and it is a measure of the amount of accurate positives your model claims compared to the number of positives it actually claims. It can be easier to think of recall and precision in the context of a case where you’ve predicted that there were 10 apples and 5 oranges in a case of 10 apples. You’d have perfect recall (there are actually 10 apples, and you predicted there would be 10) but 66.7% precision because out of the 15 events you predicted, only 10 (the apples) are correct.
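The apple example can be checked with a few lines of Python. This is a minimal sketch; the function name is hypothetical, and the counts are read off the example above (10 correct apple predictions, 5 incorrect predictions, no apples missed).

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall


# 15 items were predicted, 10 of them are real apples, and no apple was missed.
precision, recall = precision_recall(true_positives=10, false_positives=5, false_negatives=0)
print(round(precision, 3), recall)  # 0.667 1.0
```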
6. What is the difference between L1 and L2 Regularization? L2 regularization tends to spread the penalty across all the terms, shrinking every weight a little, while L1 produces sparse solutions, with many weights driven to exactly 0 so that variables are effectively either kept or dropped. L1 corresponds to setting a Laplace prior on the terms, while L2 corresponds to a Gaussian prior.
7.What’s the difference between Type I and Type II error? Type I error is a false positive, while Type II error is a false negative. Briefly stated, Type I error means claiming something has happened when it hasn’t, while Type II error means that you claim nothing is happening when in fact something is. A clever way to think about this is to think of Type I error as telling a man he is pregnant, while Type II error means you tell a pregnant woman she isn’t carrying a baby.
8. What’s the trade-off between bias and variance? Bias is an error due to erroneous or overly simplistic assumptions in the learning algorithm you’re using. This can lead to the model underfitting your data, making it hard for it to have high predictive accuracy and for you to generalize your knowledge from the training set to the test set.
Variance is an error due to too much complexity in the learning algorithm you’re using. This leads to the algorithm being highly sensitive to high degrees of variation in your training data, which can lead your model to overfit the data. You’ll be carrying too much noise from your training data for your model to be very useful for your test data.
The bias-variance decomposition essentially decomposes the learning error from any algorithm into the sum of the squared bias, the variance, and a bit of irreducible error due to noise in the underlying dataset. Essentially, if you make the model more complex and add more variables, you’ll lose bias but gain some variance; in order to get the optimally reduced amount of error, you’ll have to trade off bias and variance. You don’t want either high bias or high variance in your model.
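For squared-error loss, the decomposition is usually written as below. The notation is an assumption introduced here for illustration: the data are taken to follow y = f(x) + ε with noise variance σ², and f̂ denotes the model learned from a random training set.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```

The first term shrinks as the model becomes more flexible, the second term grows, and the third does not depend on the model at all.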
9. How is a decision tree pruned? Pruning is what happens in decision trees when branches that have weak predictive power are removed in order to reduce the complexity of the model and increase the predictive accuracy of a decision tree model. Pruning can happen bottom-up and top-down, with approaches such as reduced error pruning and cost complexity pruning. Reduced error pruning is perhaps the simplest version: replace each node with its most popular class. If doing so doesn’t decrease predictive accuracy, keep it pruned. While simple, this heuristic actually comes pretty close to an approach that would optimize for maximum accuracy.
10. Which is more important to you – model accuracy, or model performance? Model accuracy can be a very misleading measure by which to judge a model. A model could be useless even with 99% accuracy. Suppose you are creating a model to classify whether a patient is infected by a very rare disease. Then even if you tag every patient as “not infected”, the model will have more than 99% accuracy. But this model is not at all useful. So, model performance is the best metric by which to judge the working of a model.
11. What’s the F1 score? How would you use it? The F1 score is a measure of a model’s performance. It is the harmonic mean of the precision and recall of a model, with results tending to 1 being the best, and those tending to 0 being the worst. You would use it in classification tests where true negatives don’t matter much.
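As a quick illustration, the F1 score can be computed from precision and recall with the standard formula 2 * precision * recall / (precision + recall). The function name and the input values below are hypothetical.

```python
def f1_from(precision, recall):
    """F1 = 2 * precision * recall / (precision + recall)."""
    if precision + recall == 0:
        return 0.0  # convention used here when both inputs are zero
    return 2 * precision * recall / (precision + recall)


# Hypothetical values: perfect recall but only 50% precision.
f1 = f1_from(precision=0.5, recall=1.0)
print(round(f1, 3))  # 0.667
```

Note that the result (about 0.667) is lower than the plain average of 0.5 and 1.0 (0.75): the F1 score is pulled toward the weaker of the two values.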
12. How would you handle an imbalanced dataset? An imbalanced dataset is when you have, for example, a classification test and 90% of the data is in one class. That leads to problems: an accuracy of 90% can be skewed if you have no predictive power on the other category of data! Here are a few tactics to get over the hump: 1- Collect more data to even the imbalances in the dataset. 2- Resample the dataset to correct for imbalances. 3- Try a different algorithm altogether on your dataset. What’s important here is that you have a keen sense for what damage an unbalanced dataset can cause, and how to balance that.
13. When should you use classification over regression? Classification produces discrete values and maps the dataset to strict categories, while regression gives you continuous results that allow you to better distinguish differences between individual points. You would use classification over regression if you wanted your results to reflect the belongingness of data points in your dataset to certain explicit categories (ex: If you wanted to know whether a name was male or female rather than just how correlated they were with male and female names.)
14. Name an example where ensemble techniques might be useful. Ensemble techniques use a combination of learning algorithms to achieve better predictive performance. They typically reduce overfitting in models and make the model more robust (unlikely to be influenced by small changes in the training data). You could list some examples of ensemble methods, from bagging to boosting to a “bucket of models” method, and demonstrate how they could increase predictive power.
15. How do you ensure you’re not overfitting with a model? This is a simple restatement of a fundamental problem in machine learning: the possibility of overfitting training data and carrying the noise of that data through to the test set, thereby providing inaccurate generalizations. There are three main methods to avoid overfitting: 1- Keep the model simpler: reduce variance by taking into account fewer variables and parameters, thereby removing some of the noise in the training data. 2- Use cross-validation techniques such as k-folds cross-validation. 3- Use regularization techniques such as LASSO that penalize certain model parameters if they’re likely to cause overfitting.
16. What evaluation approaches would you use to gauge the effectiveness of a machine learning model? You would first split the dataset into training and test sets, or perhaps use cross-validation techniques to further segment the dataset into composite sets of training and test sets within the data. You could use measures such as the F1 score, the accuracy, and the confusion matrix. What’s important here is to demonstrate that you understand the nuances of how a model is measured and how to choose the right performance measures for the right situations.
17. How do you handle missing or corrupted data in a dataset? You could find missing/corrupted data in a dataset and either drop those rows or columns, or decide to replace them with another value. In Pandas, there are two very useful methods: isnull() and dropna() that will help you find columns of data with missing or corrupted data and drop those values. If you want to fill the invalid values with a placeholder value (for example, 0), you could use the fillna() method.
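A minimal sketch of these three pandas methods on a made-up data frame is shown below. The column names and values are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing values.
df = pd.DataFrame({
    "weight": [80.0, np.nan, 95.0],
    "blood_sugar": [150.0, 200.0, np.nan],
})

# isnull() flags missing cells; summing counts them per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# dropna() drops every row that contains at least one missing value.
dropped = df.dropna()

# fillna(0) replaces missing values with the placeholder 0.
filled = df.fillna(0)
print(len(dropped), int(filled.isnull().sum().sum()))  # 1 0
```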
18. What happens when you take a large value for K in the KNN algorithm? A large value of K makes the KNN algorithm computationally expensive. It also means that each prediction is based on a large neighborhood, so the decision boundary becomes smoother and the model may underfit.
19. What happens when you take a smaller value for K? A small value of k means that noise will have a higher influence on the result.
20. How to select the optimum value of k in KNN? There are methods which are used to identify the optimal value of k in the KNN algorithm. The methods are: a. Elbow method b. Cross-validation method
21. What is a cross-validation method? Cross-validation can be used to estimate the test error associated with a learning method in order to evaluate its performance, or to select the appropriate level of flexibility.
22. How does the KNN algorithm work? KNN works by analogy. The idea is that you are what you resemble. So when we want to classify a point we look at its K-closest (most similar) neighbors and we classify the point as the majority class in those neighbors. KNN depends on two things: a metric used to compute the distance between two points, and the value of “k”, the number of neighbors to consider. When “k” is a very small number KNN can overfit: it will classify just based on the closest neighbors instead of learning a good separating frontier between classes. But if “k” is a very big number KNN will underfit: in the limit, if k=n, KNN will think every point belongs to the class that has more samples. KNN can also be used for regression: just average the values of the k nearest neighbors of a point to predict the value for a new point. One nice advantage of KNN is that it can work fine if you only have a few samples for some of the classes.
23. What is the difference between binary classification and multi-class classification? In a binary classification model we need to classify the output in only two types Like Typhoid or normal, Male or Female, Survived or Not, etc. In multi class classification we need to classify the output in more than two types. Like, Types of flowers or types of animals, etc.
24. Write a program to do cross validation for KNN
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
# creating list of K for KNN
myList = list(range(1, 50))
# subsetting just the odd ones
neighbors = [k for k in myList if k % 2 != 0]
# empty list that will hold cv scores
cv_scores = []
# perform 10-fold cross validation
for k in neighbors:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
    cv_scores.append(scores.mean())
25. Guess the case here: a. If a person will purchase a new home or not – Classification b. Number of bikes rented in a month – Regression c. Whether a person has diabetes – Classification
26. What is the reshape function? NumPy provides the reshape() function on the NumPy array object that can be used to reshape the data. The reshape() function takes an argument that specifies the new shape of the array. It is common to need to reshape a one-dimensional array into a two-dimensional array with one column and multiple rows. It gives a new shape to an array without changing its data: an array of shape (123,) becomes (123, 1) if we use the code y.reshape(-1, 1).
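A short sketch of this reshape, using a made-up array of 123 values:

```python
import numpy as np

# Hypothetical one-dimensional array with 123 elements.
y = np.arange(123)

# -1 tells NumPy to infer the number of rows; 1 fixes a single column.
column = y.reshape(-1, 1)
print(y.shape, column.shape)  # (123,) (123, 1)
```

The data itself is unchanged; only the shape differs, which is what many scikit-learn estimators expect for a single feature.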
27. What is correlation? How can you find the correlation of variables on a data frame? Correlation is a statistical measure that indicates the extent to which two or more variables fluctuate together. A positive correlation indicates the extent to which those variables increase or decrease in parallel; a negative correlation indicates the extent to which one variable increases as the other decreases.
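On a pandas data frame, pairwise correlations can be computed with the corr() method. The sketch below uses a made-up data frame with one perfectly increasing and one perfectly decreasing relationship; the column names are hypothetical.

```python
import pandas as pd

# Hypothetical data: "up" rises with x, "down" falls as x rises.
df = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "up": [2, 4, 6, 8],
    "down": [8, 6, 4, 2],
})

# corr() returns the matrix of pairwise Pearson correlations by default.
corr = df.corr()
print(corr)
```

Here the correlation between x and up is 1 (perfect positive), and between x and down is -1 (perfect negative).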
28.How to explore a dataset in Python? There are multiple commands which can help you in exploring a data set. Following are a few commands: .info() .describe() .head()
29.What is loss function? At its core, a loss function is incredibly simple: it’s a method of evaluating how well your algorithm models your dataset. If your predictions are totally off, your loss function will output a higher number. If they’re pretty good, it’ll output a lower number. As you change pieces of your algorithm to try and improve your model, your loss function will tell you if you’re getting anywhere.
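As one concrete example, mean squared error is a commonly used loss function for regression. The sketch below is only an illustration with made-up numbers; it shows the loss being higher for predictions that are totally off and lower for predictions that are pretty good.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical actual values and two sets of predictions.
actual = [3.0, 5.0, 7.0]
good_predictions = [3.0, 5.0, 8.0]
bad_predictions = [0.0, 0.0, 0.0]

good_loss = mean_squared_error(actual, good_predictions)
bad_loss = mean_squared_error(actual, bad_predictions)
print(good_loss < bad_loss)  # True
```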
30. Why can’t we use 1000 folds to get the best accuracy? More folds result in more computational expense. The sample code for cross_val_score is given below:
from sklearn import linear_model
from sklearn.model_selection import cross_val_score
reg = linear_model.LinearRegression()
cv_results = cross_val_score(reg, X, y, cv=5)
1. What is Logistic Regression? Logistic Regression is a Machine Learning classification algorithm that is used to predict the probability of a categorical dependent variable. In logistic regression, the dependent variable is a binary variable that contains data coded as 1 (yes, success, etc.) or 0 (no, failure, etc.). In other words, the logistic regression model predicts P(Y=1) as a function of X.
What are the assumptions of Logistic Regression? Binary logistic regression requires the dependent variable to be binary.
2.What are the types of questions that Logistic regression can examine? How does the probability of getting lung cancer (yes vs. no) change for every additional pound a person is overweight and for every pack of cigarettes smoked per day? Do body weight, calorie intake, fat intake, and age have an influence on the probability of having a heart attack (yes vs. no)?
3. What are the major assumptions in Logistic regression? The dependent variable should be dichotomous in nature (e.g., presence vs. absence). There should be no outliers in the data, which can be assessed by converting the continuous predictors to standardized scores, and removing values below -3.29 or greater than 3.29. There should be no high correlations (multicollinearity) among the predictors. This can be assessed by a correlation matrix among the predictors. Tabachnick and Fidell (2013) suggest that as long as correlation coefficients among independent variables are less than 0.90 the assumption is met.
4.What is over fitting? When selecting the model for the logistic regression analysis, another important consideration is the model fit. Adding independent variables to a logistic regression model will always increase the amount of variance explained in the log odds (typically expressed as R²). However, adding more and more variables to the model can result in overfitting, which reduces the generalizability of the model beyond the data on which the model is fit.
Let’s explore one more model, i.e. the Support Vector Classifier.
6. What is a Support Vector Classifier? A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyperplane. In other words, given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane which categorizes new examples. In two-dimensional space this hyperplane is a line dividing a plane in two parts, with each class lying on either side.
7. Give an example to explain SVM. Let’s suppose we have a distribution of a few items, some circles and some rectangles.
What an SVM does is make a line of separation between the types of objects, i.e. circle and rectangle. The attributes of the test dataset are then examined on various parameters, and each test point is placed in one of the two buckets.
8. What will happen if data points overlap, i.e. circles and rectangles are on the same point? In a real-world application, finding the perfect class boundary for millions of training data points takes a lot of time, so some misclassification is tolerated. How much is tolerated is controlled by the regularization parameter, as you will see in the coding. In the next section, we define two terms: regularization parameter and gamma. These are tuning parameters in the SVM classifier. By varying them we can achieve a considerably nonlinear classification line with more accuracy in a reasonable amount of time.
9.What is a kernel? SVM algorithms use a set of mathematical functions that are defined as the kernel. The function of kernel is to take data as input and transform it into the required form. For linear kernel the equation for prediction for a new input using the dot product between the input (x) and each support vector (xi) is calculated as follows:
f(x) = B0 + sum(ai * (x, xi))
This is an equation that involves calculating the inner products of a new input vector (x) with all support vectors in training data. The coefficients B0 and ai (for each input) must be estimated from the training data by the learning algorithm.
10. What is the equation for the polynomial kernel? Leave the question for now if you don’t want to go deep into Mathematics. The polynomial kernel can be written as K(x,xi) = (1 + sum(x * xi))^d and the exponential kernel as K(x,xi) = exp(-gamma * sum((x - xi)^2))
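For readers who find code easier than formulas, here is a small NumPy sketch of three kernels. It assumes the commonly used forms: the dot product for the linear kernel, (1 + x·xi)^d for the polynomial kernel, and exp(-gamma * sum((x - xi)^2)) for the exponential (RBF) kernel. The function names, default parameters, and vectors are hypothetical.

```python
import numpy as np


def linear_kernel(x, xi):
    """Inner (dot) product of the two vectors."""
    return float(np.dot(x, xi))


def polynomial_kernel(x, xi, d=2):
    """(1 + x . xi) raised to the degree d."""
    return float((1 + np.dot(x, xi)) ** d)


def rbf_kernel(x, xi, gamma=0.1):
    """exp(-gamma * squared Euclidean distance between x and xi)."""
    return float(np.exp(-gamma * np.sum((x - xi) ** 2)))


# Hypothetical input vector and support vector.
x = np.array([1.0, 2.0])
xi = np.array([3.0, 4.0])

print(linear_kernel(x, xi))      # 11.0
print(polynomial_kernel(x, xi))  # 144.0
print(rbf_kernel(x, xi))         # exp(-0.8), roughly 0.449
```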
11. What is Regularization? The Regularization parameter (often termed as C parameter in python’s sklearn library) tells the SVM optimization how much you want to avoid misclassifying each training example.
12. What is the impact of a large and a small value of c? For large values of C, the optimization will choose a smaller-margin hyper plane if that hyper plane does a better job of getting all the training points classified correctly. Conversely, a very small value of C will cause the optimizer to look for a larger-margin separating hyper plane, even if that hyper plane misclassifies more points.
13.What is Gamma? The gamma parameter defines how far the influence of a single training example reaches, with low values meaning ‘far’ and high values meaning ‘close’. In other words, with low gamma, points far away from plausible separation line are considered in calculation for the separation line. Whereas high gamma means the points close to plausible line are considered in calculation.
14.What is a margin? A margin is a separation of line to the closest class points.
15. What is a good and a bad margin? A good margin is one where this separation is larger for both the classes. A good margin allows the points to be in their respective classes without crossing to the other class.
16. Let’s build the Support Vector Classifier model on our dataset
from sklearn.svm import SVC
# Support Vector Classifier
SupportV = SVC()
SupportV.fit(X, y)
SupportV_prediction = SupportV.predict(X_test)
print(SupportV_prediction)
17. How to measure the accuracy of models? We can use the accuracy_score() function, which takes two parameters: the real output, i.e. y_test, and the prediction values. The imports needed are:
import numpy as np
from sklearn.metrics import accuracy_score
18. Print the accuracy of each model
DecisionT_acc = accuracy_score(y_test, DecisionT_prediction)
RandomF_acc = accuracy_score(y_test, RandomF_prediction)
LogisticR_acc = accuracy_score(y_test, LogisticR_prediction)
SVC_acc = accuracy_score(y_test, SupportV_prediction)
print(DecisionT_acc)
print(RandomF_acc)
print(LogisticR_acc)
print(SVC_acc)
K-Nearest Neighbors is one such algorithm which is very useful in classification problems.
It is a very basic algorithm which gives good accuracy.
19.What is KNN Algorithm? The intuition behind the KNN algorithm is one of the simplest of all the supervised machine learning algorithms. It simply calculates the distance of a new data point to all other training data points. The distance can be of any type e.g Euclidean or Manhattan etc. It then selects the K-nearest data points, where K can be any integer. Finally it assigns the data point to the class to which the majority of the K data points belong.
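The three steps above (compute distances, pick the K nearest points, take a majority vote) can be sketched from scratch as follows. This is an illustration only, assuming Euclidean distance; the function name and the toy points are made up, and a library implementation such as scikit-learn's KNeighborsClassifier would normally be used in practice.

```python
import math
from collections import Counter


def knn_predict(X_train, y_train, new_point, k=3):
    """Classify new_point by majority vote among its k nearest training points."""
    # Step 1: Euclidean distance from the new point to every training point.
    distances = [
        (math.dist(row, new_point), label)
        for row, label in zip(X_train, y_train)
    ]
    # Step 2: keep the k closest points.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 3: majority vote among their labels.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated groups.
X_train = [[1, 1], [1, 2], [2, 1], [8, 8], [9, 8], [8, 9]]
y_train = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(X_train, y_train, [2, 2]))  # A
print(knn_predict(X_train, y_train, [7, 8]))  # B
```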
20. What are the pros of KNN model? a. It is extremely easy to implement b. The KNN algorithm much faster than other algorithms that require training e.g. SVM, linear regression, etc. c. Since the algorithm requires no training before making predictions, new data can be added seamlessly. d. There are only two parameters required to implement KNN i.e. the value of K and the distance function (e.g. Euclidean or Manhattan etc.)
21. What are the cons of KNN model?
a. The KNN algorithm doesn’t work well with high dimensional data, because with a large number of dimensions the distances between points become less informative and harder for the algorithm to use.
b. The KNN algorithm has a high prediction cost for large datasets. This is because, in large datasets, the cost of calculating the distance between the new point and each existing point becomes higher.
c. Finally, the KNN algorithm doesn’t work well with categorical features, since it is difficult to define a distance between values of a categorical feature.
1. Let’s create our dataset first. We will create a Thyroid data set with the attributes Weight, Blood Sugar, and Sex (M=1, F=0).
# Weight, Blood Sugar, and Sex (Male = 1, Female = 0)
X = [[80, 150, 0], [90, 200, 0], [95, 160, 1], [110, 200, 1], [70, 110, 0], [60, 100, 1], [70, 300, 0], [100, 200, 1], [140, 300, 0], [60, 100, 1], [70, 100, 0], [100, 300, 1], [70, 110, 1]]
y = ['Thyroid', 'Thyroid', 'Normal', 'Normal', 'Normal', 'Normal', 'Thyroid', 'Normal', 'Thyroid', 'Normal', 'Normal', 'Thyroid', 'Normal']
2. What are X and y? The list X contains the attributes (one inner list per patient) and y contains the corresponding classification.
3. How to build the testing dataset? We also have to create a test data set. Let’s take some values that are not identical to the values above but are close to a labeled example. We know that [90,200,0] is a thyroid patient, so we will test our model on [100,250,0], which we label as Thyroid.
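In code, that test set might look like the sketch below. The names X_test and y_test match the ones used by the predict and accuracy calls elsewhere in this article. Writing X_test as a nested list reflects that scikit-learn's predict expects one inner list per test row, even when there is only a single row.

```python
# One test row, in the same [weight, blood sugar, sex] order as the training rows
X_test = [[100, 250, 0]]

# The label we expect the models to predict for that row
y_test = ["Thyroid"]

print(len(X_test), len(y_test))
```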
4. Let’s understand Decision Tree first. Define Decision Tree. Decision trees learn from data to approximate the target with a set of if-then-else decision rules. The deeper the tree, the more complex the decision rules and the more closely the model fits the training data. A decision tree builds classification or regression models in the form of a tree structure. It breaks down a data set into smaller and smaller subsets while, at the same time, an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes.
5. Define Leaf and Node of a Decision Tree. A decision node has two or more branches. A leaf node represents a classification or decision. The topmost decision node in a tree, which corresponds to the best predictor, is called the root node. Decision trees can handle both categorical and numerical data.
6. How does a Decision Tree work? What is splitting? A decision tree follows three steps: Splitting, Pruning, and Tree Selection. It starts with Splitting, which is the process of partitioning the data into subsets. Splits are formed on a particular variable. In the above decision tree, the split on the first level happened on the variable Age. Further splits then happened on Pizza and on exercise in the morning.
7. What is pruning? The shortening of branches of the tree. Pruning is the process of reducing the size of the tree by turning some branch nodes into leaf nodes and removing the leaf nodes under the original branch. Pruning is useful because classification trees may fit the training data well but do a poor job of classifying new values. A simpler tree often avoids over-fitting.
8. What is tree selection? The process of finding the smallest tree that fits the data. Usually this is the tree that yields the lowest cross-validated error.
9. What is entropy? Entropy measures how mixed the class labels in a sample are. A decision tree is built top-down from a root node and involves partitioning the data into subsets that contain instances with similar values (homogeneous). If the sample is completely homogeneous, the entropy is zero, and if the sample is equally divided between two classes, it has an entropy of one.
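The two boundary cases above follow from the usual definition of entropy, H = -sum(p_i * log2(p_i)), where p_i is the fraction of the sample belonging to class i. A minimal sketch, with a hypothetical helper named entropy and made-up label lists:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

# Completely homogeneous sample: entropy is zero
print(entropy(["Normal", "Normal", "Normal", "Normal"]))    # 0.0

# Sample split equally between two classes: entropy is one
print(entropy(["Normal", "Thyroid", "Normal", "Thyroid"]))  # 1.0
```

A split is considered good when the subsets it produces have lower entropy than the sample it started from.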
10. Let’s create a Decision Tree Classifier
from sklearn import tree
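The explanation in the next question refers to a model named DecisionT that is fitted with DecisionT.fit(X, y) and then used to predict. A minimal sketch of those steps is given here, assuming the X and y lists defined earlier and a one-row X_test built from the [100,250,0] example. The name DecisionT_prediction is taken from the accuracy listing in this article.

```python
from sklearn import tree

# Training data from this article: [weight, blood sugar, sex (M=1, F=0)]
X = [[80, 150, 0], [90, 200, 0], [95, 160, 1], [110, 200, 1], [70, 110, 0],
     [60, 100, 1], [70, 300, 0], [100, 200, 1], [140, 300, 0], [60, 100, 1],
     [70, 100, 0], [100, 300, 1], [70, 110, 1]]
y = ['Thyroid', 'Thyroid', 'Normal', 'Normal', 'Normal', 'Normal', 'Thyroid',
     'Normal', 'Thyroid', 'Normal', 'Normal', 'Thyroid', 'Normal']

# Assumed test set: the single row discussed earlier
X_test = [[100, 250, 0]]

DecisionT = tree.DecisionTreeClassifier()   # create the (untrained) model
DecisionT.fit(X, y)                         # learn decision rules from the training data
DecisionT_prediction = DecisionT.predict(X_test)  # one predicted label per test row
print(DecisionT_prediction)
```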
11. Explain the above code. First we imported the tree package from the sklearn library. The DecisionTreeClassifier class in the tree package holds the model, so we initialize our model ‘DecisionT’ with it. This creates a decision tree model. Now we need to fit the model on our training data set, i.e. X and y, so we use DecisionT.fit(X,y). The predict function takes your test data set and predicts a label for each row on the basis of what the model learned.
12. What is a Random Forest? Random forest is a supervised learning algorithm. It can be used for both classification and regression, and it is flexible and easy to use. A forest is made up of trees, and it is said that the more trees it has, the more robust the forest is. A random forest creates decision trees on randomly selected data samples, gets a prediction from each tree, and selects the best solution by means of voting. It also provides a pretty good indicator of feature importance.
13. What are the applications of Random Forest? Random forests have a variety of applications, such as recommendation engines, image classification and feature selection. They can be used to classify loyal loan applicants, identify fraudulent activity and predict diseases. Random forest also lies at the base of the Boruta algorithm, which selects important features in a dataset.
14. What is feature importance in Random Forest? A great quality of the random forest algorithm is that it is very easy to measure the relative importance of each feature for the prediction. Sklearn provides a great tool for this, which measures a feature’s importance by looking at how much the tree nodes that use that feature reduce impurity across all trees in the forest. It computes this score automatically for each feature after training and scales the results so that the sum of all importances is equal to 1.
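In scikit-learn these scores are exposed on a fitted forest through the feature_importances_ attribute. A minimal sketch on the thyroid data from this article follows. The model name RandomF, the feature_names list, and the n_estimators and random_state values are choices made for this illustration.

```python
from sklearn.ensemble import RandomForestClassifier

# Training data from this article: [weight, blood sugar, sex (M=1, F=0)]
X = [[80, 150, 0], [90, 200, 0], [95, 160, 1], [110, 200, 1], [70, 110, 0],
     [60, 100, 1], [70, 300, 0], [100, 200, 1], [140, 300, 0], [60, 100, 1],
     [70, 100, 0], [100, 300, 1], [70, 110, 1]]
y = ['Thyroid', 'Thyroid', 'Normal', 'Normal', 'Normal', 'Normal', 'Thyroid',
     'Normal', 'Thyroid', 'Normal', 'Normal', 'Thyroid', 'Normal']
feature_names = ["Weight", "Blood Sugar", "Sex"]  # labels for printing only

# random_state is fixed only so that repeated runs give the same forest
RandomF = RandomForestClassifier(n_estimators=100, random_state=0)
RandomF.fit(X, y)

# One impurity-based score per feature, scaled to sum to 1
importances = RandomF.feature_importances_
for name, score in zip(feature_names, importances):
    print(f"{name}: {score:.3f}")
print("sum:", round(float(sum(importances)), 6))
```

With only 13 rows the individual scores are not reliable; the sketch is meant to show where the numbers come from, not to draw conclusions about the features.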
15. How does the Random Forest Algorithm work?
Step 1: The algorithm selects random samples from the given dataset.
Step 2: It constructs a decision tree for each sample and gets a prediction result from each tree.
Step 3: It collects a vote from each predicted result.
Step 4: It selects the prediction result with the most votes as the final prediction.
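The four steps can be sketched in plain Python. To keep the example short, each "tree" here is a one-split decision stump rather than a full decision tree, and no random feature selection is done, so this is a simplified illustration of sampling and voting, not scikit-learn's algorithm. All function names and the toy data are hypothetical.

```python
import random
from collections import Counter

def majority(labels):
    """Most common label in a list."""
    return Counter(labels).most_common(1)[0][0]

def bootstrap_sample(X, y, rng):
    """Step 1: draw len(X) rows at random, with replacement."""
    idx = [rng.randrange(len(X)) for _ in range(len(X))]
    return [X[i] for i in idx], [y[i] for i in idx]

def fit_stump(X, y):
    """Step 2 (simplified): a one-split tree chosen to minimise training errors."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            if not left or not right:
                continue
            l_lab, r_lab = majority(left), majority(right)
            errors = (sum(lab != l_lab for lab in left)
                      + sum(lab != r_lab for lab in right))
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    if best is None:
        # The sample could not be split: always predict its majority class
        label = majority(y)
        return lambda row: label
    _, f, t, l_lab, r_lab = best
    return lambda row: l_lab if row[f] <= t else r_lab

def fit_forest(X, y, n_trees=25, seed=0):
    """Steps 1 and 2 repeated: one stump per bootstrap sample."""
    rng = random.Random(seed)
    return [fit_stump(*bootstrap_sample(X, y, rng)) for _ in range(n_trees)]

def predict_forest(forest, row):
    """Steps 3 and 4: one vote per tree, return the label with the most votes."""
    return majority([tree(row) for tree in forest])

# Hypothetical toy data with one feature and two well-separated classes
X_toy = [[1], [2], [3], [10], [11], [12]]
y_toy = ["a", "a", "a", "b", "b", "b"]
forest = fit_forest(X_toy, y_toy)
print(predict_forest(forest, [0]), predict_forest(forest, [20]))
```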
16. What are the advantages of Random Forest? The main advantage of Random Forest is that it combines the results of many trees, so the accuracy on the training dataset is very high. It is also much less prone to over-fitting than a single decision tree. One more advantage is that it can be used for both Regression and Classification problems.
17. What are the disadvantages of Random Forest? Accuracy on the test data set is usually lower than on the training data set, because the algorithm has not been trained on those unseen values. The model is also made up of multiple trees, so it is hard to interpret how it reaches a prediction.
18. Do a Random Forest vs Decision Tree.
a. Many decision trees make up a forest.
b. Decision trees are computationally faster.
c. Random Forest is difficult to interpret.
19. We already have a Decision Tree model in place, so now let’s create a Random Forest Classifier.
from sklearn.ensemble import RandomForestClassifier
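A minimal sketch of the remaining steps, mirroring the Decision Tree model. It assumes the X and y lists defined earlier and a one-row X_test built from the [100,250,0] example. The model name RandomF and the n_estimators and random_state values are choices made for this illustration, while RandomF_prediction matches the name used in the accuracy listing.

```python
from sklearn.ensemble import RandomForestClassifier

# Training data from this article: [weight, blood sugar, sex (M=1, F=0)]
X = [[80, 150, 0], [90, 200, 0], [95, 160, 1], [110, 200, 1], [70, 110, 0],
     [60, 100, 1], [70, 300, 0], [100, 200, 1], [140, 300, 0], [60, 100, 1],
     [70, 100, 0], [100, 300, 1], [70, 110, 1]]
y = ['Thyroid', 'Thyroid', 'Normal', 'Normal', 'Normal', 'Normal', 'Thyroid',
     'Normal', 'Thyroid', 'Normal', 'Normal', 'Thyroid', 'Normal']

# Assumed test set: the single row discussed earlier
X_test = [[100, 250, 0]]

# random_state is fixed only so that repeated runs build the same forest
RandomF = RandomForestClassifier(n_estimators=100, random_state=0)
RandomF.fit(X, y)                              # train every tree in the forest
RandomF_prediction = RandomF.predict(X_test)   # majority vote across the trees
print(RandomF_prediction)
```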
20. Explain the code above. The process of building the model is the same as for the Decision Tree: import the RandomForestClassifier class, fit the model on the training dataset, and use the predict() function to predict values for the test data.
Let’s create one more model, a basic but highly effective one, i.e. the Logistic Regression model.