## Machine Learning

Most asked Machine Learning Questions

What is mostly asked in any Data Science interview?

It’s the machine learning questions, mixed with statistics and questions on how to make a model perform in real time.

We have worked with 30+ industry experts from different product-based companies to come up with these most asked machine learning questions.

ANSWERED

- What is the importance of data cleaning?
- What are the basic checks for data cleaning?
- What is the difference between forecasting and prediction?
- Explain missing value treatment by mean, mode, median, and KNN imputation.
- What does it mean when I say “The model has high accuracy on the training dataset but low accuracy on the testing dataset”?
- Define loss function.
- What are the metrics to measure the performance of your Linear Regression model?
- What is the need to remove multicollinearity?
- Define ROC in layman terms.
- What is the best fit line in Linear Regression?
- When do we use Linear and when do we use Logistic regression?
- How is Ridge Regression different from Linear Regression?
- Explain precision in the simplest terms.
- Explain recall in simple terms.
- How do the values of R-squared and adjusted R-squared change when you add a new variable to your model?
- How to select the number of trees in a random forest?
- How to choose k in k-means?
- What is a decision tree?
- What are the different use cases where machine learning algorithms can be used?
- Differentiate between KNN and K-means.
- What are some of the latest machine learning papers that you have read?
- Which approaches are used to evaluate the prediction accuracy of a logistic regression model?
- What is the meaning of “Overfitting”?
- How will you avoid it?
- How to do dimension reduction?
- Explain the confusion matrix in machine learning.
- What is Auto Regression?
- How to handle an imbalanced dataset?
- How would you determine which rooms and areas are underutilized and overutilized?
- What evaluation approaches would you work to gauge the effectiveness of a machine learning model?
- How can we implement different word2vec methods?
- How do linear and logistic regression differ in their error minimization techniques?
- What is topic modeling?
- What is pruning in case of decision trees?
- Analyze model’s performance.
- What are the assumptions required for linear regression?
- Which clustering algorithm would you use?
- How to improve the accuracy of a linear regression model?
- Explain the Box-Cox transformation in regression models.
- How can we use the Naive Bayes classifier for categorical features?
- When can parallelism make your algorithms run faster?
- Which among the following two metrics would you consider while implementing spam classifier?
- How to cluster unsupervised data where all the attributes and its values are categorical?
- Is it beneficial to perform dimensionality reduction before fitting an SVM?
- What is the need of feature selection?
- When are the linear regression lines perpendicular?
- Is rotation necessary in PCA?
- Prior to building any kind of model, why do we need to complete the feature selection step?
- Name and describe three different kernel functions and in what situation you would use each.
- Perform dimension reduction.
- What should I do if one of my columns with integer values has more than 30% missing values?
- Which one would likely perform better: Linear Regression or Random Forest Regression? Why?
- Do gradient descent methods always converge to the same point?
- How to make the model free from underfitting?
- How would you evaluate a logistic regression model?
- Do you think 50 small decision trees are better than a large one?
- How do you improve the table?
- What metrics should we use to evaluate a binary classification model?
- What are your favourite use cases of machine learning models?
- If training on all the features in the dataset gives an accuracy of 100%, but the accuracy on the validation set is 75%, what should you look out for?
- What are the three components of time series data?
- How will you make sure that your model is not undergoing any type of overfitting?
- What should be the value of a good variable which we should include in our model?
- What will you do if removing missing values from a dataset causes bias?
- What is bag-of-words?
- Where do you usually source datasets?
- How to identify the important variables for my linear regression model?
- What are the different regression algorithms? Explain the trend in forecasting.
- How can you handle an imbalanced dataset?
- What are the steps for wrangling and cleaning data before applying machine learning algorithms?
- What happens to our linear regression model if the column z in the data is a sum of columns x and y and some random noise?
- A time series regression model got higher accuracy than a decision tree model. Can this happen? Why?
- Is it better to spend five days developing a 90-percent accurate solution or 10 days for 100-percent accuracy?
- While working on a model, which is more important: model accuracy or model performance?
- Is it necessary to perform resampling on your dataset? How would you go about this process?
- Give an example of outlier values. How can they be treated?
- Is having more outlier values a good thing or a bad thing?
- What is tokenization and lemmatization in NLP?
- How much data will you allocate for your training, validation and test sets?
- How will this variation be beneficial?
- Is the mean imputation of missing data acceptable practice? Why or why not?
- What is multivariate normality in assumptions of Linear Regression?
- If two predictors are highly correlated, what is the effect on the coefficients in the logistic regression?
- What could be some issues if the distribution of the test data is significantly different than the distribution of the training data?
- What is the use of setting stringsAsFactors to FALSE?
- State one situation where the set-based solution is advantageous over the cursor-based solution.
- What is multicollinearity?
- When modifying an algorithm, how do you know that your changes are an improvement over not doing anything?
- How would you forecast the number of pizzas that will be sold at Pizza Hut next week?
- We know that one hot encoding increases the dimensionality of a dataset, but label encoding doesn’t. How?
- What are the various aspects of a Machine Learning process?
- How would you perform feature selection on the dataset?
- What is the difference between machine learning and deep learning?
- What is CNN?
- Is it better to have too many false negatives or too many false positives?
- Executing a binary classification tree algorithm is a simple task. But, how does a tree splitting take place?
- How to read PACF graph?
- What cross-validation technique would you use on a time series dataset?
- Can you cite some examples where both false positive and false negatives are equally important?
- How do you come up with an algorithm that will predict what the user needs after they type only a few letters?
- If you’re attempting to predict a customer’s gender, and you only have 100 data points, what problems could arise?
- Give examples where a false negative is more important than a false positive, and vice versa.
- If the model isn’t perfect, how would you select the threshold so that the model outputs 1 or 0 for the label?
- All things you need to know about Tensorflow.
- List out the difference between linear and logistic regression.
- You have built a multiple regression model. Your model R² isn’t as good as you wanted. For improvement, you remove the intercept term, and your model R² becomes 0.8 from 0.3. Is it possible? How?
- Why is mean square error a bad measure of model performance? What would you suggest instead?
- Can we use the logistic regression algorithm for a regression problem?
- Why is “Naive Bayes” naive?
- Give some problems or scenarios where map-reduce concept works well and where it doesn’t work.
- When evaluating a linear regression model, which evaluation metrics would be best when we have redundant variables in our dataset?
- What is maximum likelihood estimation?
- Could there be any case where it doesn’t exist?
- Find out the optimal speed for higher fuel efficiency.
- How do you predict “y” at time t+1?
- What is the role of trial and error in data analysis?
- You notice a system uses a lot of triggers to enforce foreign key constraints, and the triggers are error-prone and difficult to debug. What changes can you recommend to reduce the use of triggers?
- What is the difference between Boosting and Bagging?
- Would treating a categorical variable as a continuous variable result in a better predictive model? How?
- Explain the Long Short-Term Memory (LSTM) algorithm in brief, with a code snippet in R or Python.
- What is the main difference between K-Means and K-Means++ algorithm?
- Whose value of the metric D is more reliable?
- What is the difference between cyclicity and seasonality?
- What is the acceptable value range for p, d, and q in ARIMA?
- What is Homoscedasticity in assumptions of Linear Regression?
- Why Data Scientist?
- Differentiate between classification and regression in Machine Learning.
- What are the types of Machine Learning?
- Why should the learning rate be small in gradient descent?
- How do you identify whether given data is structured or unstructured?
- How would you validate a model you created to generate a predictive model of a quantitative outcome variable using multiple regression?
- How would you optimize a web crawler to run much faster, extract better information, and better summarize data to produce cleaner databases?
- For tuning hyperparameters of your machine learning model, what will be the ideal seed?
- While working on a data set, how do you select important variables?

UNANSWERED

- How to measure model performance?
- What is the effect on the coefficients of logistic regression if two predictors are highly correlated?
- What are the last machine learning papers you’ve read?
- What is the difference between Stochastic gradient descent, batch gradient descent and mini batch gradient descent?
- How would you develop a model to identify plagiarism?
- Suggest some ways through which you can detect anomalies in a given dataset.
- What do you understand by information gain?
- How will you deal with variables?
- What is the difference between supervised and unsupervised machine learning?
- How do you think TVF makes a profit?
- Is it possible to capture the correlation between continuous and categorical variables?
- Is random weight assignment better than assigning same weights to the units in the hidden layer?
- You are given a data set. The data set contains many variables, some of which are highly correlated and you know about it. Your manager has asked you to run PCA. Would you remove correlated variables first? Why?
- What is the confusion matrix? Explain it for a 2-class problem
- How do we decide whether standardization or scaling of data is better without using cross-validation techniques? Will it depend on the algorithm we are using (distance-based or not), or do we need to dig deeper into the data itself?
- What could be some issues if the distribution of the test data is significantly different than the distribution of the training data?
- What is a Random Forest?
- Why are long-tailed distributions important in classification and regression problems?
- What is inductive machine learning?
- How do you know that one algorithm is better than the other?
- How would you assess the validity of a result?
- How do you verify whether a trigger has fired, and can you invoke a trigger on demand?
- How can you speed up the model’s classification/prediction time?
- What method would you choose and why?
- How can you prove an improvement you introduced to a model is actually working?
- Why is an instance-based learning algorithm sometimes referred to as a lazy learning algorithm? Explain.
- What features would you use to build a recommendation algorithm for users?
- Is it easy to parallelize training of a random forest model?
- Provide a simple example of how an experimental design can help answer a question about behaviour.
- How will you prepare a model to automatically delete the spam emails in your inbox?
- How do you test whether a new credit risk scoring model works?
- Which model should I choose for production and why?
- How to avoid overfitting?
- What is a Markov network?
- Is it possible to model transitioning data like time-zone or a working directory?

- How can you design a product recommendation system based on taxonomy?
- While working on a data set, how do you select important variables?
- Can we formulate the search problem as a classification problem?
- What methods are used for missing value treatment?
- What risks and pitfalls can compromise your data during transmission and loading?
- Given an existing set of purchases, how do you predict the purchase of the next few items?
- What are the steps to build and evaluate a linear regression model in R?
- What are the areas in robotics and information processing where the sequential prediction problem arises?
- Why is it important for the time series to be stationary before the analysis of the same?
- What is the role of time series algorithms in Data Science?
- What could be some issues if the distribution of the test data is significantly different than the distribution of the training data?
- How to assess a good logistic model?
- What are the differences between L1 and L2 regularization?
- What are some of the steps for data wrangling and data cleaning?
- Given training data on tweets and their retweets, how would you predict the number of retweets?
- When is Ridge regression favourable over Lasso regression?
- When using the Gaussian mixture model, how do you know it’s applicable?
- Your machine has memory constraints. What would you do?
- What are all different types of collation sensitivity?
- What are the two methods used for the calibration in Supervised Learning?
- How can data management procedures such as missing data handling make selection bias worse?
- What do you understand by Type I and Type II errors?
- What is sample complexity? What are its types? Explain in detail.
- What cross-validation technique would you use on the time series data set?
- You tried a time series regression model and got higher accuracy than decision tree model. Can this happen? Why?
- Why do we need to make the dataset stationary in time series analysis?
- How do you come up with an algorithm that will predict what she/he needs after the user types only a few letters?
- Why does data cleaning play a vital role in analysis?
- In the document-term matrix passed to LDA topic modeling, what do the columns refer to?
- Is rotation necessary in PCA? If yes, Why? What will happen if you don’t rotate the components?
- Explain the steps you would take to calculate the prediction accuracy of a given logistic regression model.
- How does a neural network with one layer and one input and output compare to a logistic regression?
- Explain what resampling methods are and why they are useful. Also explain their limitations.
- What is the difference between SGD, GD, and mini-batch GD?
- What is the effect on the coefficients of logistic regression if two predictors are highly correlated?
- What is collinearity and what to do with it? How to remove multicollinearity?
- Can a stored procedure call itself, that is, can it be a recursive stored procedure?
- Can you explain the Naive Bayes Fundamentals?
- How can you choose a classifier based on the training set data size?
- If two predictors are highly correlated, what is the effect on the coefficients in the logistic regression?
- How can you assess a good logistic model?
- Can TF-IDF be used in Sentiment Analysis by creating our own custom Intensity analyzer?
- How to interpret the ACF graph for an ARIMA model?
- Where do you usually source datasets?
- Design a recommendation engine from end to end from a dataset to deployment in production.
- What will you prefer more for your model- Model Accuracy or Model Performance?
- How to improve a Naive Bayes algorithm for spam detection?
- What do you mean by the geometric interpretation of regression?
- How will you tune hyperparameters in your model?
- In an ARIMA(1,0,0) model, i.e. an autoregressive model, it is said that the coefficient must be less than 1 and that only then is the model possible. Why?
- How are confidence intervals constructed and how will you interpret them?
- Would treating a categorical variable as a continuous variable result in a better predictive model? How?
- How to read a PACF graph for a time series?

- When modifying an algorithm, how do you know that your changes are an improvement over not doing anything?
- What cross-validation technique would you use on a time series dataset?
- Which ML algorithm to use?
- Find the null hypothesis in this case?
- Why are tree-based algorithms less likely to be affected by label encoding?
- Assuming a clustering model’s labels are known, how do you evaluate the performance of the model?
- What are the three stages to build the hypotheses or models in machine learning?
- Is it beneficial to perform dimensionality reduction before fitting an SVM?

