What is the difference between forecasting and prediction?

Forecasting and Prediction are two different things.

You always forecast weather, but never predict the weather.

Forecasting is nothing but an extrapolation of the past. You have some historic data, you plot it on a coordinate axis, and you extrapolate it into the future. This is forecasting.
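For instance, here is a minimal sketch of that idea in Python (the sales numbers are made up for illustration): fit a trend line to the historic points and extend it into the future.

import numpy as np

# hypothetical historic data: monthly sales for the last 6 months
months = np.array([1, 2, 3, 4, 5, 6])
sales = np.array([100, 110, 123, 129, 141, 152])

# fit a straight line (degree-1 polynomial) to the past...
slope, intercept = np.polyfit(months, sales, 1)

# ...and extrapolate it to months 7 and 8; this extension of the past is a forecast
for m in [7, 8]:
    print(f"Month {m}: forecast = {slope * m + intercept:.1f}")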

Prediction is subjective; it refers to an upcoming event that may or may not happen in the future.

Remember – anyone can make a prediction.

Example: Sachin will hit a century tomorrow. This prediction might be based on deep analysis or could just be a shot in the dark.

Therefore, every forecast is a prediction, but not every prediction is a forecast.

Exponential Smoothing Forecasting in simple terms
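If you want a quick feel for it before reading that article, here is a minimal sketch of simple exponential smoothing in Python (the demand numbers and the smoothing factor alpha are made up for illustration):

# Simple exponential smoothing: each smoothed value is a weighted average of
# the latest observation and the previous smoothed value.
def exponential_smoothing(series, alpha=0.3):
    smoothed = [series[0]]                      # start from the first observation
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

demand = [20, 22, 19, 25, 27, 24, 30]           # hypothetical demand history
fitted = exponential_smoothing(demand, alpha=0.3)
print(fitted[-1])                               # the last smoothed value serves as the one-step-ahead forecast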

Keep Learning 🙂

The Data Monk

How would you explain the concept of p-value to a layman?

If you are into Data Science, then you must have heard about p-value.
I could have started it with a very superficial definition strolling around probability and significance and null hypothesis, etc. But that’s already there on multiple blogs.
We want to simplify this term in order to make you “understand” rather than remember things.
We will start with the null hypothesis. What is a null hypothesis?

So, Nitin was the monitor of Class VIII B. He had one job: to write down the names of classmates who made noise in the teacher's absence.

One day he wrote Tahseen's name on the blackboard. The teacher asked Tahseen whether he had been making any noise.
As usual, Tahseen denied it. Now the teacher had to believe either the monitor or Tahseen.

He assumed that Tahseen did not make the noise. Why? Because it is easier to disprove this.

See, it is always easier to disprove something with an example than to prove it. For example, if the teacher catches Tahseen making noise, then the null hypothesis, i.e. "Tahseen did not make the noise", is disproved.

But if we take the null hypothesis as "Tahseen made noise" and we do not catch him making noise on a given occasion, that does not mean the null hypothesis is proved.

Coming back to the question
The teacher had this null hypothesis – Tahseen did not make the noise
Alternative hypothesis – Tahseen made the noise

The next day Nitin again complained that Tahseen was making noise, which Tahseen again denied.

His name was written on the blackboard on the next three days as well. Now the teacher has reached a threshold where he can say with confidence, "Dude, you were making noise: the number of complaints has crossed the benchmark, and it is statistically significant enough to prove that my null hypothesis was wrong. Thank you, Nitin :)"

The benchmark the teacher set before starting the experiment is the significance level; the p-value measures how unlikely the observed evidence would be if the null hypothesis were true, and we call the result statistically significant (and reject the null hypothesis) when the p-value falls below that benchmark.

In general, a p-value < 0.05 is treated as statistically significant, which means we can reject the null hypothesis with 95% confidence.
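To attach a number to the story, here is a rough sketch in Python; the figures are assumptions made up for illustration: suppose that if the null hypothesis were true, a student would get reported on any given day with only a 10% probability, and Tahseen was reported on 5 out of 5 days.

from math import comb

p_null = 0.10            # assumed chance of being reported on a day if Tahseen were really quiet
n_days, reported = 5, 5  # he was reported on all 5 days

# p-value: probability of seeing at least this many reports purely by chance under the null
p_value = sum(comb(n_days, k) * p_null**k * (1 - p_null)**(n_days - k)
              for k in range(reported, n_days + 1))

print(p_value)           # 0.00001, far below the 0.05 benchmark, so the teacher rejects the null hypothesis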

I have appeared in a ton of interviews, and it is very hard to dodge this question.

 100 Questions to Master Forecasting in R: Learn Linear Regression, ARIMA, and ARIMAX
What do they ask in top Data Science Interviews: 5 Complete Data Science Real Interviews Q and A
 What do they ask in Top Data Science Interview Part 2: Amazon, Accenture, Sapient, Deloitte, and BookMyShow

Keep learning 🙂
The Data Monk

Confusion Matrix in Data Science, meaning and example

What is Confusion Matrix?
The Confusion Matrix is a performance-measuring technique for an ML classification model.

Why do we need Confusion Matrix? Is measuring accuracy not enough?
The Confusion Matrix reveals the actual performance of your model. For example, suppose I want to classify whether a person is suffering from a very rare disease (1 in 100,000). Even if I build a useless model that labels everyone as 'not suffering' from the disease, the accuracy of the model will still be above 99%. But that model is of no use because it is unable to solve the classification problem.
Here comes the confusion matrix which is a 2×2 matrix of predicted and actual values

Here the columns denote the Predicted values and the rows denote the Actual values.

Take example of a fire alarm
True Positive – The model predicts a fire in the building and there actually is a fire in the building. That's fine
False Negative – There is an actual fire in the building but the model says there is none, so the alarm never rings. This is catastrophic. The same goes for the disease example: the person is infected but the model is unable to identify it. This is a Type II error
False Positive – The building is not on fire but the model says it is. This is still acceptable; we can live with it. Example: the person is not infected by the virus but the model says they are; they will go for a few check-ups and confirm they are safe :). This is a Type I error
True Negative – No fire, no alarm – all chill

Accuracy = (TP+TN)/(TP+TN+FP+FN)
Precision = TP/(TP+FP) i.e. the accuracy of the positive predictions
Recall (Sensitivity) = TP/(TP+FN) i.e. coverage of actual positive results
Specificity = TN/(TN+FP) i.e. coverage of actual negative results
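A small sketch of these formulas in Python, using made-up fire-alarm labels (1 = fire, 0 = no fire); the sklearn call is a convenience, the arithmetic is the point:

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 0, 1, 0, 0]   # actual: was there a fire?
y_pred = [1, 0, 0, 1, 0, 1, 0, 1, 0, 0]   # model's prediction

# for binary 0/1 labels, sklearn returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy    = (tp + tn) / (tp + tn + fp + fn)
precision   = tp / (tp + fp)
recall      = tp / (tp + fn)        # sensitivity
specificity = tn / (tn + fp)

print(accuracy, precision, recall, specificity)   # 0.8 0.75 0.75 0.833...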

When is precision more important than recall?
Suppose there is a zombie apocalypse and you are deciding whom to let into the safe zone. You want to let in as many normal (uninfected) people as possible, but even a single infected person getting in is dangerous, so you look for high precision, i.e. as few false positives as possible.

What do they ask in top Data Science Interviews: 5 Complete Data Science Real Interviews Q and A
What do they ask in Top Data Science Interview Part 2: Amazon, Accenture, Sapient, Deloitte, and BookMyShow

Keep Learning 🙂

The Data Monk

Data Science vs Big Data vs Data Analytics vs Business Analyst

We often come across a few terms that sound almost the same but are poles apart. The same goes for Data Science, Big Data, Data Analytics, and Business Analyst. So if you are confused about the role an employer is offering you, then this article is for you.

Data Science vs Big Data vs Data Analytics vs Business Analyst

Data Science deals with a lot of mathematics. This domain makes sure that you are sound in statistics and model implementation.

Requirement – Good in mathematics, completely hands-on with Python/R, expertise in at least a couple of algorithms (predictive modelling, NLP, clustering, neural networks, etc.). A degree in Mathematics/Statistics definitely helps. One of the best Data Scientists in India, Rohan Rao, has done his masters in Statistics from IIT Bombay

Things to do to become a Data Scientist – Concentrate on algorithms and hackathons. Make your own winning combination and don't forget to use XGBoost 😛

Big Data Specialist – Big Data is a humongous amount of data stored in one place. A Big Data Specialist knows which technologies would collapse under future load and builds scalable infrastructure to cater to high volumes of data

Requirement – A Big Data Specialist should have a good amount of experience handling multi-TB-per-day data. This definitely comes with experience and you can't learn it in a classroom course. A Big Data Specialist should have sound knowledge of building data pipelines, deploying the algorithms/solutions curated by Data Scientists, and making the life of the Data Analyst easier 😛

Data Analyst – A Data Analyst works on providing valuable insights to the business. SQL is the bread and butter of a Data Analyst. He is responsible for writing optimised and efficient queries to cater to business requests.

A Data Analyst should have a decent knowledge of Data Science algorithms, which helps in understanding the data and providing meaningful insights. A little mathematics never hurts.

Requirement – SQL, Python/R, PowerBI/Tableau,Statistics

Business Analyst – The job of a Business Analyst is to consume the solutions provided by the DA, DS, and Big Data Specialist. He should have a decent knowledge of SQL and MS Excel in order to churn the numbers. Above all, he should be able to consume the insights and make decisions based on the data

Requirement – A knack for solving complex business problems, SQL, MS Excel, and good communication skills

Salary wise (Person with 3 years of experience)

Data Science > Big Data Specialist > Data Analyst ~ Business Analyst
(This is too subjective and highly debatable)


Do look into the Job Description and Profile offered before going for an interview 🙂

Keep Learning 🙂
The Data Monk

Machine Learning using SQL – Day 6/100

The below article is the intellectual property of Ashish Kohli. This is one of those articles that really showcase the power of SQL. Give it a read, guys.

Yes, you read that one right! One of the most fundamental machine learning algorithms out there is Linear Regression. In simple words, it is a technique to describe a relationship between a response(a.k.a dependent) variable and one or more explanatory (a.k.a independent) variable(s). After doing some reading into the “math” behind these algorithms, I realized that this can be easily achieved in SQL.

I don't intend to turn this post into another "Machine Learning 101". There are plenty of such articles out there that explain what Linear Regression is in a much better way, including some nitty-gritty details like its back-end optimization algorithm, i.e. Gradient Descent. I will thus try to keep this article as un-"ML"-ly as possible. However, I'd recommend this 9 byte-sized (typo intended!) article series that explains all this and some more in really easy language. Also, for most of this post, I will be referring to the formulas and notations used in the hyperlinked article.

Alright! At this point, I hope that you already know about the concepts of Linear Regression and how Gradient Descent works. And thus you’ll also know that the relationship between response and explanatory variable(s) is explained by the following equation :
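For a single explanatory variable X, the hypothesis function used throughout this post (and in the SQL below) is:

Hθ(x) = θ0 + θ1 * x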

The goal of Linear Regression is to find the optimal value of θ (theta) that best describes the relationship between two variables and Gradient Descent is the way to do that. The equation given below summarizes how Gradient Descent arrives at the optimal value of θ(s).
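In plain text, with learning rate α and m training rows, the simultaneous updates it performs are (this is exactly what the SQL later in the post computes):

θ0 := θ0 - α * (1/m) * Σ (Hθ(x_i) - y_i)
θ1 := θ1 - α * (1/m) * Σ ((Hθ(x_i) - y_i) * x_i)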

Let’s list down the activities that are needed to be performed to arrive at optimal values of θ:

  1. Start with random values of θ and calculate the value of hypothesis function (Hθ)
  2. Fill in the values of θ and Hθ in the convergence equation to get new values of θ
  3. Keep repeating Step 2 until the values of θ don’t change anymore
  4. These values of θ correspond to the minimum cost (or error) for the Hypothesis function
  5. Fill in the final values of θ in the hypothesis function to get Predicted values of the response variable
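Before translating these steps into SQL, here is a minimal Python sketch of the same loop (the data, learning rate, and tolerance below are made-up illustrations, not part of the original article); it can be handy for sanity-checking the SQL results:

def gradient_descent(xs, ys, alpha=0.1, tol=0.001, max_iter=10000):
    theta0, theta1 = 0.0, 1.0                    # same starting guesses as the SQL below
    m = len(xs)
    for _ in range(max_iter):
        errors = [(theta0 + theta1 * x) - y for x, y in zip(xs, ys)]
        new_theta0 = theta0 - alpha * sum(errors) / m
        new_theta1 = theta1 - alpha * sum(e * x for e, x in zip(errors, xs)) / m
        converged = abs(new_theta0 - theta0) <= tol and abs(new_theta1 - theta1) <= tol
        theta0, theta1 = new_theta0, new_theta1
        if converged:
            break                                # Step 3: thetas have stopped changing
    return theta0, theta1

# hypothetical data roughly following Y = 2*X + 1
xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 9.0, 11.1]
print(gradient_descent(xs, ys))                  # ends up near (1.0, 2.0)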

Step 1: Declaring & Initializing variables

We need 6 variables, each for a different purpose:

  1. theta0, theta1 to store the current value of θ0, θ1
  2. theta0_t, theta1_t to store temporary values of θ0 & θ1 before updating the original ones
  3. var (short for variability) to check if the updated value of θ is approaching “near” the current value or not
  4. alpha to store learning rate (read this answer at Quora to understand all about learning rate)
DECLARE @theta0 DECIMAL(10,4);   --thetas hold fractional values, so DECIMAL rather than int
DECLARE @theta1 DECIMAL(10,4);
DECLARE @theta0_t DECIMAL(10,4);
DECLARE @theta1_t DECIMAL(10,4);
DECLARE @var DECIMAL(4,3);
DECLARE @alpha DECIMAL(4,2);
--Initial values
SET @theta0 = 0;
SET @theta1 = 1;
SET @theta0_t = 1;
SET @theta1_t = 0;
SET @alpha = 0.1;
SET @var = 0.01;

Step 2: Calculating values of Hθ and updated values of θ

--Calculating theta0
SELECT
@theta0_t = @theta0 - (SUM(Outp)/(SELECT COUNT(*) FROM base))*@alpha
FROM
(
SELECT
(@theta1*X + @theta0) - Y as Outp
FROM base
) t;   --the derived table needs an alias

--Calculating theta1
SELECT
@theta1_t = @theta1 - (SUM(Outp)/(SELECT COUNT(*) FROM base))*@alpha
FROM
(
SELECT
((@theta1*X + @theta0) - Y)*X as Outp
FROM base
) t;

Step 3: Comparing if the updated values of θ are close to original θ or not

--Comparing thetas
IF (@theta0_t BETWEEN @theta0-@var AND @theta0+@var) AND (@theta1_t BETWEEN @theta1-@var AND @theta1+@var)

If the above condition is true, then we stop the process and finalize the values of θ. Otherwise, we keep repeating steps 2 & 3. Thus steps 2 & 3 need to be put inside a loop that runs as long as the updated and current values of θ are different.

--Loop while either theta is still changing by more than @var
WHILE (@theta0_t NOT BETWEEN @theta0-@var AND @theta0+@var) OR (@theta1_t NOT BETWEEN @theta1-@var AND @theta1+@var)
BEGIN
--Calculating theta0
SELECT
@theta0_t = @theta0 - (SUM(Outp)/(SELECT COUNT(*) FROM base))*@alpha
FROM
(
SELECT
(@theta1*X + @theta0) - Y as Outp
FROM base
) t;

	--Calculating theta1
SELECT
@theta1_t = @theta1 - (SUM(Outp)/(SELECT COUNT(*) FROM base))*@alpha
FROM
(
SELECT
((@theta1*X + @theta0) - Y)*X as Outp
FROM base
) t;

--Comparing thetas
IF (@theta0_t BETWEEN @theta0-@var AND @theta0+@var) AND (@theta1_t BETWEEN @theta1-@var AND @theta1+@var)
BEGIN
SELECT @theta0 = @theta0_t;
SELECT @theta1 = @theta1_t;
BREAK;
END
ELSE
BEGIN
SELECT @theta0 = @theta0_t;
SELECT @theta1 = @theta1_t;
END
END

The above loop will arrive at optimal values for θ. This is Gradient Descent in all its glory!

Step 4: Fill in the final values of θ in the hypothesis function to calculate predictions for the response variable

SELECT X,Y,@theta0+@theta1*X AS H_theta
FROM base

And that’s it! We’ve built a machine learning algorithm in SQL with just a few lines of code!

Practical applications & final thoughts

Despite the onset of technological advancements in the field of Data Science, more often than not, every Data Scientist ends up working with legacy systems. In such cases, if the size of the data is huge, it becomes impractical to fetch it out of a legacy system (like SQL Server) into another environment for data science purposes.

Although I initially began this project as a weekend DIY, I feel this has bigger implications. This can be polished and packaged much better to improve its usability. Things like splitting the data into test & train sets and turning this into multivariate linear regression would make this project much more practical. I would also love to hear your thoughts on what can be improved.

Thank you Ashish.

Keep Learning 🙂
The Data Monk

Affine Analytics Interview Questions

Company – Affine Analytics
Location – Bangalore
Position – Senior Business Analyst
Experience – 3+ years

Compensation – Best in the industry

Number of Rounds – 4

I received a call from the Technical HR who scheduled the telephonic round for the next day

Round 1 – Telephonic Round (Mostly SQL and Project)
I was asked to introduce myself and then the discussion went towards my recent project at Mu Sigma. We had a good discussion on Regression Techniques, a bit on statistics.

The project description was followed by a few questions on SQL (the answers to these questions are present in various articles on the website; links are at the end of the interview)

1. What is the order of SQL query execution?
2. You have two tables with one column each. The table A has 5 values and all the values are 1 i.e. 1,1,1,1,1 and Table B has 3 values and all the values are 1 i.e. 1,1,1.

How many rows will be there if we do the following
1. Left Join
2. Right Join
3. Inner Join
Answer:
http://thedatamonk.com/day-4-sql-intermediate-questions/

3. A quick guesstimate on the number of iPhones sold in India per year

Hint in the below link – http://thedatamonk.com/guesstimate-3-what-are-the-number-of-smartphones-sold-in-india-per-year/

4. What is a RANK() function? How is it different from ROW_NUMBER()?

5. How to fetch only even rows from a table?

Link to Question 4 and 5 – http://thedatamonk.com/day-11-sql-tricky-questions/

6. What are the measures of Central Tendency
http://thedatamonk.com/day-14-basic-statistics/

The telephonic round went for around 1 hour:-
Introduction – 10 minutes
Project – 30 minutes
Questions – 20 minutes

I was shortlisted for the further rounds.
Altogether, the face-to-face interviews were divided into 3 rounds
Round 1 – SQL and R/Python
Round 2 – Statistics
Round 3 – Case Study and HR questions

Round 1
There were ~20 questions on SQL and some questions on the language (R/Python).
Below are the questions which I remember:-
1. Optimising a SQL code
2. Doing a sum on a column with Null values
Hint – Use Coalesce
3. How to find count of duplicate rows?
4. Use of Lag function
Link – http://thedatamonk.com/day-5-sql-advance-concepts/
5. Life cycle of a project
6. How to find the second minimum salary?
7. How to get 3 Min salaries?
8. DDL, DML, and DCL commands
http://thedatamonk.com/day-13-sql-theoretical-questions/

There were a few more questions on joins and on optimising inner-query code.
Overall difficulty level- 8/10

There were 5 questions on Python/R –

Loading a csv/text file
Writing code of Linear Regression (As it was mentioned on my resume)
Doing a right join in either of the languages
Removing null value from a column

Round 3 – Statistics

How to calculate IQR?
What is positive skewness and negative skewness?
What are the two types of regression?
What is multiple linear regression?
What is Logistic Regression?
What is p-value and give an example?

These questions were discussed in detail, and I powered the explanations with real-life examples.

http://thedatamonk.com/day-18-statistics-interview-questions/

Bonus tips – Do look for good examples

Round 4 – Case Study and HR Questions

How many laptops are sold in Bangalore in a day?
http://thedatamonk.com/guesstimate-2-how-many-laptops-are-sold-in-bangalore-in-a-day/

Business Case Study – There is a mobile company which is very popular in other Asian countries. The company is planning to open its branch in the most popular mall of Bangalore.
What should be the strategy of the company?
How can you use freely available data to plan the marketing of the campaigns?
How can you use Digital marketing to create campaigns for the company?

These questions were followed by:-
Why do you want to change the company?
How is the work in your current organisation?

I got the confirmation in 2 working days.

This was it.

Amazon Interview Questions
Sapient Interview Questions


The full interview questions from these rounds are present in our book What do they ask in Top Data Science Interview Part 2: Amazon, Accenture, Sapient, Deloitte, and BookMyShow

You can get your hands on our ebooks:

1. The Monk who knew Linear Regression (Python): Understand, Learn and Crack Data Science Interview
2. 100 Python Questions to crack Data Science/Analyst Interview
3. Complete Linear Regression and ARIMA Forecasting project using R
4. 100 Hadoop Questions to crack data science interview: Hadoop Cheat Sheet
5. 100 Questions to Crack Data Science Interview
6. 100 Puzzles and Case Studies To Crack Data Science Interview
7. 100 Questions To Crack Big Data Interview
8. 100 Questions to Learn R in 6 Hours
9. Complete Analytical Project before Data Science interview
10. 112 Questions To Crack Business Analyst Interview Using SQL
11. 100 Questions To Crack Business Analyst Interview
12. A to Z of Machine Learning in 6 hours
13. In 2 Hours Create your first Azure ML in 23 Steps
14. How to Start A Career in Business Analysis
15. Web Analytics – The Way we do it
16. Write better SQL queries + SQL Interview Questions
17. How To Start a Career in Data Science
18. Top Interview Questions And All About Adobe Analytics
19. Business Analyst and MBA Aspirant’s Complete Guide to Case Study – Case Study Cheatsheet
20. 125 Must have Python questions before Data Science interview
21. 100 Questions To Understand Natural Language Processing in Python
22. 100 Questions to master forecasting in R: Learn Linear Regression, ARIMA, and ARIMAX
23. What do they ask in Top Data Science Interviews
24. What do they ask in Top Data Science Interviews: Part 1

Keep Learning 



10 Questions, 10 Minutes – 5/100

1. What if you want to toggle case for a Python string?

We have the swapcase() method from the str class to do just that.

>>> 'AyuShi'.swapcase()
'aYUsHI'

2. Write code to print the string 'I love Python' only up to the letter t.

>>> s = 'I love Python'
>>> i = 0
>>> while s[i] != 't':
...     print(s[i], end='')
...     i += 1
I love Py

3. What is recursion?

When a function makes a call to itself, it is termed recursion. But then, in order for it to avoid forming an infinite loop, we must have a base condition.

Let’s take an example.

>>> def facto(n):
...     if n == 1:
...         return 1
...     return n * facto(n-1)
>>> facto(4)
24

4. What is a function?

When we want to execute a sequence of statements, we can give it a name. Let’s define a function to take two numbers and return the greater number.

>>> def greater(a, b):
...     return a if a > b else b
>>> greater(3, 7)
7

5. Explain Python List Comprehension.

List comprehension in Python is a way to declare a list in one line of code. Let's take a look at one such example.

>>> [i for i in range(1,11,2)]
[1, 3, 5, 7, 9]
>>> [i*2 for i in range(1,11,2)]
[2, 6, 10, 14, 18]

6. How do you get all values from a Python dictionary?

We saw previously, to get all keys from a dictionary, we make a call to the keys() method. Similarly, for values, we use the method values().

 >>> 'd' in {'a':1,'b':2,'c':3,'d':4}.values()  
False
 >>> 4 in {'a':1,'b':2,'c':3,'d':4}.values()  
True

7. What is the difference between remove() function and del statement?

You can use the remove() function to delete a specific object in the list.

If you want to delete an object at a specific location (index) in the list, you can either use del or pop.

Note: You don’t need to import any extra module to use these functions for removing an element from the list.

We cannot use these methods with a tuple because a tuple, unlike a list, is immutable.
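A short illustration of all three on a hypothetical list:

>>> colors = ['red', 'green', 'blue', 'green']
>>> colors.remove('green')    # removes the first matching value
>>> colors
['red', 'blue', 'green']
>>> del colors[0]             # deletes the element at an index
>>> colors
['blue', 'green']
>>> colors.pop(1)             # deletes by index and returns the removed element
'green'
>>> colors
['blue']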


8. How to remove leading whitespace from a string in Python?

To remove leading characters from a string, we can use the lstrip() function. It is a Python string method which takes an optional string of characters as a parameter. If the parameter is provided, it removes those leading characters; otherwise, it removes all the leading whitespace from the string.

string = "  javatpoint "   
string2 = "    javatpoint        "  
print(string)  
print(string2)  
print("After stripping all leading whitespaces:")  
print(string.lstrip())  
print(string2.lstrip())  


9. Why do we use join() function in Python?

A. join() is a string method which returns a string value. It concatenates the elements of an iterable, using the string it is called on as the separator. It provides a flexible way to concatenate strings. See an example below.

str = "Rohan"  
str2 = "ab"  
# Calling function    
str2 = str.join(str2)    
# Displaying result    
print(str2)  
Output:
aRohanb


10. What are the rules for a local and global variable in Python?

A. In Python, variables that are only referenced inside a function are implicitly global. If a variable is assigned a value anywhere within the function's body, it is assumed to be local unless it is explicitly declared as global. So, to modify a global variable from inside a function, we need to declare it with the global keyword. Local variables are accessible within the local body only, while global variables are accessible anywhere in the program, and any function can access and modify their values.
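A quick illustration of the rule (the variable and function names are made up):

>>> count = 0                 # global variable
>>> def read_only():
...     return count          # only referenced, so the global value is used
>>> def increment():
...     global count          # assigned inside the function, so it must be declared global
...     count += 1
>>> read_only()
0
>>> increment()
>>> count
1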

10 Questions, 10 Minutes – 4/100

1.How would you convert a string into an int in Python?

If a string contains only numerical characters, you can convert it into an integer using the int() function.

>>> int('227')
227

Let's check the types:

>>> type('227')
<class 'str'>
>>> type(int('227'))
<class 'int'>

2. What is the difference between UNIQUE and DISTINCT? (90% asked in Advanced SQL Interview Questions)

UNIQUE is a constraint applied at insertion time: if a UNIQUE constraint is defined on a column, the column will not accept duplicate values. DISTINCT is used at retrieval time: it suppresses duplicate rows in the result set, so if two rows are identical only one of them is returned. In Oracle, you can specify DISTINCT or UNIQUE in a SELECT if you want only one copy of each set of duplicate rows (in that context the two keywords are synonymous). Duplicate rows are those with matching values for each expression in the select list.


3. What will be the output of the following query?

Query:
SELECT CASE WHEN NULL = NULL THEN 'Amit' ELSE 'Pradnya' END FROM Table_Name;

In SQL, NULL is not equal to anything, not even itself. So NULL = NULL does not evaluate to true, and the output of the above query is 'Pradnya'.

4. Which are the different set operators in SQL? (100% asked in Advanced SQL Interview Questions)

Set operators are operators that combine the results of two queries into a single result set. One condition must be followed: both queries must return the same number of columns, and the corresponding columns must have the same data types.

Following are Set Operators in SQL:

1. UNION
2. UNION ALL
3. INTERSECT
4. MINUS

5. How to select the first 5 characters of First_name from the Employee table?

Oracle:
Select Substr(First_name,1,5) from Employee;

MS SQL Server:
Select Substring(First_name,1,5) from Employee;

MySQL:
Select Substr(First_name,1,5) from Employee;

6. What will be the output of the following query? Query: Select * from (select 'a' union all select 'b') Q;

In SQL Server it will throw an error because no column name (alias) is specified for the column selected inside the subquery; once the column is aliased, the derived table Q returns two rows.

7. Explain correlated sub-query with an example.

Fetch the employees who have not been assigned to a single department.

SELECT * FROM Employee E WHERE NOT EXISTS
(SELECT Department_no FROM Department D WHERE E.Employee_id = D.Employee_ID);

Execution of query:

Step 1:

Select * from Employee E ;

It will fetch all the employees

Step 2:

For the first record of Employee, the second (inner) query is executed and its output is given to the first (outer) query.

(Select Department_no From Department D where E.Employee_id=D.Employee_ID);

Step 3:
Step 2 is repeated for every record of Employee until all the output has been fetched.

8. What is difference between NVL,NVL2 and Nullif?

1.NVL :

NVL function substitutes a value when a null value is encountered.

2.NVL2 :

NVL2 substitutes a value both when a null value is encountered and when a non-null value is encountered: NVL2(expr1, expr2, expr3) returns expr2 if expr1 is not null and expr3 if it is null.

3.NULLIF:

NULLIF function compares expr1 and expr2. If expr1 and expr2 are equal, the NULLIF function returns NULL. Otherwise, it returns expr1.

9. What is an index? What is the use of an index in SQL?

An index is an optional structure associated with a table which may or may not improve the performance of a query. In simple words, suppose we want to find a topic in a book: we go to the index page of that book and look up the topic we want. In the same way, when an index exists on a table, the values can be searched without a full table scan.

Indexes are used to improve the performance of the query.

10. What is the difference between Having and Where clause?

Where clause is used to fetch data from a database that specifies particular criteria whereas a Having clause is used along with ‘GROUP BY’ to fetch data that meets particular criteria specified by the Aggregate functions. Where clause cannot be used with Aggregate functions, but the Having clause can.

10 Questions, 10 Minutes – SQL/R/Python – 3/100

1.What will the following code output?

>>> word = 'abcdefghij'
>>> word[:3] + word[3:]
'abcdefghij'

The output is ‘abcdefghij’. The first slice gives us ‘abc’, the next gives us ‘defghij’.

2.How will you convert a list into a string?

We will use the join() method for this.


>>> nums=['one','two','three','four','five','six','seven']
>>> s=' '.join(nums)
>>> s

‘one two three four five six seven’

3. How will you remove a duplicate element from a list?

We can turn it into a set to do that.

>>> nums = [1, 2, 1, 3, 4, 2]
>>> set(nums)
{1, 2, 3, 4}
>>> list(set(nums))
[1, 2, 3, 4]

4. Explain the //, %, and ** operators in Python.

The // operator performs floor division. It will return the integer part of the result on division.

>>> 7//2 
3

Normal division would return 3.5 here.

Similarly, ** performs exponentiation. a**b returns the value of a raised to the power b.

>>> 2**10 
1024

Finally, % is for modulus. This gives us the value left after the highest achievable division.

>>> 13%7 
6

5. Explain identity operators in Python.

The operators 'is' and 'is not' tell us if two values have the same identity.

>>> 10 is '10' 
False
 >>> True is not False 
True 

6. What are numbers?

Python provides us with five kinds of data types:

Numbers – Numbers are used to hold numerical values.

>>> a=7.0 

7. What are Strings?

A string is a sequence of characters. We declare it using single or double quotes.

>>> title="Ayushi's Book" 


8. What are Lists?

Lists – A list is an ordered collection of values, and we declare it using square brackets.

>>> colors=['red','green','blue'] 
>>> type(colors)

<class ‘list’>



9. What are Tuples?

Tuples – A tuple, like a list, is an ordered collection of values. The difference, however, is that a tuple is immutable. This means that we cannot change a value in it.

>>> name = ('Ayushi', 'Sharma')
>>> name[0] = 'Avery'
Traceback (most recent call last):
  File "<pyshell#129>", line 1, in <module>
    name[0] = 'Avery'
TypeError: 'tuple' object does not support item assignment


10. What are Dictionaries?

Dictionary – A dictionary is a data structure that holds key-value pairs. We declare it using curly braces.

>>> squares = {1: 1, 2: 4, 3: 9, 4: 16, 5: 25}
>>> type(squares)
<class 'dict'>
>>> type({})
<class 'dict'>

We can also use a dictionary comprehension:

>>> squares = {x: x**2 for x in range(1, 6)}
>>> squares
{1: 1, 2: 4, 3: 9, 4: 16, 5: 25}


10 Questions, 10 Minutes – 2/100

This is something that has been on my mind for a long time. We will be picking 10 questions per day and simplifying them.
We will make sure that the complete article is covered in 10 minutes by the reader. There will be 100 posts in the coming 3 months.

The articles/questions will revolve around SQL, Statistics, Python/R, MS Excel, Statistical Modelling, and case studies.

The questions will be a mix of these topics to help you prepare for interviews

You can also contribute by framing 10 questions and sending them to contact@thedatamonk.com or messaging me on Linkedin.

The questions will be updated late in the night ~1-2 a.m. and will be posted on Linkedin as well.

Let's see how many we can solve in the next 100 posts

1/100 – SQL Questions

1. How to find the minimum salary using subquery?
-SELECT *
FROM employee
WHERE salary = (select MIN(salary) from employee);

2. How to find the second minimum salary?
– SELECT *
FROM employee
WHERE salary = (SELECT MIN(salary) FROM employee
                WHERE salary > (SELECT MIN(salary) FROM employee));

Similarly, find the third minimum salary

– SELECT *
FROM employee
WHERE salary = (SELECT MIN(salary) FROM employee
                WHERE salary > (SELECT MIN(salary) FROM employee
                                WHERE salary > (SELECT MIN(salary) FROM employee)));

3. The above query is too lengthy, write a query to get the third minimum salary with some other method.

– SELECT DISTINCT salary
FROM emp e1 WHERE 3 = (SELECT COUNT(DISTINCT salary) FROM emp e2 WHERE e1.salary >= e2.salary);

4. How to get 3 Min salaries?
-SELECT DISTINCT salary FROM emp a WHERE 3 >= (SELECT COUNT(DISTINCT salary) FROM emp b WHERE a.salary >= b.salary);

5. Some basic SQL SELECT questions (answers as in SQL Server)
– SELECT 125
125
– SELECT 'Ankit'+'1'
Ankit1
– SELECT 'Ankit'+1
Error
– SELECT '2'+2
4
– SELECT SUM('1')
1

6. Write a generic method to fetch the nth highest salary without TOP or Limit

SELECT Salary
FROM Worker W1
WHERE n-1 = (
 SELECT COUNT( DISTINCT ( W2.Salary ) )
 FROM Worker W2
 WHERE W2.Salary >= W1.Salary
 );

7. LAG(): Provides access to a row at a given physical offset that comes before the current row. Use this function in a SELECT statement to compare values in the current row with values in a previous row as
specified by offset. Default offset is 1 if not specified. If Partition By clause is specified then it returns the offset Value in each partition after ordering the partition by Order By Clause.

Basically, lag() is used to create one more column in the table where you can get the previous value of the specified column

Col1   Col2   Lag_Col
a      10     NULL
b      20     10
c      30     20
d      40     30

8. One more example

employee_number   last_name    first_name   salary   dept_id
12009             Sutherland   Barbara      54000    45
34974             Yates        Fred         80000    45
34987             Erickson     Neil         42000    45
45001             Parker       Sally        57500    30
75623             Gates        Steve        65000    30

SELECT dept_id, last_name, salary,
LAG (salary,1) OVER (ORDER BY salary) AS lower_salary
FROM employees;

dept_id   last_name    salary   lower_salary
45        Erickson     42000    NULL
45        Sutherland   54000    42000
30        Parker       57500    54000
30        Gates        65000    57500
45        Yates        80000    65000

9. LEAD() – Provides access to a row at a given physical offset that comes after the current row. Use this function in a SELECT statement to compare values in the current row with values in a subsequent row
as specified by offset. Default offset is 1 if not specified. If Partition By clause is specified then it returns the offset Value in each partition after ordering the partition by Order By Clause

10. Which operator is used for Pattern Matching?

LIKE operator is used for pattern matching. It supports below wildcards.
 % : Matches any string of zero or more characters.
 _ : Matches any single character.
 [] : Matches any single character within the specified range ([a-f]) or set ([abcdef]).
 [^] : Matches any single character not within the specified range ([^a-f]) or set ([^abcdef])


This was the second set of 10 questions. If you want to learn more about the type of questions asked in different Data Science interviews, then do try the books below:-

 What do they ask in top Data Science Interviews: 5 Complete Data Science Real Interviews Q and A

 What do they ask in Top Data Science Interview Part 2: Amazon, Accenture, Sapient, Deloitte, and BookMyShow

Keep Learning 🙂

The Data Monk