Practice Numpy

Numpy is one of the basic and most used packages in Python. It is mostly used for scientific computing.

Most important Data Type in Numpy is Array.

import numpy as np

And you have installed the numpy package in your system. Now, check the important and basic functions present in Numpy package to play around your dataset.

1. Import numpy
import numpy as n

2. Create an array in python

arr = np.array([0,1,2,3,45,6])
print(arr)
[ 0 1 2 3 45 6]

3. Create a two dimensional array

arr2 = np.array([[1,2,3],[5,6,7]])
print(arr2.shape)
(2, 3)

4. Create an array of all ones of 3 rows and 4 columns

one = np.ones((3,4))
print(one)
[[1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]]

5. Create an array with all zeros in 3 rows and and 4 columns

zero = np.zeros((3,4))
print(zero)
[[0. 0. 0. 0.]  [0. 0. 0. 0.]  [0. 0. 0. 0.]]

6. Create an array with random values for 3 rows and 4 columns

rand = np.random.random((3,4))
print(rand)
[[0.15340682 0.57965139 0.15538918 0.35086011]
 [0.35453516 0.96230823 0.49873964 0.022157  ]
 [0.35144323 0.42548206 0.96434164 0.67721939]]

7. Statistical functions available in Numpy
a. Min
b. Max
c. Mean
d. Median
e. Standard Deviation

First of all, we will be creating a series of random numbers using np.random.normal function which gives you the flexibility to choose the range in which you want to have your random numbers, irony is that you can restrict the randomness of your random number 😛

import numpy as np
stat = np.random.normal(8,1.5,10)
print(stat)

[8.84371443 7.7014931  6.72789277 5.73700496 8.46005265 9.08922936  8.71789028 6.84561345 8.20465228 7.26850749]

There are three parameters in the function separated by a comma.
8 is the base value
1.5 is the range you which you are providing to the function i.e. the values will vary in the range 6.5 to 9.5
10 is the number of random values

8. Now let’s get the min, max, mean, standard deviation from “stat”

print(min(stat))
5.737004956818833

print(max(stat))
9.089229356076414

print(np.mean(stat))
7.759605075083963

print(np.std(stat))
1.0385313768435045

9. Matrix multiplication in numpy

x = [[1,2],[4,5]]
y = [[6,7],[8,9]]
z = np.matmul(x,y)
print(z)

[[22 25]  [64 73]]



EDA in Python

The complete Machine Learning journey can be penned down in 4 steps:-

1. Exploratory Data Analysis This is the first thing you do when you get a dataset. Before jumping on to building models, you need to first understand the nature of the data. EDA helps in making it easier for the audience to get along with data.

EDA includes visualizing the raw data, looking for correlation in the dataset and finding missing values in the data set. In short you have to plot a hell lot of graphs to understand the dataset.

2. Cleaning the data – You will spend more than 50 percent of your time in cleaning the data and doing missing value treatment. Cleaning is important because the accuracy of your model will depend on the number of proper data points.

3. Building models – We are talking about Machine Learning algorithms, so, once you have the clean data, you need to build models, visualize results, check the results and improve the success metric of the model

4. Result Presentation – You have the results of the model. This result is of no use until and unless it is consumed by the audience. You will again need the power of visualizations to prove the result of your analysis

We will take us a data set and will build some graphs from scratch.
I will be using the Titanic Data set because of the following reasons:-
1. Firstly, It is freely and easily available at this link – https://www.kaggle.com/c/titanic/data
2. It’s clean and easy to understand

Gist of the dataset – Titanic train dataset contains the various information(like. age, sex, family size of the passenger, etc.) about those who survived and who could not survive

  1. PassengerId: Id of every passenger.
  2. Survived: This feature have value 0 and 1. 0 for not survived and 1 for survived.
  3. Pclass: There are 3 classes of passengers. Class1, Class2 and Class3.
  4. Name: Name of passenger.
  5. Sex: Gender of passenger.
  6. Age: Age of passenger.
  7. SibSp: Indication that passenger have siblings and spouse.
  8. Parch: Whether a passenger is alone or have family.
  9. Ticket: Ticket no of passenger.
  10. Fare: Indicating the fare.
  11. Cabin: The cabin of passenger.
  12. Embarked: The embarked category.
  13. Initial: Initial name of passenger.

Import the following libraries:

import matplotlib.pyplot as plt
import seaborn as sb
import pandas as pd

We will directly start with plotting some insightful graphs to get a gist of the problem statement. The codes are self explanatory but We will try to provide as much explanation as possible

We will start with importing the test and train files.

Titanic_Train = pd.read_csv('/Users/nitinkamal/Downloads/titanic/train.csv')
Titanic_Test = pd.read_csv('/Users/nitinkamal/Downloads/titanic/test.csv')

Now we will start plotting on these data points

1. Let’s see how many people actually survived in the training

import seaborn as sb
sb.countplot('Survived',data=Titanic_Train)
Count of number of people surviving the Titanic

2. Using SibSp to get the number of people who survived the incident

Titanic_Train[['SibSp','Survived']].groupby(['SibSp']).mean().plot.bar()
sb.countplot('SibSp',hue='Survived',data=Titanic_Train,)
plt.show()

3. Check the number of survivors and non-survivors on the basis of gender

import seaborn as sb
sb.catplot(x='Sex', col='Survived', kind='count', data=Titanic_Train)

4. Let’s check the survival on Embarkment i.e. S,C and Q

sb.catplot(x='Survived', col='Embarked', kind='count', data=Titanic_Train);

5. Cross tab covers three dimensional information
e.g. Survivor on the basis of gender and class

pd.crosstab([Titanic_Train.Sex, Titanic_Train.Survived], Titanic_Train.Pclass, margins=True).style.background_gradient(cmap='coolwarm')

6. Getting the survived data according to Passenger Class

pd.crosstab(Titanic_Train.Pclass, Titanic_Train.Survived, margins=True).style.background_gradient(cmap='autumn_r')

Start with Python

Python is one of the most preferred language for Data Science. It is needless to discuss the pros and cons of Python over any other language(R/SAS/Java/C).

If you are already comfortable with any other language then it’s good, but if you are still exploring and are ready to pick a new language, then Python it is.

At the time of writing this blog, two versions of Python are popular
Python 2.7
Python 3.*

Let’s start with installing the snake.

1. All you need is – Anaconda
Once you have it in your system, install it and you will be able to execute python codes. I would recommend using Jupyter notebook, this was you will be able to document your work and analysis.

How to launch Jupyter Notebook?
Once you have installed Anaconda, you will get an Anaconda Navigator in your start menu or on your desktop.
Double click to open it.

This is how Anaconda Navigator will look like. Click on the Launch button below the Jupyter Notebook

The Jupyter notebook will look something like the one below:

Click on the new button and select Python 3(if Python 3 has been installed)

Running your first Python program

What do you want Python to print?
“Hello World” ??

Yeah !! That’s how we start

Following libraries are the bread and butter of a Data Scientist:-
1. Matplot – To visualize your data
2. Numpy – In eases the complex mathematical operations
3. Pandas – It provides data structure of a high level and a very good variety of tools for analysis
4. SciPy – For Machine Learning
5. Pytorch – To provide tensor computation
6. Keras – Talks about Neural Network in detail
7. Scikit- It holds a lot of unsupervised learning algorithms

To start with, we will use Matplot, Numpy, Pandas and Scikit.

Though I recommend solving at least 25 questions from any of the below websites to make you comfortable with functions, but you can chuck this part and jump directly on the Data Analysis part
1. Codechef
2. Spoj
3. HackerRank

If you want to get a head start in Python, then you can go through either w3school or tutorials point first. But, in case you want to learn it in a more practical way, you can stick to this website.

In the coming few days, we will quickly jump from basic data types and structures, and will start exploring few algorithms on really small dataset. The reason why We will be using small datasets is that you can actually visualize the effect of each algorithm.

We will then move to solving a couple of Hackathons.

See, it’s very irritating to write or explain the different data types of a new language. We will try to keep it simple and crisp, once you start using these is when you will understand it better.

Data Types:-
1. Number – Daaahh !! It’s just numbers(int, long, float and complex)
2. String – Same old story. Remember, index starts with 0
3. List –
-A list can contain anything
-It also starts with index 0
-It is MUTABLE
– Square bracket

list1 = ['alpha', 'beta', 'Shaktiman', 8382]
list2 = ["d", "a", "t", "a"]

4. Tuples –
-Same as List
-Tuple is IMMUTABLE
-Starts with index 0
– Round bracket

tup1 = ('alpha', 'beta', 'Shaktiman', 8382)
tup2 = ("d", "a", "t", "a")

5. Dictionary
– Contains key and values where key is the identifier and value is the value. Example – Abba:6132, Dabba:6292,Jabba:6002
-Curly braces

Student = {
  "Abba": 6132,
  "Dabba":6292,
  "Jabba":6002
}

Dekho, itna ho gya..Baaki jab time aaeyga toh khud sheekh jaaoge..Haan agar aur kuch specific malum krna ho toh comment daal dena

Will learn python in a question answer way in the next article

Keep Learning 🙂

The Data Monk

Less asked SQL questions

We have already covered queries on joins, aggregate functions, sub-queries, running sum, etc.

This article will concentrate on the theoretical part of SQL which are less asked but are important to know about

1.What are the different types of statements supported in SQL?

There are three types of statements in SQL:-
a. DDL – Data Definition Language
b. DML – Data Manipulation Language
c. DCL – Data Control Language

2. What is DDL?

It is used to define the database structure such as tables. It includes 3 commands:-
a. Create – Create is for creating tables 
CREATE TABLE table_name (
    column1 datatype,
    column2 datatype,
    column3 datatype,
   ….
); 
b. Alter – Alter table is used to modifying the existing table object in the database.
ALTER TABLE table_name
ADD column_name datatype
c. Drop – If you drop a table, all the rows in the table is deleted and the table structure is removed from the database. Once the table is dropped, you can’t get it back

3. What is DML?

Data Manipulation Language is used to manipulate the data in records. Commonly used DML commands are Update, Insert and Delete. Sometimes SELECT command is also referred as a Data Manipulation Language.

4. What is DCL?

Data Control Language is used to control the privileges by granting and revoking database access permission

5. What is Normalization? How many Normalization forms are there?

Normalization is used to organize the data in such manner that data redundancy will never occur in the database and avoid insert, update and delete anomalies.
There are 5 forms of Normalization
First Normal Form (1NF): It removes all duplicate columns from the table. Creates a table for related data and identifies unique column values
Second Normal Form (2NF): Follows 1NF and creates and places data subsets in an individual table and defines the relationship between tables using a primary key
Third Normal Form (3NF): Follows 2NF and removes those columns which are not related through primary key
Fourth Normal Form (4NF): Follows 3NF and do not define multi-valued dependencies. 4NF also known as BCNF

6. Full Outer Join is a combination of which of the following:-
a. Left Outer and Left Inner Join
b. Left Outer and Right Inner Join
c. Left Outer and Right Outer Join
d. Left Outer and Right Outer Join


A. Full Outer Join is a combination of Left Outer and Right Outer Join in SQL

7. What is unnest in SQL?

In General, unnesting comes into picture when you are dealing with Big Data because there you will have data in different formats. You might have to deal with an array in a particular column.
Array is nothing but a sequence of data points under one heading.

Let’s take an example of creating an array in BigQuery

Select ['Jan','Feb','Mar'] as Month
Select [0,0,1] as Bool
SELECT *
FROM UNNEST(['a', 'b', 'c', 'd', 'e', 'f'])AS alpha
WITH OFFSET AS offset
ORDER BY offset;

The above code will create the following table

alpha offset
a 0
b 1
c 2
d 3
e 4
f 5

8. What is RDBMS?

RDBMS stands for Relational Database Management System. RDBMS store the data into the collection of tables, which is related by common fields between the columns of the table. It also provides relational operators to manipulate the data stored into the tables.

9. What is NoSQL ?
NoSQL is a non-relational DMS, that does not require a fixed schema, avoids joins, and is easy to scale. NoSQL database is used for distributed data stores with humongous data storage needs.

10. What is a unique key?

A Unique key constraint uniquely identified each record in t
A Primary key constraint has automatic unique constraint defined on it. But not, in the case of Unique Key.
There can be many unique constraint defined per table, but only one Primary key constraint defined per table.

11. What is a foreign key?

A foreign key is one table which can be related to the primary key of another table. Relationship needs to be created between two tables by referencing foreign key with the primary key of another table.

11. Table 1 has only one column ‘Col1’ having values 1,2,3,4
Table2 has only one column ‘Col2’ having values 1,1,4,6

Inner Join on the two table will get you how many rows?


Inner Join will get you 3 rows
1 1
1 1
4 4

The duplicate row will repeat

12. Left Outer Join in the same table?

It will get you the following rows given that the query looks like the one below

Select t1.Col1,t2.Col2
From Table1 t1 Left Outer Join Table2 t2 on (t1.Col1 = t2.Col2)

Left Outer Join Output

t1.Col1 t2.Col2
1 1
1 1
2 NULL
3 NULL
4 4

13. Add a NULL in the first table i.e. Table1
Table1[Col1] = 1,2,3,4,NULL
Table2[Col1] = 1,1,4,5

Table1 Left Outer Join Table2


Approach

The NULL will impact the number of rows being created

Output

t1.Col1 t2.Col1
1 1
1 1
2 NULL
3 NULL
4 4
NULL NULL

14. What is normalization?

Normalization is the process of minimizing redundancy and dependency by organizing fields and table of a database. The main aim of Normalization is to add, delete or modify field that can be made in a single table.

15. What is Denormalization.

DeNormalization is a technique used to access the data from higher to lower normal forms of database. It is also process of introducing redundancy into a table by incorporating data from the related tables.

16. What is a stored procedure?

Stored Procedure is a function consists of many SQL statement to access the database system. Several SQL statements are consolidated into a stored procedure and execute them whenever and wherever required.

17. What is a trigger?

A DB trigger is a code or programs that automatically execute with response to some event on a table or view in a database. Mainly, trigger helps to maintain the integrity of the database.

Example: When a new student is added to the student database, new records should be created in the related tables like Exam, Score and Attendance tables.

18. What is the difference between Cluster and Non-Cluster Index?

Clustered index is used for easy retrieval of data from the database by altering the way that the records are stored. Database sorts out rows by the column which is set to be clustered index.

A non-clustered index does not alter the way it was stored but creates a complete separate object within the table. It point back to the original table rows after searching.

19.What is Union, minus and Interact commands?

UNION operator is used to combine the results of two tables, and it eliminates duplicate rows from the tables.

MINUS operator is used to return rows from the first query but not from the second query. Matching records of first and second query and other rows from the first query will be displayed as a result set.

INTERSECT operator is used to return rows returned by both the queries.

20. How can you create an empty table from an existing table?

Select * into Employee from Department where 0=1

This will ONLY copy the structure of the table

We will keep on adding tricky questions. But the last 4 days SQL queries should give you a good base to start with.

Keep Learning 🙂

The Data Monk

SQL advance concepts

Advance concepts mostly involves knowledge of window functions, cleaning the data format, working with different date formats, etc. Before proceeding, you should actually go through different ways of casting data types and other widows functions.


1.Create the cumulative revenue column on a restaurant data where the table contains revenue of each of the restaurants and you are supposed to create another column having running sum

select R1.RestaurantId, R1.Revenue, SUM(R2.Revenue) as Cumulative_Revenue
from Restaurant R1
inner join Restaurant R2 
on R1.RestaurantId >= R2.RestaurantId
group by 1,2
order by 1

2. What is the difference between group by and partition by?
Group by and partition by both are used on the complete dataset and is primarily used to apply functions on a different groups of the table.

On one hand, group by aims at reducing the number of rows of the table, but partition by does not reduce the number of rows.

Group by modifies the query but partition by works with ranking functions like row number(), rank(),dense_rank()

3. What are the different value related window function?

Lead() – The LEAD() window function returns the value for the row after the current row in a partition. If no row exists, null is returned
Lag() – The LAG() window function returns the value for the row before the current row in a partition. If no row exists, null is returned
Last_Value() – The FIRST_VALUE window function returns the value of the specified expression with respect to the first row in the window frame
First_Value() – The LAST_VALUE window function returns the value of the specified expression with respect to the last row in the window frame.

4. Use Lag function in a query to get the Lag column

RestaurantPriceLag
Oliver’s2000Null
PBN30002000
Social40003000
Harry’s50004000
SELECT Restaurant,
Price, 
lag(Price,1) over (Restaurant) as Lag 
from Restaurant

5. Use the Lead function to get a lead column

RestaurantPriceLead
Oliver’s20003000
PBN30004000
Social40005000
Harry’s5000 Null
SELECT Restaurant,
Price, 
lead(Price,1) over (Restaurant) as Lead 
from Restaurant

6. Show the use of First_Value()

RestaurantIdPriceFirst
Oliver’s120002000
PBN130002000
Social240004000
Harry’s250004000
To Be Continued260004000
Red Rhino370007000
Rasta Cafe380007000
Select Restaurant,
Id,
Price,
First_Value(Price) Over(Partition By Id) as First
From Restaurant

7. What are the aggregate window function?

Sum()
Average()
Count()
Min()
Max()
We are not going to discuss the definition or syntax of these functions

8. What are the ranking functions?
Ranking functions are important to keep in your mind. The most used ranking functions are:-

1. RANK()
2. DENSE_RANK()
3. ROW_NUMBER()
4. NTILE()

9. Which of the following group function ignores NULL value?(Tredence)
MAX
COUNT
SUM
All of the above

A. All of the above

10. What is the use of FETCH command?(Flipkart)
The FETCH argument is used to return a set of number of rows. FETCH can’t be used itself, it is used in conjuction with OFFSET.

SELECT column_name(s)
FROM table_name
ORDER BY column_name
OFFSET rows_to_skip
FETCH NEXT number_of_rows ROWS ONLY;

11. How to find count of duplicate rows?(Flipkart)

Select rollno, count (rollno) from Student
Group by rollno
Having count (rollno)>1
Order by count (rollno) desc;

12. You already know how to get the maximum and second maximum from a table. You also have a fair bit of idea to fetch the maximum from each department(as in form a table of employees with their department present in another column). Now fetch the second highest salary from each department

Select Dept, Max(Salary)
From (Select E1.Dept, E1.Salary From Emp E,Select (Dept,Max(Salary) as salary from Emp group by Dept)E1
Where E.Dept = E1.Dept
And (E.Salary < E1.Salary) X
Group By Dept

We will add more questions in the upcoming articles.

Keep Learning 🙂

The Data Monk

SQL Intermediate Questions

Let’s get started with some intermediate level SQL queries. Try to frame an approach before hoping to the solution because once you see the answer you will find it easy to understand.

I think there will be around 15 questions in the article below and do try to score at least 12+ before moving to the next article.

Day 2 – Introduction to SQL
Day 3 – SQL queries

1. You know why so many interviewers ask you if you know how to find the second and third and nth largest salary? It’s because it gives them a glimpse of your query understanding. We have already solved the question in the last day’s article, but I remember when I was interviewing for Myntra they asked me 3 different approaches to solve the same question. I knew that shit, so did good 😛

Get the third largest salary from the employee table(Myntra,Ola)

Schema –

Table Name – Employee
Columns – EmpId, EmpName,Salary

Solution/Approach



Select *
From Employee E1
Where (N-1) =
(Select Count(Distinct(EmpSalary))
From Employee E2
Where E2.EmpSalary > E1.EmpSalary)


The above query first take the distinct Salary’s count whenever the inner query is pulling salary greater than outer query. When you need 3rd largest salary then you need to eliminate the top 2 salaries which is done by the where condition in the inner query.

2. Well, the above query looks a bit difficult to me to understand, can you get me a better query which is easier to understand? (Myntra)

Sure, We can try a simpler and easy way to understand it.

Select Top 1 EmpSalary
From (Select Distinct Top N EmpSalary From Employee Order By EmpSalary    Desc) a
Order by EmpSalary

Here N will be 3, So the inner query will get me top 3 salaries ordered in a descending order and then the outer query will fetch the top result which is the 3rd largest salary.

3. I will give you complete marks if you can tell me one more method to do the same thing(Myntra)

Now I think I will have to make a temp table may be. Let’s see

 With CTE as 
 (Select EmpId,
 EmpName,
 EmpSalary,
 Row_Number() Over(Order By EmpSalary Desc) as Rank
 From Employee)
 
 Select * 
 From CTE
 Where Rank = N

The following questions have been repeated in at least 4 interviews I gave recently(Myntra,Ola,Affine Analytics and Unbxd)

You have two tables with one column each. The table A has 5 values and all the values are 1 i.e. 1,1,1,1,1 and Table B has 3 values and all the values are 1 i.e. 1,1,1.

How many rows will be there if we do the following:-

4. Inner Join

There will be 15 rows because each of the values from both the table will interact
I actually tried this in Python using Data Frame and following is the screenshot

5. Outer Join

Again there will be 15 rows because each values from both Table A and B will have to interact with each other

6. Left Join

Left Join will also be the same i.e. 15 rows. There is a catch in the order of the results though. Suppose we have two columns in each table

Table A
1 q
1 w
1 e
1 r
1 t

Table B
1 a
1 b
1 c

Left Join will give you 15 rows and the result will look like the one below

Doing Left Join Table A on Table B

Output

1 q a
1 q b
1 q c
1 w a
1 w b
1 w c

and so on.

7. Right Join

You must have guessed by now that the number of rows would be  15. The order of result would be

1 a q
1 a w
1 a e
1 a r
1 a t
1 b q

And so on.

Next is the classic question on self join and you gotta see this in your upcoming interviews. These questions were asked in the final round of EXL Services, Bangalore

8. There is a table with employee and manager information, the schema is given below. Write a query to get the EmployeeName and ManagerName. (Ola, Bangalore)

Schema

Table Name – Employee
Column Name – EmpId,EmpName,MangaerId


SELECT e1.Name EmployeeName, e2.name AS ManagerName
FROM Employee e1
INNER JOIN Employee e2
ON e1.ManagerId = e2.EmpId

9. What is one limitation in the above approach?(Ola, Bangalore)

If there is a NULL value in the ManagerId then it will not show up in the above table. To fix this we can have a  Left Join and can fill the NULL value with ‘NA’ or any other word.

SELECT e1.Name EmployeeName, ISNULL(e2.name, 'Top Manager') AS ManagerName
FROM Employee e1
LEFT JOIN Employee e2
ON e1.ManagerID = e2.EmployeeID

10. The Employee table have Department and Salary. I want to know the Maximum salary of each department

Schema –

Employee(Dept,Salary)

Solution


SELECT Dept,Max(Salary)
FROM
(SELECT E1.Dept, E1,Salary
FROM Employee E1,(SELECT Dept,Max(Salary) FROM Employee GROUP BY Dept)E2
WHERE E2.Salary > E1.Salary) E
GROUP BY Dept

 SELECT E1.Dept, E1,Salary
FROM Employee E1,SELECT Dept,Max(Salary) FROM Employee GROUP BY Dept


This will get you all but the maximum salary. So the final group by will be applied to the dataset which do not have the second maximum salary.

11. Use Dense Rank to get the third highest salary(Deloitte)

SELECT T as( 
SELECT EmployeeName,Dept,Salary, DENSE_RANK() Over (Partition By Dept Order By Salary Desc) as Dense_rank
FROM Employee)

SELECT *
FROM T
WHERE Dense_rank = 3

12. What if I want to get the top 5 salary information? What will you change in the above query?(Deloitte)


SELECT *
FROM T
WHERE Dense_rank <=5

13. What is the difference between heap table and temporary table?(Amazon)
Heap tables:
Heap tables are found in memory. They are used for high-speed storage on a temporary basis. They do not allow BLOB or TEXT fields.
Heap tables do not support AUTO_INCREMENT.
Indexes should be NOT NULL.

Temporary tables:
The temporary tables are used to keep the transient data. Sometimes it is beneficial in cases to hold temporary data. The Temporary table is deleted after the current client session terminates.

Main differences:
The heap tables are shared among clients while temporary tables are not shared.
Heap tables are just another storage engine, while for temporary tables you need a special privilege (create temporary table).

14. What is the result of following query?(Accenture)
select case when null=null then ‘Amit’ else ‘Rahul’ end from dual;

The null=null is always false.so the Answer of this query is Rahul.

15. What is the use of NVL function in Oracle?(Accenture)
NVL function is most important function to replace null value with another value.

Example:
select NVL(null,’Sachin’)
from dual;

which will give you output as Sachin.

16. Can we have another column in a table other than a primary key column which will act as a primary key?(Accenture)
Yes we can have another column in the table other than primary key which acts like primary key.But in database design this is not the recommended. One table can have only one primary key. But if you create other column with unique and not null constraint it acts like primary key.

17. Count the total salary department number wise where more than 2 employees exist.(BookMyShow)

SELECT deptno, sum(sal) As totalsal
FROM emp
GROUP BY deptno
HAVING COUNT(empno) > 2

18. How to fetch only common records from two tables emp and emp1?

(Select * from emp) Intersect (Select * from emp1)

I can go on and on with this article, but these are the snippets from my next book, so I need to hold my horses 😛

Will be back with some syntax heavy advance level interview questions

Keep Learning 🙂

The Data Monk

Basic Queries to get you started

I hope you have already read the introduction part where we discussed about the execution flow of all the clauses in SQL.

Today we will discuss around 15 questions which shall give you a good launching pad to directly jump on the complicated problems. You can, but you should not skip this article. Just go through it even if you are the king of SQL because “Why be a king, when you can be the God” (Comment if you know the song :P)

Your time is precious, so is mine. So, let’s start.

P.S. – Try to check the number of questions which you can answer by yourself(including the correct syntax). Check your score and it’s meaning at the end of the article

1. Select the maximum and second maximum salary from the employee table.

SELECT
  (SELECT MAX(Salary) FROM Emp) Max_Salary,
  (SELECT MAX(Salary) FROM Emp
  WHERE Salary NOT IN (SELECT MAX(Salary) FROM Emp)) as Second_max_salary

2. Get all employee detail from EmployeeDetail table whose “FirstName” not start with any single character between ‘a-p’

SELECT * 
FROM EmployeeDetail
WHERE FirstName like ‘[^a-p]%’

3. Write the query to get the branch and branch wise total(sum) admission fees, display it in ascending order according to admission fees.

Select Branch, 
SUM(Admission_Fee) as Total_admission_fees 
from Student
group by Branch 
order by SUM(Admission_fee) ASC

4. Get department name and average salary of all the employees (department wise)

Select Department, 
AVG(Salary) as Average_Salary 
From MuSigma
group by Department 
order by Department

5. Get department wise number of employees in Mu Sigma

Select Department,
COUNT(*)
FROM MuSigma
group by Department
order by Department

6. Will the code below work?

Select Name, Age as Actual_Age, Region
From Candidate
Where Actual_Age>35

The above query will not execute. Where clause should have actual field names

7. Find the 3rd Maximum salary in the employee table

Select distinct sal
from emp e1
where 3 = ((select count(distinct sal) from emp e2 where e1.sal <= e2.sal);

We need a table to understand Row_Number, Rank, etc.
The below table will be used in the next few queries

Emp. IDNameSalary
123Amit50000
453Sumit50000
232Rakshit30000
124Aman40000
543Rahul30000

8. What is ROW_NUMBER() function in SQL?
The ROW_NUMBER() ranking returns a unique and sequential number to
each row of the table without repeating or skipping any number. If there are two rows which are partitioned and holds the same value as Amit and Sumit, then the row number will be given randomly.

9. Write the syntax to create a new column using Row Number over the Salary column

SELECT *, ROW_NUMBER() OVER (Order By Salary) as Row_Num
FROM Employee

Output

Emp. IDNameSalaryRow_Num
232Rakshit300001
543Rahul300002
124Aman400003
123Amit500004
453Sumit500005

10. What is PARTITION BY clause?
PARTITION BY clause is used to create a partition of ranking in a table. If you partition by Salary in the above table, then it will provide a ranking based on each unique salary. Example below:-

SELECT *, ROW_NUMBER() OVER (PARTITION BY Salary ORDER BY Salary) as Row_Num

Output

Emp. IDNameSalaryRow_Num
232Rakshit300001
543Rahul300002
124Aman400001
123Amit500001
453Sumit500002

11. What is a RANK() function? How is it different from ROW_NUMBER()?
– RANK() function gives ranking to a row based on the value on which you want to base your ranking. If there are equal values, then the rank will be repeated and the row following the repeated values will skip as many ranks as there are repeated values row. Confused?? Try out the example below:-

SELECT *, RANK() OVER (ORDER BY Salary) as Row_Num
FROM Employee

Output

Emp. IDNameSalaryRow_Num
232Rakshit300001
543Rahul300001
124Aman400003
123Amit500004
453Sumit500004

12. How to use PARTITION BY clause in RANK() function?
– There is no point using PARTITION BY clause with RANK() function. The ranking result will have no meaning, as the rank will be done according to Salary values per each partition, and the data will be partitioned according to the Salary values. And due to the fact that each partition will have rows with the same Salary values, the rows with the same Salary values in the same partition will be ranked with the value equal to 1. 

SELECT *, RANK() OVER (PARTITION BY Salary ORDER BY Salary) as Row_Num
From Employee
Emp. IDNameSalaryRow_Num
232Rakshit300001
543Rahul300001
124Aman400001
123Amit500001
453Sumit500001

13. What is Dense Ranking?
DENSE_RANK() is similar to the RANK() function but it does not skip any rank, so if there are two equal values then both will be termed as 1, the third value will be termed as 3 and not 2.

SELECT *, 
DENSE_RANK() OVER (PARTITION BY Salary ORDER BY Salary) as Row_Num
FROM Employee

Output

Emp. IDNameSalaryRow_Num
232Rakshit300001
543Rahul300001
124Aman400002
123Amit500003
453Sumit500003
432Nihar600004

14. What is NTILE() function?
NTILE() is similar to percentile NTILE(3) will divide the data in 3 parts. 

SELECT *, NTILE() OVER (ORDER BY Salary) as Ntile
FROM Employee

The number of rows should be 6/3 = 2, therefore we need to divide the 2 rows for each percentile

Emp. IDNameSalaryNtile
232Rakshit300001
543Rahul300001
124Aman400002
123Amit500002
453Sumit500003
432Nihar600003

15. Can we use HAVING without GROUP BY?
Have you heard it lately that Having can be used only with group by?
It’s actually true, but can we write a query using Having without group by?

Yes we can, but it won’t get you anything fruitful 😛

Select * 
From Employee
Having 1=1

Having is used to filter groups .
Where clause is used to filter rows.

12+ correct answers -> Good enough to move to the next article
8-12 correct answers -> Go through sqlzoo website to chisel your skills
4-8 correct answers -> Go through either tutorialspoint or w3school before proceeding
0-4 -> Bhai kar kya rhe ho jindgi ke saath ? 100-200 leke koi nya course join kr lo 😛

We will discuss some more questions in the next article.

Keep Learning 🙂

The Data Monk

Introduction to SQL

Structured Query Language is the base of your Data Science career. You will always be surrounded with SQL queries and it’s very important to understand the basics of the language.

We will not waste time on defining terminologies associated with the language. If you want to start from the very basic then do go through www.tutorialspoint.com.

We will start with the most asked question in a Data Science interview.

“Okay Aman, tell me the sequence of SQL command execution”

And Aman starts remembering what he has by-hearted last night 😛

Let’s make this easy to understand.

Where does your data comes from ?
Table, right?
How many tables? Any number, depending on the number of joins.

So, first of all the query will execute the “FROM” and all the “JOIN” tables, because you need to extract some data to process it, right?

So, the first command to run is FROM and JOIN.

Now, i have the mammoth looking data dump(10,000 rows), what next?
I will apply a WHERE condition, remember, Where condition cuts down the number of “rows” from the table.

The second command to get executed is “WHERE”.

Now i have all the rows which i need, Group by simply groups the data on one or more field. Cool?

I have 50,000 rows of 50 students. I apply a group by and the number of rows is reduced to 1000.

GROUP BY is the next command to be executed after WHERE clause.

FROM,JOIN -> WHERE -> GROUP BY

What else do we have?
Ummm, We have HAVING, LIMIT,SELECT,ORDER BY

We all know that HAVING is always used along with Group by to apply condition on aggregated value. If you are unsure about the difference between WHERE and HAVING command then, God bless you 😛

Well WHERE is applied on rows and HAVING is applied on aggregate value like – HAVING sum(Marks) > 1000

So, HAVING further applies condition on your group by. Thus HAVING follows Group by

FROM,JOIN -> WHERE -> GROUP BY -> HAVING

Now we have SELECT,LIMIT and ORDER BY

We have filtered out everything, now is the time to SELECT the columns which you need.

You have the selected columns, which you will order and then apply Limit to it.

There is no point ordering 10 columns and then selecting 4 ordered columns, right?

You first limit the columns, then apply the order and then if there is a need to apply a limit to the fetched record, apply a LIMIT

Thus, the final order is

FROM,JOIN
WHERE
GROUP BY
HAVING
SELECT
ORDER BY
LIMIT


Read it twice if you are unable to get the concept. You should be able to recite the whole story even at 3 in the morning 😛

2. What is fetch and offset?
This might look fancy but the dictionary meaning will do the trick. Offset is like, “how many rows you want to chuck” and fetch is like “how many rows do you want to include”

So If my table has data of top 100 students.
OFFSET 10 FETCH 5 will get you the 11th to 15th rank student.

Where will these two clause fall in the execution ?
FETCH and OFFSET are parallel to the LIMIT command

3. What is the difference between ‘HAVING’ CLAUSE and a ‘WHERE’ CLAUSE?
HAVING clause can be used only with SELECT statement. 
It is usually used in a GROUP BY clause and whenever GROUP BY is not used, HAVING behaves like a WHERE clause. 
Having Clause is only used with the GROUP BY function in a query 
whereas WHERE Clause is applied to each row before they are a part of the GROUP BY function in a query.

4. What is self join?
 The SQL SELF JOIN is used to join a table to itself as if the table were two tables, temporarily renaming at least one table in the SQL statement.

SELECT a.column_name, b.column_name…
FROM table1 a, table1 b
WHERE a.common_field = b.common_field; 

5.
SELECT – select records from a table
INSERT – insert new records
UPDATE – update/Modify existing records
DELETE – delete existing records 
CREATE – create a new Table, database, schema
ALTER – alter existing table, column description
DROP – delete existing objects from database
GRANT – allows users to read/write on certain database objects
REVOKE – keeps users from read/write permission on database objects

6. A field with a NULL value is a field with no value. It is very important to understand that a NULL value is different than a zero value or a field that contains spaces.

Yahan light nai le lena interview main

7. Like clause is an important concept in SQL. Whenever you want to apply a like clause in a dataset with varchar datatype then do cast everything in either lower or upper case.

lower(name) like ‘nitin kamal’

is always preferred over

name like ‘Nitin Kamal’

Your code should be able to minimize margin of errors.

You should definitely try SQLZOO to practice SQL queries. The link is given below:-

What ?? Go and Google it daa 😛

Keep Learning 🙂

The Data Monk





Supply Chain Analytics – Using PuLP in Python

There are three types of programming which we can do in Supply Chain

a. Linear Programming – It involves creating a model on continuous variables
b. Integer Programming – It involves creating a model on only Discrete or Integer value
c. Mixed Integer Programming – It is a mix of continuous and discrete variables

What is PuLP?
PuLP is a Python Library that enables users to describe mathematical programs. PuLP works entirely within the syntax and natural idioms of the Python language by providing Python objects that represent optimization problems and decision variables, and allowing constraints to be expressed in a way that is very similar to the original mathematical expression.

PuLP has focused on supporting linear and mixed-integer models.

We will be using PuLP to solve some Supply Chain Problems

Introduction to PuLP in Supply Chain Analytics

PuLP as you know is an Integer Programming/Linear Programming Modeler. There are three parts of creating a model in PuLP:-
a. Decision Variables – These are the variables which impacts the Supply Chain. For example, Number of pressure cooker is a decision variable to cook Rice. More number of Pressure cooker will help you in cooking more rice
b. Objective Functions – These are the mathematical equations to use Decision variables to optimize your process i.e. either maximize or minimize something. Example, You have 3 pressure cooker and 2 Stoves, and you have to cook Rice and Lentils. Now you need to figure out an objective function to cook maximum rice and lentils in 1 hour using the above utensils
c. Constraints – These are the things which limits our ideal or optimized solution.

Let’s take a case study of Supply Chain optimization.

There is a Restaurant which serves Mega Pizza (40”).  It has one oven, 3 bakers, and 1 packer. Following is the time required by each Pizza

  Number Pizza A Pizza B Pizza C Working Days
Oven 1 Oven 1 Day 0.5 Day 1 Day 30 Days
Baker 3 Bakers 1 Day 2 Days 2 Days 30 Days
Packer 2 Packers 1 Day 1 Day 1 Day 20 Days
Profit   $30 $40 $50  

Now you have to maximize the Profit using PuLP library. Use decision variables, objective functions, and constraints.

How much pizza of each type should we make in 30 days.

First let’s look into the coding part in Python

from pulp import *
model = LpProblem(“Maximize Pizza Profit”, LpMaximize)

#Declare Decision Variable
A = LpVariable(‘A’,lowbound=0,upbound = None,cat=’Integer’)
B = LpVariable(‘B’,lowbound=0, upbound = None, cat=’Integer’)
C = LpVariable(‘C’,lowbound=0,upbound = None, cat=’Integer’)


#Define Objective function
#For Oven
model += 1*A + 0.5*B + 1*C <=  30
#For Baker
model += 1*A+2*B+2*C <=90
#For Packer
model += 1*A+1*B+1*C <= 40

#Solve Model
model.solve()
print(“Produce {} Pizza A”.format(A.varValue))
print(“Produce {} Pizza B”.format(B.varValue))
print(“Produce {} Pizza C”.format(C.varValue))


Now let’s understand the code

from pulp import *
Here you are importing the complete package

model = LpProblem(“Maximize Pizza Profit”, LpMaximize)
Here you are defining the model using LpProblem function. The LpMaximize will look for maximizing the value i.e. Profit. If you want to get the minimum value from the model then use LpMinimize. We can use LpMinimize when we are talking about reducing the wastage.

A = LpVariable(‘A’,lowbound=0,upbound = None,cat=’Integer’)
Here we define each Variable using LpVariable function. Lowbound refers to the lowest possible value of the variable.
Pizza can not be negative so we have given the value 0, Upbound is the maximum value of the variable.
None will ensure that the upbound could be anything
cat is the characteristic of the variable. It could be integer, categorical, or Binary



model += 1*A + 0.5*B + 1*C <=  30
This is the constraint for Oven. A requires 1 day, B requires 0.5 Day, and C requires 1 Day. The <=30 is the constraint which is because there is one oven which will work for 30 days

model += 1*A+2*B+2*C <=90
Similar to the above, the Baker will need 1, 2, and 2 days for A,B, and C respectively. And there are 3 Bakers which work 30 days. Thus constraint is 30*3 = 90

#For Packer
model += 1*A+1*B+1*C <= 40

A packer takes 1,1,and 1 day for A,B, and C pizza. And there are 2 Packers who  works 20 days each. Thus constraint is 40.

This is a simple example with fixed constraints.

Keep Learning 🙂

Tha Data Monk

Supply Chain Analytics

We all have a fair idea about the supply chain. In a layman term, we can say that the supply chain analytics helps in improving the operational efficiency and effectiveness by providing “data-driven” decisions at the operational and strategic level.

There are four types of analytics which one can perform to boost the performance of a supply-chain enabled business:-

Kartik had a mild migraine a few days ago, he ignored it and continued with his daily routine. After a few days, he found out that the pain is getting worse with time, He consulted Dr.Nikhil, who first asked for his medical history/reports i.e. Weight, Sugar-level, Bloop pressure, etc.
Then he looked into the reports and tried to diagnose the reason behind this abrupt pain.
The ache was there all the time which made Doctor believe that it is bound to happen in the future, so after looking at all the major points, Nikhil prescribed some medicine to Kartik.

What is what?

Reports ~ KPIs of the industry and business i.e. Descriptive Analytics
Diagnosis ~ Looking for reasons for the numbers present in the report i.e. Diagnostic analytics
Prediction of future pain to Kartik ~ Predictive analytics
Prescribing medicine ~ Looking at all the past behavior and KPIs, we do Prescriptive analytics


1. Descriptive analytics – So, you have some historic data and you need to find the performance of KPIs, this type of analysis is called descriptive analysis. The descriptive analysis helps us in finding answers to questions like How many products were sold, the performance of products in different stores, the performance of stores in terms of revenue, etc.

Basically, it gives you a gist of the current state of a company

2. Diagnostic analytics – On one hand, the descriptive analysis tells you about the KPIs of the company, whereas the diagnostic analytics tells you a lot about the underlying issue. If the descriptive analysis tells you that
Product A is not performing well in the Whitefield Store of Walmart, then the diagnostic analysis will aim at finding the underlying reasons for the same.

3. Predictive analytics –

“Do you understand the difference between Forecasting and prediction?”
Forecasting is the use of historic data which holds some pattern, to give a number for the future i.e. you are basically extrapolating the past pattern to get the numbers for the future. Whereas prediction is a more vague term which takes the changes of future in the account.

When I go through the last 40 months of data to estimate the number of joints rolled by Aman in the next month, then this is a case of forecasting. But, if I read the palm of Ramesh and tells him his future by considering the present and future behavior of the stars, then it’s a prediction.

Predictive analytics uses statistical techniques to estimate the likelihood of future events such as stock-outs or movements in your product’s demand curve. It provides the foresight for focused decision making that avoids likely problems in the future.

4. Prescriptive Analytics – Now it’s the time to have an overview of all the analytics components and provide solutions which can improve the performance of the business. This is done by prescriptive analytics.

Descriptive talks about the KPIs, diagnostic tries to find out the reason behind these numbers, predictive wants to know the performance of the business by looking at the historic and futuristic actions, prescriptive provides the final prescriptions !!

Components of Supply Chain Analytics:-

Overall, supply chain analytics can be divided into 5 parts:-

1. Inventory Management – This part looks after the “store-house” of a company. The major parts of analytics here are

a. Inventory Cost Optimization
b. Optimal Stocking
c. Stock-out Prediction

2. Sourcing – How to full fill the demand

a. Optimized Order Allocation
b. Arrival time optimization
c. Sourcing cost analysis

3. Vendor Management – How to optimize vendors for your company

a. Fault Rate Analysis
b. Profitability Analysis
c. Vendor Scorecard

4. Returns Management – What to do if a product is returned?

a. Returns tracking
b. Salvaging optimization
c. Cost Recovery Analysis

5. Network Planning – How to optimize the transport network to maximize profit?

a. Trailer Utilization
b. Freight Cost Optimization
c. Vehicle Routing

What are the five stages of Supply Chain?

You can divide the whole Supply chain process in 5 stages
a. Plan – Every company needs a strategy on how to manage the resources in order to achieve their customers demand for their products and services
b. Source – To create their products, companies need to be very careful when choosing suppliers to deliver their goods and services needed
c. Make – In manufacturing the supply chain manager should always schedule the activities that are needed for the production, packaging, testing and preparation for delivery.
d. Deliver – This part is mainly referred to as logistics by the supply chain management. In this case companies coordinate receipts of orders, pick carriers to get products to customers and develop a network of warehouses.
e. Return – In many companies this is usually where the problem is – in the supply chain. The planners should create a flexible and responsible network for receiving a flaw and excess products sent back to them (from customers).

Common Terminologies in Supply Chain

1. Back Ordering – When you don’t have product in your inventory and the product has already been ordered by a customer. In this case you give the order to a supplier. This is called Back-Ordering

2. Blanket Order – It is a large purchase order registered by the end user which the supplier has to supply in a span of few days where the dates are not fixed. It’s just like saying “I need 5000 Light candles before October 31st”. This will ensure a large order aiming for a good amount of discount before a festive or high demand season

3. Consignment –  This term has more than one meaning. Most often it means the act of placing your goods in the care of a third-party warehouse owner (known as the consignee) who maintains them for a fee. In addition to storing the goods, the consignee may sell or ship them to customers on your behalf for a fee. As a separate meaning, consignments can also refer to individual shipments made to the consignee.

4. Drop Shipment – You create a website and listed few things which are present in a nearby store. As soon as an order is placed on your website, you give the order to the nearby mart to deliver it to the customer’s place. Your profit is the difference between price paid by the customer and delivery+product cost of the mart. Here you do not need an inventory, in fact you do not need any store house or capital investment to start an e-commerce business

5. Groupage –  This is a method of grouping multiple shipments from different sellers (each with its own bill of lading) inside a single container. This is done when individual shipments are less than the container load or in other words are not big enough by themselves to fill up an entire container. This way, the freight cost is split between these sellers.

6. JIT – Just-in-time is an inventory optimization method where every batch of items arrives ”just in time” to fulfil the needs of the next stage, which could be either a shipment or a production cycle.

7. Landed Cost –  The total cost of ownership of an item. This includes the cost price, shipping charges, custom duties, taxes and any other charges that were borne by the buyer.

8. Waybill: A document prepared by the seller, on behalf of the carrier, that specifies the shipment’s point of origin, the details of the transacting parties (the buyer and seller), the route, and the destination address.


You can look for more definitions and KPIs related to Supply chain. But, this is a decent way to start the exploration.

We will deal with implementing a simple Supply Chain problem using PuLP in Python in our next article.

Keep Learning 🙂

The Data Monk