Day 9 – EDA in Python

The complete Machine Learning journey can be penned down in 4 steps:-

1. Exploratory Data Analysis This is the first thing you do when you get a dataset. Before jumping on to building models, you need to first understand the nature of the data. EDA helps in making it easier for the audience to get along with data.

EDA includes visualizing the raw data, looking for correlation in the dataset and finding missing values in the data set. In short you have to plot a hell lot of graphs to understand the dataset.

2. Cleaning the data – You will spend more than 50 percent of your time in cleaning the data and doing missing value treatment. Cleaning is important because the accuracy of your model will depend on the number of proper data points.

3. Building models – We are talking about Machine Learning algorithms, so, once you have the clean data, you need to build models, visualize results, check the results and improve the success metric of the model

4. Result Presentation – You have the results of the model. This result is of no use until and unless it is consumed by the audience. You will again need the power of visualizations to prove the result of your analysis

We will take us a data set and will build some graphs from scratch.
I will be using the Titanic Data set because of the following reasons:-
1. Firstly, It is freely and easily available at this link – https://www.kaggle.com/c/titanic/data
2. It’s clean and easy to understand

Gist of the dataset – Titanic train dataset contains the various information(like. age, sex, family size of the passenger, etc.) about those who survived and who could not survive

Day 7 – Start with Python

Python is one of the most preferred language for Data Science. It is needless to discuss the pros and cons of Python over any other language(R/SAS/Java/C).

If you are already comfortable with any other language then it’s good, but if you are still exploring and are ready to pick a new language, then Python it is.

At the time of writing this blog, two versions of Python are popular
Python 2.7
Python 3.*

Let’s start with installing the snake.

1. All you need is – Anaconda
Once you have it in your system, install it and you will be able to execute python codes. I would recommend using Jupyter notebook, this was you will be able to document your work and analysis.

How to launch Jupyter Notebook?
Once you have installed Anaconda, you will get an Anaconda Navigator in your start menu or on your desktop.
Double click to open it.

This is how Anaconda Navigator will look like. Click on the Launch button below the Jupyter Notebook

The Jupyter notebook will look something like the one below:

Click on the new button and select Python 3(if Python 3 has been installed)

Running your first Python program

What do you want Python to print?
“Hello World” ??

Yeah !! That’s how we start

Following libraries are the bread and butter of a Data Scientist:-
1. Matplot – To visualize your data
2. Numpy – In eases the complex mathematical operations
3. Pandas – It provides data structure of a high level and a very good variety of tools for analysis
4. SciPy – For Machine Learning
5. Pytorch – To provide tensor computation
6. Keras – Talks about Neural Network in detail
7. Scikit- It holds a lot of unsupervised learning algorithms

To start with, we will use Matplot, Numpy, Pandas and Scikit.

Though I recommend solving at least 25 questions from any of the below websites to make you comfortable with functions, but you can chuck this part and jump directly on the Data Analysis part
1. Codechef
2. Spoj
3. HackerRank

If you want to get a head start in Python, then you can go through either w3school or tutorials point first. But, in case you want to learn it in a more practical way, you can stick to this website.

In the coming few days, we will quickly jump from basic data types and structures, and will start exploring few algorithms on really small dataset. The reason why We will be using small datasets is that you can actually visualize the effect of each algorithm.

We will then move to solving a couple of Hackathons.

See, it’s very irritating to write or explain the different data types of a new language. We will try to keep it simple and crisp, once you start using these is when you will understand it better.

Data Types:-
1. Number – Daaahh !! It’s just numbers(int, long, float and complex)
2. String – Same old story. Remember, index starts with 0
3. List –
-A list can contain anything
-It also starts with index 0
-It is MUTABLE
– Square bracket

list1 = ['alpha', 'beta', 'Shaktiman', 8382]
list2 = ["d", "a", "t", "a"]

4. Tuples –
-Same as List
-Tuple is IMMUTABLE
-Starts with index 0
– Round bracket

tup1 = ('alpha', 'beta', 'Shaktiman', 8382)
tup2 = ("d", "a", "t", "a")

5. Dictionary
– Contains key and values where key is the identifier and value is the value. Example – Abba:6132, Dabba:6292,Jabba:6002
-Curly braces

Student = {
  "Abba": 6132,
  "Dabba":6292,
  "Jabba":6002
}

Dekho, itna ho gya..Baaki jab time aaeyga toh khud sheekh jaaoge..Haan agar aur kuch specific malum krna ho toh comment daal dena

Will learn python in a question answer way in the next article

Keep Learning 🙂

The Data Monk

Day 6 – Less asked SQL questions

We have already covered queries on joins, aggregate functions, sub-queries, running sum, etc.

This article will concentrate on the theoretical part of SQL which are less asked but are important to know about

1.What are the different types of statements supported in SQL?

There are three types of statements in SQL:-
a. DDL – Data Definition Language
b. DML – Data Manipulation Language
c. DCL – Data Control Language

2. What is DDL?

It is used to define the database structure such as tables. It includes 3 commands:-
a. Create – Create is for creating tables 
CREATE TABLE table_name (
    column1 datatype,
    column2 datatype,
    column3 datatype,
   ….
); 
b. Alter – Alter table is used to modifying the existing table object in the database.
ALTER TABLE table_name
ADD column_name datatype
c. Drop – If you drop a table, all the rows in the table is deleted and the table structure is removed from the database. Once the table is dropped, you can’t get it back

3. What is DML?

Data Manipulation Language is used to manipulate the data in records. Commonly used DML commands are Update, Insert and Delete. Sometimes SELECT command is also referred as a Data Manipulation Language.

4. What is DCL?

Data Control Language is used to control the privileges by granting and revoking database access permission

5. What is Normalization? How many Normalization forms are there?

Normalization is used to organize the data in such manner that data redundancy will never occur in the database and avoid insert, update and delete anomalies.
There are 5 forms of Normalization
First Normal Form (1NF): It removes all duplicate columns from the table. Creates a table for related data and identifies unique column values
Second Normal Form (2NF): Follows 1NF and creates and places data subsets in an individual table and defines the relationship between tables using a primary key
Third Normal Form (3NF): Follows 2NF and removes those columns which are not related through primary key
Fourth Normal Form (4NF): Follows 3NF and do not define multi-valued dependencies. 4NF also known as BCNF

6. Full Outer Join is a combination of which of the following:-
a. Left Outer and Left Inner Join
b. Left Outer and Right Inner Join
c. Left Outer and Right Outer Join
d. Left Outer and Right Outer Join


A. Full Outer Join is a combination of Left Outer and Right Outer Join in SQL

7. What is unnest in SQL?

In General, unnesting comes into picture when you are dealing with Big Data because there you will have data in different formats. You might have to deal with an array in a particular column.
Array is nothing but a sequence of data points under one heading.

Let’s take an example of creating an array in BigQuery

Select ['Jan','Feb','Mar'] as Month
Select [0,0,1] as Bool
SELECT *
FROM UNNEST(['a', 'b', 'c', 'd', 'e', 'f'])AS alpha
WITH OFFSET AS offset
ORDER BY offset;

The above code will create the following table

alpha offset
a 0
b 1
c 2
d 3
e 4
f 5

8. What is RDBMS?

RDBMS stands for Relational Database Management System. RDBMS store the data into the collection of tables, which is related by common fields between the columns of the table. It also provides relational operators to manipulate the data stored into the tables.

9. What is NoSQL ?
NoSQL is a non-relational DMS, that does not require a fixed schema, avoids joins, and is easy to scale. NoSQL database is used for distributed data stores with humongous data storage needs.

10. What is a unique key?

A Unique key constraint uniquely identified each record in t
A Primary key constraint has automatic unique constraint defined on it. But not, in the case of Unique Key.
There can be many unique constraint defined per table, but only one Primary key constraint defined per table.

11. What is a foreign key?

A foreign key is one table which can be related to the primary key of another table. Relationship needs to be created between two tables by referencing foreign key with the primary key of another table.

11. Table 1 has only one column ‘Col1’ having values 1,2,3,4
Table2 has only one column ‘Col2’ having values 1,1,4,6

Inner Join on the two table will get you how many rows?


Inner Join will get you 3 rows
1 1
1 1
4 4

The duplicate row will repeat

12. Left Outer Join in the same table?

It will get you the following rows given that the query looks like the one below

Select t1.Col1,t2.Col2
From Table1 t1 Left Outer Join Table2 t2 on (t1.Col1 = t2.Col2)

Left Outer Join Output

t1.Col1 t2.Col2
1 1
1 1
2 NULL
3 NULL
4 4

13. Add a NULL in the first table i.e. Table1
Table1[Col1] = 1,2,3,4,NULL
Table2[Col1] = 1,1,4,5

Table1 Left Outer Join Table2


Approach

The NULL will impact the number of rows being created

Output

t1.Col1 t2.Col1
1 1
1 1
2 NULL
3 NULL
4 4
NULL NULL

14. What is normalization?

Normalization is the process of minimizing redundancy and dependency by organizing fields and table of a database. The main aim of Normalization is to add, delete or modify field that can be made in a single table.

15. What is Denormalization.

DeNormalization is a technique used to access the data from higher to lower normal forms of database. It is also process of introducing redundancy into a table by incorporating data from the related tables.

16. What is a stored procedure?

Stored Procedure is a function consists of many SQL statement to access the database system. Several SQL statements are consolidated into a stored procedure and execute them whenever and wherever required.

17. What is a trigger?

A DB trigger is a code or programs that automatically execute with response to some event on a table or view in a database. Mainly, trigger helps to maintain the integrity of the database.

Example: When a new student is added to the student database, new records should be created in the related tables like Exam, Score and Attendance tables.

18. What is the difference between Cluster and Non-Cluster Index?

Clustered index is used for easy retrieval of data from the database by altering the way that the records are stored. Database sorts out rows by the column which is set to be clustered index.

A non-clustered index does not alter the way it was stored but creates a complete separate object within the table. It point back to the original table rows after searching.

19.What is Union, minus and Interact commands?

UNION operator is used to combine the results of two tables, and it eliminates duplicate rows from the tables.

MINUS operator is used to return rows from the first query but not from the second query. Matching records of first and second query and other rows from the first query will be displayed as a result set.

INTERSECT operator is used to return rows returned by both the queries.

20. How can you create an empty table from an existing table?

Select * into Employee from Department where 0=1

This will ONLY copy the structure of the table

We will keep on adding tricky questions. But the last 4 days SQL queries should give you a good base to start with.

Keep Learning 🙂

The Data Monk

Day 5 – SQL advance concepts

Advance concepts mostly involves knowledge of window functions, cleaning the data format, working with different date formats, etc. Before proceeding, you should actually go through different ways of casting data types and other widows functions.


1.Create the cumulative revenue column on a restaurant data where the table contains revenue of each of the restaurants and you are supposed to create another column having running sum

select R1.RestaurantId, R1.Revenue, SUM(R2.Revenue) as Cumulative_Revenue
from Restaurant R1
inner join Restaurant R2 
on R1.RestaurantId >= R2.RestaurantId
group by 1,2
order by 1

2. What is the difference between group by and partition by?
Group by and partition by both are used on the complete dataset and is primarily used to apply functions on a different groups of the table.

On one hand, group by aims at reducing the number of rows of the table, but partition by does not reduce the number of rows.

Group by modifies the query but partition by works with ranking functions like row number(), rank(),dense_rank()

3. What are the different value related window function?

Lead() – The LEAD() window function returns the value for the row after the current row in a partition. If no row exists, null is returned
Lag() – The LAG() window function returns the value for the row before the current row in a partition. If no row exists, null is returned
Last_Value() – The FIRST_VALUE window function returns the value of the specified expression with respect to the first row in the window frame
First_Value() – The LAST_VALUE window function returns the value of the specified expression with respect to the last row in the window frame.

4. Use Lag function in a query to get the Lag column

RestaurantPriceLag
Oliver’s2000Null
PBN30002000
Social40003000
Harry’s50004000
SELECT Restaurant,
Price, 
lag(Price,1) over (Restaurant) as Lag 
from Restaurant

5. Use the Lead function to get a lead column

RestaurantPriceLead
Oliver’s20003000
PBN30004000
Social40005000
Harry’s5000 Null
SELECT Restaurant,
Price, 
lead(Price,1) over (Restaurant) as Lead 
from Restaurant

6. Show the use of First_Value()

RestaurantIdPriceFirst
Oliver’s120002000
PBN130002000
Social240004000
Harry’s250004000
To Be Continued260004000
Red Rhino370007000
Rasta Cafe380007000
Select Restaurant,
Id,
Price,
First_Value(Price) Over(Partition By Id) as First
From Restaurant

7. What are the aggregate window function?

Sum()
Average()
Count()
Min()
Max()
We are not going to discuss the definition or syntax of these functions

8. What are the ranking functions?
Ranking functions are important to keep in your mind. The most used ranking functions are:-

1. RANK()
2. DENSE_RANK()
3. ROW_NUMBER()
4. NTILE()

9. Which of the following group function ignores NULL value?(Tredence)
MAX
COUNT
SUM
All of the above

A. All of the above

10. What is the use of FETCH command?(Flipkart)
The FETCH argument is used to return a set of number of rows. FETCH can’t be used itself, it is used in conjuction with OFFSET.

SELECT column_name(s)
FROM table_name
ORDER BY column_name
OFFSET rows_to_skip
FETCH NEXT number_of_rows ROWS ONLY;

11. How to find count of duplicate rows?(Flipkart)

Select rollno, count (rollno) from Student
Group by rollno
Having count (rollno)>1
Order by count (rollno) desc;

12. You already know how to get the maximum and second maximum from a table. You also have a fair bit of idea to fetch the maximum from each department(as in form a table of employees with their department present in another column). Now fetch the second highest salary from each department

Select Dept, Max(Salary)
From (Select E1.Dept, E1.Salary From Emp E,Select (Dept,Max(Salary) as salary from Emp group by Dept)E1
Where E.Dept = E1.Dept
And (E.Salary < E1.Salary) X
Group By Dept

We will add more questions in the upcoming articles.

Keep Learning 🙂

The Data Monk

Day 4 – SQL Intermediate Questions

Let’s get started with some intermediate level SQL queries. Try to frame an approach before hoping to the solution because once you see the answer you will find it easy to understand.

I think there will be around 15 questions in the article below and do try to score at least 12+ before moving to the next article.

Day 2 – Introduction to SQL
Day 3 – SQL queries

1. You know why so many interviewers ask you if you know how to find the second and third and nth largest salary? It’s because it gives them a glimpse of your query understanding. We have already solved the question in the last day’s article, but I remember when I was interviewing for Myntra they asked me 3 different approaches to solve the same question. I knew that shit, so did good 😛

Get the third largest salary from the employee table(Myntra,Ola)

Schema –

Table Name – Employee
Columns – EmpId, EmpName,Salary

Solution/Approach



Select *
From Employee E1
Where (N-1) =
(Select Count(Distinct(EmpSalary))
From Employee E2
Where E2.EmpSalary > E1.EmpSalary)


The above query first take the distinct Salary’s count whenever the inner query is pulling salary greater than outer query. When you need 3rd largest salary then you need to eliminate the top 2 salaries which is done by the where condition in the inner query.

2. Well, the above query looks a bit difficult to me to understand, can you get me a better query which is easier to understand? (Myntra)

Sure, We can try a simpler and easy way to understand it.

Select Top 1 EmpSalary
From (Select Distinct Top N EmpSalary From Employee Order By EmpSalary    Desc) a
Order by EmpSalary

Here N will be 3, So the inner query will get me top 3 salaries ordered in a descending order and then the outer query will fetch the top result which is the 3rd largest salary.

3. I will give you complete marks if you can tell me one more method to do the same thing(Myntra)

Now I think I will have to make a temp table may be. Let’s see

 With CTE as 
 (Select EmpId,
 EmpName,
 EmpSalary,
 Row_Number() Over(Order By EmpSalary Desc) as Rank
 From Employee)
 
 Select * 
 From CTE
 Where Rank = N

The following questions have been repeated in at least 4 interviews I gave recently(Myntra,Ola,Affine Analytics and Unbxd)

You have two tables with one column each. The table A has 5 values and all the values are 1 i.e. 1,1,1,1,1 and Table B has 3 values and all the values are 1 i.e. 1,1,1.

How many rows will be there if we do the following:-

4. Inner Join

There will be 15 rows because each of the values from both the table will interact
I actually tried this in Python using Data Frame and following is the screenshot

5. Outer Join

Again there will be 15 rows because each values from both Table A and B will have to interact with each other

6. Left Join

Left Join will also be the same i.e. 15 rows. There is a catch in the order of the results though. Suppose we have two columns in each table

Table A
1 q
1 w
1 e
1 r
1 t

Table B
1 a
1 b
1 c

Left Join will give you 15 rows and the result will look like the one below

Doing Left Join Table A on Table B

Output

1 q a
1 q b
1 q c
1 w a
1 w b
1 w c

and so on.

7. Right Join

You must have guessed by now that the number of rows would be  15. The order of result would be

1 a q
1 a w
1 a e
1 a r
1 a t
1 b q

And so on.

Next is the classic question on self join and you gotta see this in your upcoming interviews. These questions were asked in the final round of EXL Services, Bangalore

8. There is a table with employee and manager information, the schema is given below. Write a query to get the EmployeeName and ManagerName. (Ola, Bangalore)

Schema

Table Name – Employee
Column Name – EmpId,EmpName,MangaerId


SELECT e1.Name EmployeeName, e2.name AS ManagerName
FROM Employee e1
INNER JOIN Employee e2
ON e1.ManagerId = e2.EmpId

9. What is one limitation in the above approach?(Ola, Bangalore)

If there is a NULL value in the ManagerId then it will not show up in the above table. To fix this we can have a  Left Join and can fill the NULL value with ‘NA’ or any other word.

SELECT e1.Name EmployeeName, ISNULL(e2.name, 'Top Manager') AS ManagerName
FROM Employee e1
LEFT JOIN Employee e2
ON e1.ManagerID = e2.EmployeeID

10. The Employee table have Department and Salary. I want to know the Maximum salary of each department

Schema –

Employee(Dept,Salary)

Solution


SELECT Dept,Max(Salary)
FROM
(SELECT E1.Dept, E1,Salary
FROM Employee E1,(SELECT Dept,Max(Salary) FROM Employee GROUP BY Dept)E2
WHERE E2.Salary > E1.Salary) E
GROUP BY Dept

 SELECT E1.Dept, E1,Salary
FROM Employee E1,SELECT Dept,Max(Salary) FROM Employee GROUP BY Dept


This will get you all but the maximum salary. So the final group by will be applied to the dataset which do not have the second maximum salary.

11. Use Dense Rank to get the third highest salary(Deloitte)

SELECT T as( 
SELECT EmployeeName,Dept,Salary, DENSE_RANK() Over (Partition By Dept Order By Salary Desc) as Dense_rank
FROM Employee)

SELECT *
FROM T
WHERE Dense_rank = 3

12. What if I want to get the top 5 salary information? What will you change in the above query?(Deloitte)


SELECT *
FROM T
WHERE Dense_rank <=5

13. What is the difference between heap table and temporary table?(Amazon)
Heap tables:
Heap tables are found in memory. They are used for high-speed storage on a temporary basis. They do not allow BLOB or TEXT fields.
Heap tables do not support AUTO_INCREMENT.
Indexes should be NOT NULL.

Temporary tables:
The temporary tables are used to keep the transient data. Sometimes it is beneficial in cases to hold temporary data. The Temporary table is deleted after the current client session terminates.

Main differences:
The heap tables are shared among clients while temporary tables are not shared.
Heap tables are just another storage engine, while for temporary tables you need a special privilege (create temporary table).

14. What is the result of following query?(Accenture)
select case when null=null then ‘Amit’ else ‘Rahul’ end from dual;

The null=null is always false.so the Answer of this query is Rahul.

15. What is the use of NVL function in Oracle?(Accenture)
NVL function is most important function to replace null value with another value.

Example:
select NVL(null,’Sachin’)
from dual;

which will give you output as Sachin.

16. Can we have another column in a table other than a primary key column which will act as a primary key?(Accenture)
Yes we can have another column in the table other than primary key which acts like primary key.But in database design this is not the recommended. One table can have only one primary key. But if you create other column with unique and not null constraint it acts like primary key.

17. Count the total salary department number wise where more than 2 employees exist.(BookMyShow)

SELECT deptno, sum(sal) As totalsal
FROM emp
GROUP BY deptno
HAVING COUNT(empno) > 2

18. How to fetch only common records from two tables emp and emp1?

(Select * from emp) Intersect (Select * from emp1)

I can go on and on with this article, but these are the snippets from my next book, so I need to hold my horses 😛

Will be back with some syntax heavy advance level interview questions

Keep Learning 🙂

The Data Monk

Day 3 – Basic Queries to get you started

I hope you have already read the introduction part where we discussed about the execution flow of all the clauses in SQL.

Today we will discuss around 15 questions which shall give you a good launching pad to directly jump on the complicated problems. You can, but you should not skip this article. Just go through it even if you are the king of SQL because “Why be a king, when you can be the God” (Comment if you know the song :P)

Your time is precious, so is mine. So, let’s start.

P.S. – Try to check the number of questions which you can answer by yourself(including the correct syntax). Check your score and it’s meaning at the end of the article

1. Select the maximum and second maximum salary from the employee table.

SELECT
  (SELECT MAX(Salary) FROM Emp) Max_Salary,
  (SELECT MAX(Salary) FROM Emp
  WHERE Salary NOT IN (SELECT MAX(Salary) FROM Emp)) as Second_max_salary

2. Get all employee detail from EmployeeDetail table whose “FirstName” not start with any single character between ‘a-p’

SELECT * 
FROM EmployeeDetail
WHERE FirstName like ‘[^a-p]%’

3. Write the query to get the branch and branch wise total(sum) admission fees, display it in ascending order according to admission fees.

Select Branch, 
SUM(Admission_Fee) as Total_admission_fees 
from Student
group by Branch 
order by SUM(Admission_fee) ASC

4. Get department name and average salary of all the employees (department wise)

Select Department, 
AVG(Salary) as Average_Salary 
From MuSigma
group by Department 
order by Department

5. Get department wise number of employees in Mu Sigma

Select Department,
COUNT(*)
FROM MuSigma
group by Department
order by Department

6. Will the code below work?

Select Name, Age as Actual_Age, Region
From Candidate
Where Actual_Age>35

The above query will not execute. Where clause should have actual field names

7. Find the 3rd Maximum salary in the employee table

Select distinct sal
from emp e1
where 3 = ((select count(distinct sal) from emp e2 where e1.sal <= e2.sal);

We need a table to understand Row_Number, Rank, etc.
The below table will be used in the next few queries

Emp. IDNameSalary
123Amit50000
453Sumit50000
232Rakshit30000
124Aman40000
543Rahul30000

8. What is ROW_NUMBER() function in SQL?
The ROW_NUMBER() ranking returns a unique and sequential number to
each row of the table without repeating or skipping any number. If there are two rows which are partitioned and holds the same value as Amit and Sumit, then the row number will be given randomly.

9. Write the syntax to create a new column using Row Number over the Salary column

SELECT *, ROW_NUMBER() OVER (Order By Salary) as Row_Num
FROM Employee

Output

Emp. IDNameSalaryRow_Num
232Rakshit300001
543Rahul300002
124Aman400003
123Amit500004
453Sumit500005

10. What is PARTITION BY clause?
PARTITION BY clause is used to create a partition of ranking in a table. If you partition by Salary in the above table, then it will provide a ranking based on each unique salary. Example below:-

SELECT *, ROW_NUMBER() OVER (PARTITION BY Salary ORDER BY Salary) as Row_Num

Output

Emp. IDNameSalaryRow_Num
232Rakshit300001
543Rahul300002
124Aman400001
123Amit500001
453Sumit500002

11. What is a RANK() function? How is it different from ROW_NUMBER()?
– RANK() function gives ranking to a row based on the value on which you want to base your ranking. If there are equal values, then the rank will be repeated and the row following the repeated values will skip as many ranks as there are repeated values row. Confused?? Try out the example below:-

SELECT *, RANK() OVER (ORDER BY Salary) as Row_Num
FROM Employee

Output

Emp. IDNameSalaryRow_Num
232Rakshit300001
543Rahul300001
124Aman400003
123Amit500004
453Sumit500004

12. How to use PARTITION BY clause in RANK() function?
– There is no point using PARTITION BY clause with RANK() function. The ranking result will have no meaning, as the rank will be done according to Salary values per each partition, and the data will be partitioned according to the Salary values. And due to the fact that each partition will have rows with the same Salary values, the rows with the same Salary values in the same partition will be ranked with the value equal to 1. 

SELECT *, RANK() OVER (PARTITION BY Salary ORDER BY Salary) as Row_Num
From Employee
Emp. IDNameSalaryRow_Num
232Rakshit300001
543Rahul300001
124Aman400001
123Amit500001
453Sumit500001

13. What is Dense Ranking?
DENSE_RANK() is similar to the RANK() function but it does not skip any rank, so if there are two equal values then both will be termed as 1, the third value will be termed as 3 and not 2.

SELECT *, 
DENSE_RANK() OVER (PARTITION BY Salary ORDER BY Salary) as Row_Num
FROM Employee

Output

Emp. IDNameSalaryRow_Num
232Rakshit300001
543Rahul300001
124Aman400003
123Amit500004
453Sumit500004
432Nihar600006

14. What is NTILE() function?
NTILE() is similar to percentile NTILE(3) will divide the data in 3 parts. 

SELECT *, NTILE() OVER (ORDER BY Salary) as Ntile
FROM Employee

The number of rows should be 6/3 = 2, therefore we need to divide the 2 rows for each percentile

Emp. IDNameSalaryNtile
232Rakshit300001
543Rahul300001
124Aman400002
123Amit500002
453Sumit500003
432Nihar600003

15. Can we use HAVING without GROUP BY?
Have you heard it lately that Having can be used only with group by?
It’s actually true, but can we write a query using Having without group by?

Yes we can, but it won’t get you anything fruitful 😛

Select * 
From Employee
Having 1=1

Having is used to filter groups .
Where clause is used to filter rows.

12+ correct answers -> Good enough to move to the next article
8-12 correct answers -> Go through sqlzoo website to chisel your skills
4-8 correct answers -> Go through either tutorialspoint or w3school before proceeding
0-4 -> Bhai kar kya rhe ho jindgi ke saath ? 100-200 leke koi nya course join kr lo 😛

We will discuss some more questions in the next article.

Keep Learning 🙂

The Data Monk

Day 2 – Introduction to SQL

Structured Query Language is the base of your Data Science career. You will always be surrounded with SQL queries and it’s very important to understand the basics of the language.

We will not waste time on defining terminologies associated with the language. If you want to start from the very basic then do go through www.tutorialspoint.com.

We will start with the most asked question in a Data Science interview.

“Okay Aman, tell me the sequence of SQL command execution”

And Aman starts remembering what he has by-hearted last night 😛

Let’s make this easy to understand.

Where does your data comes from ?
Table, right?
How many tables? Any number, depending on the number of joins.

So, first of all the query will execute the “FROM” and all the “JOIN” tables, because you need to extract some data to process it, right?

So, the first command to run is FROM and JOIN.

Now, i have the mammoth looking data dump(10,000 rows), what next?
I will apply a WHERE condition, remember, Where condition cuts down the number of “rows” from the table.

The second command to get executed is “WHERE”.

Now i have all the rows which i need, Group by simply groups the data on one or more field. Cool?

I have 50,000 rows of 50 students. I apply a group by and the number of rows is reduced to 1000.

GROUP BY is the next command to be executed after WHERE clause.

FROM,JOIN -> WHERE -> GROUP BY

What else do we have?
Ummm, We have HAVING, LIMIT,SELECT,ORDER BY

We all know that HAVING is always used along with Group by to apply condition on aggregated value. If you are unsure about the difference between WHERE and HAVING command then, God bless you 😛

Well WHERE is applied on rows and HAVING is applied on aggregate value like – HAVING sum(Marks) > 1000

So, HAVING further applies condition on your group by. Thus HAVING follows Group by

FROM,JOIN -> WHERE -> GROUP BY -> HAVING

Now we have SELECT,LIMIT and ORDER BY

We have filtered out everything, now is the time to SELECT the columns which you need.

You have the selected columns, which you will order and then apply Limit to it.

There is no point ordering 10 columns and then selecting 4 ordered columns, right?

You first limit the columns, then apply the order and then if there is a need to apply a limit to the fetched record, apply a LIMIT

Thus, the final order is

FROM,JOIN
WHERE
GROUP BY
HAVING
SELECT
ORDER BY
LIMIT


Read it twice if you are unable to get the concept. You should be able to recite the whole story even at 3 in the morning 😛

2. What is fetch and offset?
This might look fancy but the dictionary meaning will do the trick. Offset is like, “how many rows you want to chuck” and fetch is like “how many rows do you want to include”

So If my table has data of top 100 students.
OFFSET 10 FETCH 5 will get you the 11th to 15th rank student.

Where will these two clause fall in the execution ?
FETCH and OFFSET are parallel to the LIMIT command

3. What is the difference between ‘HAVING’ CLAUSE and a ‘WHERE’ CLAUSE?
HAVING clause can be used only with SELECT statement. 
It is usually used in a GROUP BY clause and whenever GROUP BY is not used, HAVING behaves like a WHERE clause. 
Having Clause is only used with the GROUP BY function in a query 
whereas WHERE Clause is applied to each row before they are a part of the GROUP BY function in a query.

4. What is self join?
 The SQL SELF JOIN is used to join a table to itself as if the table were two tables, temporarily renaming at least one table in the SQL statement.

SELECT a.column_name, b.column_name…
FROM table1 a, table1 b
WHERE a.common_field = b.common_field; 

5.
SELECT – select records from a table
INSERT – insert new records
UPDATE – update/Modify existing records
DELETE – delete existing records 
CREATE – create a new Table, database, schema
ALTER – alter existing table, column description
DROP – delete existing objects from database
GRANT – allows users to read/write on certain database objects
REVOKE – keeps users from read/write permission on database objects

6. A field with a NULL value is a field with no value. It is very important to understand that a NULL value is different than a zero value or a field that contains spaces.

Yahan light nai le lena interview main

7. Like clause is an important concept in SQL. Whenever you want to apply a like clause in a dataset with varchar datatype then do cast everything in either lower or upper case.

lower(name) like ‘nitin kamal’

is always preferred over

name like ‘Nitin Kamal’

Your code should be able to minimize margin of errors.

You should definitely try SQLZOO to practice SQL queries. The link is given below:-

What ?? Go and Google it daa 😛

Keep Learning 🙂

The Data Monk





30 Days of Data Science

Data Science can be termed as a branch of Computer Science which revolves exclusively around Statistics, Data, and Algorithms. There are a hell lot of things which you can learn to chisel your skills.

In this 30 Days of Data Science, we aim to increase your craving towards the domain and to help you draw a path to make a career in this field.

We can safely assume that SQL, Statistics, and Machine Learning is the bread and butter of this domain. So, in the coming 30 articles, we will get a gist of what Data Science is and how to get started.

Following is the first draft of the plan, if you want to learn something more, then do comment the topic.

Day 1 – Introduction to Data Science(Pillars of the domain)
Day 2 – Introduction to SQL
Day 3 – Basic SQL Queries to get you started
Day 4 – Intermediate level SQL queries
Day 5 – SQL advance concepts
Day 6 – SQL tricky questions(read interview questions)
Day 7 – Python installations and basics to get you started
Day 8 – Python tricky questions
Day 9 – Exploratory Data Analysis in Python
Day 10 – Practice Numpy
Day 11 – Practice Pandas
Day 12 – Understand Linear Regression
Day 13 – Implement Linear Regression in Python
Day 14 – Understand Logistic Regression
Day 15 – Implement Logistic Regression
Day 16 – Python Interview questions
Day 17 – Understand NLP in layman terms
Day 18 – NLP basic algorithms in Python
Day 19 – Understand Time Series forecasting
Day 20 – Implement Time series algorithms
Day 21 – Understand Web Scrapping
Day 22 – Web Scrapping mini project
Day 23 – Basics of Statistics
Day 24 – Intermediate Statistics
Day 25 – Statistics interview Questions
Day 26 – Solving Case Studies
Day 27 – Solving Guesstimates
Day 28 – Solving Brain teasers
Day 29 – Solving Hackathons – Missing Value Treatment and EDA
Day 30 – Implementing algorithms and checking the performance of the models

Each article will be a 30 minutes read. By the end of the time frame you will be able to implement Linear,Logistic,NLP,Time Series algorithms, Web Scrapping along with an intermediate level knowledge of SQL,Python, and Statistics.

You will be able to solve a hackathon end-to-end.

All the best 🙂

The Data Monk