Affine Analytics Interview Questions | Day 17

Company – Affine Analytics
Location – Bangalore
Position – Senior Business Analyst
Experience – 3+ years

Compensation – Best in the industry

Affine Analytics Interview Questions




Number of Rounds – 4

I received a call from the Technical HR, who scheduled the telephonic round for the next day.

Round 1 – Telephonic Round (Mostly SQL and Project)
I was asked to introduce myself and then the discussion moved to my recent project at Mu Sigma. We had a good discussion on regression techniques and a bit on statistics.

The project description was followed by a few questions on SQL (the answers to these questions are present in various articles on the website; links are given below)

1. What is the order of SQL query execution?
2. You have two tables with one column each. Table A has 5 values, all of them 1 (i.e. 1,1,1,1,1), and Table B has 3 values, all of them 1 (i.e. 1,1,1).

How many rows will be returned if we do the following joins?
1. Left Join
2. Right Join
3. Inner Join
Answer:
https://thedatamonk.com/day-4-sql-intermediate-questions/

3. A quick guesstimate on the number of iPhones sold in India per year

Hint in the below link – https://thedatamonk.com/guesstimate-3-what-are-the-number-of-smartphones-sold-in-india-per-year/

4. What is a RANK() function? How is it different from ROW_NUMBER()?
https://thedatamonk.com/question/affine-analytics-interview-questions-what-is-a-rank-function-how-is-it-different-from-row_number/

5. How to fetch only even rows from a table?

Link to Question 4 and 5 – https://thedatamonk.com/day-11-sql-tricky-questions/
https://thedatamonk.com/question/affine-analytics-interview-questions-how-to-fetch-only-even-rows-from-a-table/

6. What are the measures of Central Tendency?
https://thedatamonk.com/day-14-basic-statistics/

The telephonic round went for around 1 hour:-
Introduction – 10 minutes
Project – 30 minutes
Questions – 20 minutes

I was shortlisted for the further rounds.
Altogether, the face-to-face interviews were divided into 3 rounds:
Round 2 – SQL and R/Python
Round 3 – Statistics
Round 4 – Case Study and HR questions

Round 2 – SQL and R/Python
There were ~20 questions on SQL and some questions on R/Python.
Below are the questions which I remember:-
1. Optimising a SQL query
2. Doing a sum on a column with NULL values
Hint – Use COALESCE
3. How to find the count of duplicate rows?
https://thedatamonk.com/question/affine-analytics-interview-questions-how-to-get-3-min-salaries/
4. Use of Lag function
Link – https://thedatamonk.com/day-5-sql-advance-concepts/
5. Life cycle of a project
6. How to find the second minimum salary?
7. How to get 3 Min salaries?
8. DDL, DML, and DCL commands
https://thedatamonk.com/day-13-sql-theoretical-questions/

There were a few more questions on joins and on optimising subqueries.
Overall difficulty level- 8/10

There were 5 questions on Python/R (a minimal Python sketch for a few of these is given after the list) –

Loading a CSV/text file
Writing the code for Linear Regression (as it was mentioned on my resume)
Doing a right join in either of the languages
Removing null values from a column
https://thedatamonk.com/question/affine-analytics-interview-questions-removing-null-value-from-a-column/
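
For reference, here is a minimal Python (pandas) sketch covering a few of the questions above. The file names and column names (employees.csv, departments.csv, dept_id, salary) are purely hypothetical.

import pandas as pd

# 1. Loading a CSV/text file (use sep="\t" for a tab-delimited text file)
emp = pd.read_csv("employees.csv")
dept = pd.read_csv("departments.csv")

# 2. Doing a right join on a common key
merged = emp.merge(dept, on="dept_id", how="right")

# 3. Removing null values from a column
clean = merged.dropna(subset=["salary"])   # drop rows where 'salary' is NaN/NULL

print(clean.head())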

Round 3 – Statistics

How to calculate IQR? (see the Python sketch after this list)
What is positive skewness and negative skewness?
https://thedatamonk.com/question/affine-analytics-interview-questions-what-is-positive-skewness-and-negative-skewness/
What are the two types of regression?
What is multiple linear regression?
What is Logistic Regression?
What is a p-value? Give an example.
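
A minimal Python sketch for the IQR and skewness questions above, assuming NumPy and SciPy are available; the data values are hypothetical.

import numpy as np
from scipy.stats import skew

data = np.array([2, 4, 4, 4, 5, 5, 7, 9, 12, 15])   # hypothetical sample

# IQR = Q3 - Q1 (75th percentile minus 25th percentile)
q1, q3 = np.percentile(data, [25, 75])
print("IQR:", q3 - q1)

# Skewness: > 0 means a longer right tail (positive skew), < 0 a longer left tail
print("Skewness:", skew(data))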

These questions were discussed in detail and I backed the explanations with real-life examples.

https://thedatamonk.com/day-18-statistics-interview-questions/

Bonus tips – Do look for good examples

Round 4 – Case Study and HR Questions

How many laptops are sold in Bangalore in a day?
https://thedatamonk.com/guesstimate-2-how-many-laptops-are-sold-in-bangalore-in-a-day/

Business Case Study – There is a mobile company which is very popular in other Asian countries. The company is planning to open its branch in the most popular mall in Bangalore.
What should be the strategy of the company?
How can you use freely available data to plan the marketing of the campaigns?
How can you use digital marketing to create campaigns for the company?

https://thedatamonk.com/question/affine-analytics-interview-questions-business-case-study/

These questions were followed by:-
Why do you want to change the company?
How is the work in your current organisation?

I got the confirmation in 2 working days.

This was it 

Amazon Interview Questions
Sapient Interview Questions


The full interview questions of these rounds are present in our book What do they ask in Top Data Science Interview Part 2: Amazon, Accenture, Sapient, Deloitte, and BookMyShow

You can get your hands on our e-books; we also have a 10 e-book bundle offer at Rs.549 where you get a total of 1400 questions.
Comment below or mail at contact@thedatamonk.com for more information

1. The Monk who knew Linear Regression (Python): Understand, Learn and Crack Data Science Interview
2. 100 Python Questions to crack Data Science/Analyst Interview
3. Complete Linear Regression and ARIMA Forecasting project using R
4. 100 Hadoop Questions to crack data science interview: Hadoop Cheat Sheet
5. 100 Questions to Crack Data Science Interview
6. 100 Puzzles and Case Studies To Crack Data Science Interview
7. 100 Questions To Crack Big Data Interview
8. 100 Questions to Learn R in 6 Hours
9. Complete Analytical Project before Data Science interview
10. 112 Questions To Crack Business Analyst Interview Using SQL
11. 100 Questions To Crack Business Analyst Interview
12. A to Z of Machine Learning in 6 hours
13. In 2 Hours Create your first Azure ML in 23 Steps
14. How to Start A Career in Business Analysis
15. Web Analytics – The Way we do it
16. Write better SQL queries + SQL Interview Questions
17. How To Start a Career in Data Science
18. Top Interview Questions And All About Adobe Analytics
19. Business Analyst and MBA Aspirant’s Complete Guide to Case Study – Case Study Cheatsheet
20. 125 Must have Python questions before Data Science interview
21. 100 Questions To Understand Natural Language Processing in Python
22. 100 Questions to master forecasting in R: Learn Linear Regression, ARIMA, and ARIMAX
23. What do they ask in Top Data Science Interviews
24. What do they ask in Top Data Science Interviews: Part 1

Keep Learning 



10 Questions, 10 Minutes – 2/100

This is something that has been on my mind for a long time. We will be picking 10 questions per day and will try to keep them simple.
We will make sure that the reader can cover the complete article in 10 minutes. There will be 100 posts in the coming 3 months.

The articles/questions will revolve around SQL, Statistics, Python/R, MS Excel, Statistical Modelling, and case studies.

The questions will be a mix of these topics to help you prepare for interviews.

You can also contribute by framing 10 questions and sending them to contact@thedatamonk.com or messaging me on Linkedin.

The questions will be updated late in the night (~1-2 a.m.) and will be posted on Linkedin as well.

Let's see how many we can solve in the next 100 posts

2/100 – SQL Questions

1. How to find the minimum salary using subquery?
-SELECT *
FROM employee
WHERE salary = (select MIN(salary) from employee);

2. How to find the second minimum salary?
– SELECT *
FROM employee
WHERE salary = (SELECT MIN(salary) FROM employee WHERE salary > (SELECT MIN(salary) FROM employee));

Similarly, find the third minimum salary

– SELECT *
FROM employee
WHERE salary = (SELECT MIN(salary) FROM employee
                WHERE salary > (SELECT MIN(salary) FROM employee
                                WHERE salary > (SELECT MIN(salary) FROM employee)));

3. The above query is too lengthy, write a query to get the third minimum salary with some other method.

– SELECT DISTINCT salary
FROM emp e1 WHERE 3 = (SELECT COUNT(DISTINCT salary) FROM emp e2 WHERE e1.salary >= e2.salary);

4. How to get 3 Min salaries?
-SELECT DISTINCT salary FROM emp a WHERE 3 >= (SELECT COUNT(DISTINCT salary) FROM emp b WHERE a.salary >= b.salary);

5. Some basic SQL SELECT questions (results as returned by SQL Server)
– SELECT 125
125
– SELECT 'Ankit'+'1'
Ankit1
– SELECT 'Ankit'+1
Error (the string 'Ankit' cannot be implicitly converted to a number)
– SELECT '2'+2
4
– SELECT SUM('1')
1

6. Write a generic method to fetch the nth highest salary without TOP or Limit

SELECT Salary
FROM Worker W1
WHERE n-1 = (
 SELECT COUNT( DISTINCT ( W2.Salary ) )
 FROM Worker W2
 WHERE W2.Salary > W1.Salary
 );
(Replace n with the required rank; the strict > ensures that exactly n-1 distinct salaries are higher than the one returned.)

7. LAG(): Provides access to a row at a given physical offset that comes before the current row. Use this function in a SELECT statement to compare values in the current row with values in a previous row as
specified by offset. Default offset is 1 if not specified. If Partition By clause is specified then it returns the offset Value in each partition after ordering the partition by Order By Clause.

Basically, lag() is used to create one more column in the table where you can get the previous value of the specified column

Col1   Col2   Lag_Col
a      10     NULL
b      20     10
c      30     20
d      40     30

8. One more example:

employee_number   last_name    first_name   salary   dept_id
12009             Sutherland   Barbara      54000    45
34974             Yates        Fred         80000    45
34987             Erickson     Neil         42000    45
45001             Parker       Sally        57500    30
75623             Gates        Steve        65000    30
SELECT dept_id, last_name, salary,
LAG (salary,1) OVER (ORDER BY salary) AS lower_salary
FROM employees;
dept_id   last_name    salary   lower_salary
45        Erickson     42000    NULL
45        Sutherland   54000    42000
30        Parker       57500    54000
30        Gates        65000    57500
45        Yates        80000    65000

9. LEAD() – Provides access to a row at a given physical offset that comes after the current row. Use this function in a SELECT statement to compare values in the current row with values in a subsequent row
as specified by offset. Default offset is 1 if not specified. If Partition By clause is specified then it returns the offset Value in each partition after ordering the partition by Order By Clause

10. Which operator is used for Pattern Matching?

The LIKE operator is used for pattern matching. It supports the wildcards below (the [] and [^] forms are SQL Server specific).
 % : Matches any string of zero or more characters.
 _ : Matches any single character.
 [] : Matches any single character within the specified range ([a-f]) or set ([abcdef]).
 [^] : Matches any single character not within the specified range ([^a-f]) or set ([^abcdef])


This was the second set of 10 questions. If you want to learn more about the type of questions asked in different Data Science interviews, then do try the books below:-

 What do they ask in top Data Science Interviews: 5 Complete Data Science Real Interviews Q and A

 What do they ask in Top Data Science Interview Part 2: Amazon, Accenture, Sapient, Deloitte, and BookMyShow

Keep Learning 🙂

The Data Monk

10 Questions, 10 Minutes – 1/100


1. Write the syntax to create a new column using Row Number over the Salary column

SELECT *, ROW_NUMBER() OVER (Order By Salary) as Row_Num
FROM Employee

Output

Emp. ID   Name      Salary   Row_Num
232       Rakshit   30000    1
543       Rahul     30000    2
124       Aman      40000    3
123       Amit      50000    4
453       Sumit     50000    5

2. What is PARTITION BY clause?
The PARTITION BY clause divides the result set into partitions before the window function is applied. If you partition by Salary in the above table, then the row numbering restarts for each distinct salary value. Example below:-

SELECT *, ROW_NUMBER() OVER (PARTITION BY Salary ORDER BY Salary) as Row_Num
FROM Employee

Emp. ID   Name      Salary   Row_Num
232       Rakshit   30000    1
543       Rahul     30000    2
124       Aman      40000    1
123       Amit      50000    1
453       Sumit     50000    2

3. What is a RANK() function? How is it different from ROW_NUMBER()?
– The RANK() function assigns a rank to each row based on the value you are ordering by. If there are ties, the tied rows get the same rank and the following rank is skipped by as many positions as there are tied rows. Confused? Try out the example below:-

SELECT *, RANK() OVER (ORDER BY Salary) as Row_Num
FROM Employee

Output

Emp. ID   Name      Salary   Row_Num
232       Rakshit   30000    1
543       Rahul     30000    1
124       Aman      40000    3
123       Amit      50000    4
453       Sumit     50000    4

As you can see, the rank 2 has been skipped because there were two employees with the same Salary and the result is ordered in ascending order by default.

4. What is Dense Ranking?
– DENSE_RANK() is similar to the RANK() function but it does not skip any rank, so if there are two equal values then both will be ranked 1 and the next value will be ranked 2 (not 3).

Syntax:-
SELECT *, DENSE_RANK() OVER (ORDER BY Salary) as Row_Num
FROM Employee

Output:-

Emp. ID   Name      Salary   Row_Num
232       Rakshit   30000    1
543       Rahul     30000    1
124       Aman      40000    2
123       Amit      50000    3
453       Sumit     50000    3
432       Nihar     60000    4

5. What is NTILE() function?
-NTILE(n) divides the ordered rows into n groups of roughly equal size (similar to percentiles); NTILE(3) will divide the data into 3 parts.

SELECT *, NTILE(3) OVER (ORDER BY Salary) as Ntile
FROM Employee

There are 6 rows, so each of the 3 groups gets 6/3 = 2 rows

Emp. ID   Name      Salary   Ntile
232       Rakshit   30000    1
543       Rahul     30000    1
124       Aman      40000    2
123       Amit      50000    2
453       Sumit     50000    3
432       Nihar     60000    3

6. How to get the second highest salary from a table?
Select MAX(Salary)
from Employee
Where Salary NOT IN (SELECT MAX(Salary) from Employee)

7. Find the 3rd Maximum salary in the employee table
-Select distinct sal
from emp e1
where 3 = (select count(distinct sal) from emp e2 where e1.sal <= e2.sal);

8. Get all employee details from the EmployeeDetail table whose "FirstName" does not start with any single character between 'a' and 'p' (the [^] wildcard used below is SQL Server syntax)
– SELECT *
FROM EmployeeDetail
WHERE FirstName LIKE '[^a-p]%'

9. How to fetch only even rows from a table?
-The best way to do it is by adding a row number using ROW_NUMBER() and then pulling only the rows where row_num%2 = 0

Suppose, there are 3 columns in a table i.e. student_ID, student_Name, student_Grade. Pull the even rows

SELECT *
FROM ( SELECT *, ROW_NUMBER() OVER (ORDER BY student_ID) as row_num FROM student) x
WHERE x.row_num%2=0

10. How to fetch only odd rows from the same table?
-Simply apply the x.row_num%2 <> 0 to get the odd rows

SELECT *
FROM ( SELECT *, ROW_NUMBER() OVER (ORDER BY student_ID) as row_num FROM student) x
WHERE x.row_num%2 <> 0


Let us know if you think I need to change any answer here.

Keep Learning 🙂

The Data Monk

Top 100 Power BI Interview Questions – Part 1/2

Q1. What are the parts of Microsoft self-service business intelligence solution?

Microsoft has two parts for Self-Service BI

Excel BI Toolkit – It allows users to create interactive reports by importing data from different sources and modelling the data according to the report requirements.
Power BI – It is the online solution that enables you to share the interactive reports and queries that you have created using the Excel BI Toolkit.

Q2. What is self-service business intelligence?

Self-Service Business Intelligence (SSBI)

  • SSBI is an approach to data analytics that enables business users to filter, segment, and analyze their data without in-depth technical knowledge of statistical analysis or business intelligence (BI).
  • SSBI has made it easier for end users to access their data and create various visuals to get better business insights.
  • Anybody who has a basic understanding of the data can create reports to build intuitive and shareable dashboards.

Q3.  What is Power BI?

Power BI is a cloud-based data sharing environment. Once you have developed reports using Power Query, Power Pivot and Power View, you can share your insights with your colleagues. This is where Power BI enters the equation. Power BI, which technically is an aspect of SharePoint online, lets you load Excel workbooks into the cloud and share them with a chosen group of co-workers. Not only that, but your colleagues can interact with your reports to apply filters and slicers to highlight data. They are completed by Power BI, a simple way of sharing your analysis and insights from the Microsoft cloud.

Power BI features allow you to:

  • Share presentations and queries with your colleagues.
  • Update your Excel file from data sources that can be on-site or in the cloud.
  • Display the output on multiple devices. This includes PCs, tablets, and HTML 5-enabled mobile devices that use the Power BI app.
  • Query your data using natural language processing (or Q&A, as it is known).

Q4. What is Power BI Desktop?

Power BI Desktop is a free desktop application that can be installed right on your own computer. Power BI Desktop works cohesively with the Power BI service by providing advanced data exploration, shaping, modeling, and creating report with highly interactive visualizations. You can save your work to a file or publish your data and reports right to your Power BI site to share with others.

Q5. What data sources can Power BI connect to?

The list of data sources for Power BI is extensive, but it can be grouped into the following:

  • Files: Data can be imported from Excel (.xlsx, .xlsm), Power BI Desktop files (.pbix) and Comma Separated Values (.csv).
  • Content Packs: It is a collection of related documents or files that are stored as a group. In Power BI, there are two types of content packs, firstly those from services providers like Google Analytics, Marketo or Salesforce and secondly those created and shared by other users in your organization.
  • Connectors to databases and other datasets such as Azure SQL Database and SQL Server Analysis Services tabular data, etc.

Q6. What are Building Blocks in Power BI?

The following are the Building Blocks (or) key components of Power BI:

  1. Visualizations: Visualization is a visual representation of data.
    Example: Pie Chart, Line Graph, Side by Side Bar Charts, Graphical Presentation of the source data on top of Geographical Map, Tree Map, etc.
  2. Datasets: Dataset is a collection of data that Power BI uses to create its visualizations.
    Example: Excel sheets, Oracle or SQL server tables.
  3. Reports: Report is a collection of visualizations that appear together on one or more pages.
    Example: Sales by Country, State, City Report, Logistic Performance report, Profit by Products report etc.
  4. Dashboards: Dashboard is single layer presentation of multiple visualizations, i.e we can integrate one or more visualizations into one page layer.
    Example: Sales dashboard can have pie charts, geographical maps and bar charts.
  5. Tiles: Tile is a single visualization in a report or on a dashboard.
    Example: Pie Chart in Dashboard or Report.

Q7. What are the different types of filters in Power BI Reports?

Power BI provides a variety of options to filter reports, data, and visualizations. The following are the filter types:

  • Visual-level Filters: These filters work on only an individual visualization, reducing the amount of data that the visualization can see. Moreover, visual-level filters can filter both data and calculations.
  • Page-level Filters: These filters work at the report-page level. Different pages in the same report can have different page-level filters.
  • Report-level Filters: These filters work on the entire report, filtering all pages and visualizations included in the report.

We know that Power BI visuals have an interactions feature, which makes filtering a report a breeze. Visual interactions are useful, but they come with some limitations:

  • The filter is not saved as part of the report. Whenever you open a report, you can begin to play with visual filters but there is no way to store the filter in the saved report.
  • The filter is always visible. Sometimes you want a filter for the entire report, but you do not want any visual indication of the filter being applied.


Q8. What are content packs in Power BI?

Content packs for services are pre-built solutions for popular services as part of the Power BI experience. A subscriber to a supported service can quickly connect to their account from Power BI to see their data through live dashboards and interactive reports that have been pre-built for them. Microsoft has released content packs for popular services such as Salesforce.com, Marketo, Adobe Analytics, Azure Mobile Engagement, CircuitID, comScore Digital Analytix, Quickbooks Online, SQL Sentry and tyGraph. Organizational content packs provide users, BI professionals, and system integrators the tools to build their own content packs to share purpose-built dashboards, reports, and datasets within their organization.

 


Q9. What is DAX?

To do basic calculations and data analysis on data in Power Pivot, we use Data Analysis Expressions (DAX). It is a formula language used to compute calculated columns and calculated fields (measures).

  • DAX works on column values.
  • DAX cannot modify or insert data.
  • We can create calculated columns and measures with DAX, but we cannot create calculated rows using DAX.

Sample DAX formula syntax:

Total Sales = SUM(Sales[SalesAmount])

For the measure named Total Sales, calculate (=) the SUM of values in the [SalesAmount] column in the Sales table. The parts of the formula are:

A – Measure name (Total Sales)

B – The equals sign (=), which indicates the beginning of the formula

C – The DAX function (SUM)

D – Parentheses enclosing the function argument

E – The referenced table (Sales)

F – The referenced column name ([SalesAmount])

Q10. What are some of the DAX functions?

Below are some of the most commonly used DAX functions:

  • SUM, MIN, MAX, AVG, COUNTROWS, DISTINCTCOUNT
  • IF, AND, OR, SWITCH
  • ISBLANK, ISFILTERED, ISCROSSFILTERED
  • VALUES, ALL, FILTER, CALCULATE,
  • UNION, INTERSECT, EXCEPT, NATURALINNERJOIN, NATURALLEFTOUTERJOIN,
    SUMMARIZECOLUMNS, ISEMPTY,
  • VAR (Variables)
  • GEOMEAN, MEDIAN, DATEDIFF

Q11. How is the FILTER function used?

The FILTER function returns a table with a filter condition applied to each of its source table rows. The FILTER function is rarely used in isolation; it is generally used as a parameter to other functions such as CALCULATE.

  • FILTER is an iterator and thus can negatively impact performance over large source tables.
  • Complex filtering logic can be applied such as referencing a measure in a filter expression.
    • FILTER(MyTable,[SalesMetric] > 500)


Q12. What are the functions and limitations of DAX?

CALCULATE and CALCULATETABLE are the only functions that allow you to modify the filter context of measures or tables. They can:

  • Add to existing filter context of queries.
  • Override filter context from queries.
  • Remove existing filter context from queries.

Limitations:

  • Filter parameters can only operate on a single column at a time.
  • Filter parameters cannot reference a metric.

Q13. What are SUMMARIZE() and SUMMARIZECOLUMNS() in DAX?

SUMMARIZE()

  • Main group by function in SSAS.
  • Recommended practice is to specify the table and the group-by columns but not the metrics; you can use the ADDCOLUMNS function to add the metrics.

 SUMMARIZECOLUMNS

  • New group by function for SSAS and Power BI Desktop; more efficient.
  • Specify group by columns, table, and expressions.


Q14. What are some benefits of using Variables in DAX ?

Below are some of the benefits:

  • By declaring and evaluating a variable, the variable can be reused multiple times in a DAX expression, thus avoiding additional queries of the source database.
  • Variables can make DAX expressions more intuitive/logical to interpret.
  • Variables are only scoped to their measure or query, they cannot be shared among measures, queries or be defined at the model level.

 

Q15. How would you create trailing X month metrics via DAX against a non-standard calendar?

The  solution will involve:

  1. CALCULATE function to control (take over) filter context of measures.
  2. ALL to remove existing filters on the date dimension.
  3. FILTER to identify which rows of the date dimension to use.

Alternatively, CONTAINS may be used:

  • CALCULATE(FILTER(ALL(‘DATE’),…….))


Q16. What are the different BI add-ins for Excel?

Below are the most important BI add-ins for Excel:

  • Power Query: It helps in finding, editing and loading external data.
  • Power Pivot: Its mainly used for data modeling and analysis.
  • Power View: It is used to design visually interactive reports.
  • Power Map: It helps to display insights on 3D Map.

Q17. What is Power Pivot?

Power Pivot is an add-in for Microsoft Excel 2010 that enables you to import millions of rows of data from multiple data sources into a single Excel workbook. It lets you create relationships between heterogeneous data, create calculated columns and measures using formulas, build PivotTables and PivotCharts. You can then further analyze the data so that you can make timely business decisions without requiring IT assistance.

Q18. What is Power Pivot Data Model?

It is a model that is made up of data types, tables, columns, and table relations. These data tables are typically constructed for holding data for a business entity.


Q19. What is xVelocity in-memory analytics engine used in Power Pivot?

The main engine behind Power Pivot is the xVelocity in-memory analytics engine. It can handle large amounts of data because it stores data in a columnar database and performs in-memory analytics, which results in faster processing as all the data is loaded into RAM.

Q20. What are some of differences in data modeling between Power BI Desktop and Power Pivot for Excel?

Here are some of the differences:

  • Power BI Desktop supports bi-directional cross filtering relationships, security, calculated tables, and Direct Query options.
  • Power Pivot for Excel has single direction (one to many) relationships, calculated columns only, and supports import mode only. Security roles cannot be defined in Power Pivot for Excel.

Q21. Can we have more than one active relationship between two tables in data model of power pivot?

No, we cannot have more than one active relationship between two tables. However, we can have more than one relationship between two tables; only one of them will be active and the rest will be inactive. In the model diagram, dotted lines represent inactive relationships and continuous lines represent active ones.

Q22. What is Power Query?

Power Query is an ETL tool used to shape, clean, and transform data through an intuitive interface, without having to write code. It helps the user to:

  • Import data from a wide range of sources: files, databases, big data, social media data, etc.
  • Join and append data from multiple data sources.
  • Shape data as per requirements by removing and adding data.

 

 

Q23. What are the data destinations for Power Queries?

There are two destinations for output we get from power query:

  • Load to a table in a worksheet.
  • Load to the Excel Data Model.

 

 

Q24. What is query folding in Power Query?

Query folding is when steps defined in Power Query/Query Editor are translated into SQL and executed by the source database rather than the client machine. It’s important for processing performance and scalability, given limited resources on the client machine.

 

 


Q25. What are some common Power Query/Editor Transforms?

Changing Data Types, Filtering Rows, Choosing/Removing Columns, Grouping, Splitting a column into multiple columns, Adding new Columns, etc.

Q26. Can SQL and Power Query/Query Editor be used together?

Yes, a SQL statement can be defined as the source of a Power Query/M function for additional processing/logic. This would be a good practice to ensure that an efficient database query is passed to the source and avoid unnecessary processing and complexity by the client machine and M function.

Q28. What are query parameters and Power BI templates?

Query parameters can be used to provide users of a local Power BI Desktop report with a prompt, to specify the values they’re interested in.

  • The parameter selection can then be used by the query and calculations.
  • PBIX files can be exported as Templates (PBIT files).
  • Templates contain everything in the PBIX except the data itself.

Parameters and templates can make it possible to share/email smaller template files and limit the amount of data loaded into the local PBIX files, improving processing time and experience.

Q29. Which language is used in Power Query?

Power Query uses a programming language called M (M-code). It is easy to use and similar to other languages, and it is case-sensitive.

Q30. Why do we need Power Query when Power Pivot can import data from mostly used sources?

Power Query is a self-service ETL (Extract, Transform, Load) tool which runs as an Excel add-in. It allows users to pull data from various sources, manipulate said data into a form that suits their needs, and load it into Excel. It is preferable to use Power Query over Power Pivot for this because it lets you not only load the data but also manipulate it as per the user's needs while loading.

 

Q31. What is Power Map?

Power Map is an Excel add-in that provides you with a powerful set of tools to help you visualize and gain insight into large sets of data that have a geo-coded component. It can help you produce 3D visualizations by plotting up to a million data points in the form of column, heat, and bubble maps on top of a Bing map. If the data is time-stamped, it can also produce interactive views that display how the data changes over space and time.

Q32. What are the primary requirements for a table to be used in Power Map?

For data to be consumed in Power Map, there should be location data such as:

  • Latitude/Longitude pair
  • Street, City, Country/Region, Zip Code/Postal Code, and State/Province, which can be geolocated by Bing

The primary requirement for the table is that it contains unique rows. It must also contain location data, which can be in the form of a Latitude/Longitude pair, although this is not a requirement. You can use address fields instead, such as Street, City, Country/Region, Zip Code/Postal Code, and State/Province, which can be geolocated by Bing.

Q33. What are the data sources for Power Map?

The data can either be present in Excel or could be present externally. To prepare your data, make sure all of the data is in Excel table format, where each row represents a unique record. Your column headings or row headings should contain text instead of actual data, so that Power Map will interpret it correctly when it plots the geographic coordinates. Using meaningful labels also makes value and category fields available to you when you design your tour in the Power Map Tour Editor pane.

To use a table structure which more accurately represents time and geography inside Power Map, include all of the data in the table rows and use descriptive text labels in the column headings, like this:

(Image: example of the correct table format for Power Map)

In case you wish to load your data from an external source:

  1. In Excel, click Data > the connection you want in the Get External Data group.
  2. Follow the steps in the wizard that starts.
  3. On the last step of the wizard, make sure Add this data to the Data Model is checked.

 


Q34. What is Power View?

Ans: Power View is a data visualization technology that lets you create interactive charts, graphs, maps, and other visuals which bring your data to life. Power View is available in Excel, SharePoint, SQL Server, and Power BI.

The following pages provide details about different visualizations available in Power View:

  • Charts
  • Line charts
  • Pie charts
  • Maps
  • Tiles
  • Cards
  • Images
  • Tables
  • Power View
  • Multiples Visualizations
  • Bubble and scatter charts
  • Key performance indicators (KPIs)

Q35. What is Power BI Designer?

Ans: It is a stand-alone application where we can build Power BI reports and then upload them to Powerbi.com; it does not require Excel. It is essentially a combination of Power Query, Power Pivot, and Power View.

Q36. Can we refresh our Power BI reports once uploaded to the cloud (SharePoint or Powerbi.com)?

Ans: Yes, we can refresh our reports through the Data Management Gateway (for SharePoint) and the Power BI Personal Gateway (for Powerbi.com).

Q37. What are the different types of refreshing data for our published reports?

Ans: There are four main types of refresh in Power BI. Package refresh, model or data refresh, tile refresh and visual container refresh.

  • Package refresh

This synchronizes your Power BI Desktop, or Excel, file between the Power BI service and OneDrive, or SharePoint Online. However, this does not pull data from the original data source. The dataset in Power BI will only be updated with what is in the file within OneDrive, or SharePoint Online.

  • Model/data refresh

It refers to refreshing the dataset, within the Power BI service, with data from the original data source. This is done by either using scheduled refresh or refresh now. This requires a gateway for on-premises data sources.

  • Tile refresh

Tile refresh updates the cache for tile visuals, on the dashboard, once data changes. This happens about every fifteen minutes. You can also force a tile refresh by selecting the ellipsis (…) in the upper right of a dashboard and selecting Refresh dashboard tiles.

  • Visual container refresh

Refreshing the visual container updates the cached report visuals, within a report, once the data changes.

Q38. Is Power BI available on-premises?

No, Power BI is not available as a private, internal cloud service. However, with Power BI and Power BI Desktop, you can securely connect to your own on-premises data sources. With the On-premises Data Gateway, you can connect live to your on-premises SQL Server Analysis Services and other data sources. You can also schedule refreshes with a centralized gateway. If a gateway is not available, you can refresh data from on-premises data sources using the Power BI Gateway – Personal.

 

 

Q39. What is data management gateway and Power BI personal gateway?

The gateway acts as a bridge between on-premises data sources and Azure cloud services.

Personal Gateway:

  • Import Only, Power BI Service Only, No central monitoring/managing.
  • Can only be used by one person (personal); can’t allow others to use this gateway.

On-Premises Gateway:

  • Import and Direct Query supported.
  • Multiple users of the gateway for developing content.
  • Central monitoring and control.

 

 

Q40. What is Power BI Q&A?

Power BI Q&A is a natural language tool which helps you query your data and get the results you need from it. You do this by typing into a dialog box on your dashboard, and the engine instantaneously generates an answer similar to Power View. Q&A interprets your question and shows you a restated query of what it is looking for in your data. Q&A was developed by the Server and Tools, Microsoft Research, and Bing teams to give you a complete feeling of truly exploring your data.

Q41. What are some ways that Excel experience can be leveraged with Power BI?

Below are some of the ways through which Excel experience can be leveraged with Power BI:

  • The Power BI Publisher for Excel:
    • Can be used to pin Excel items (charts, ranges, pivot tables) to Power BI Service.
    • Can be used to connect to datasets and reports stored in Power BI Service.
  • Excel workbooks can be uploaded to Power BI and viewed in the browser like Excel Services.
  • Excel reports in the Power BI service can be shared via Content Packs like other reports.
  • Excel workbooks (model and tables) can be exported to service for PBI report creation.
  • Excel workbook Power Pivot models can be imported to Power BI Desktop models.

 

 

Q42. What is a calculated column in Power BI and why would you use them?

Calculated Columns are DAX expressions that are computed during the model’s processing/refresh process for each row of the given column and can be used like any other column in the model.

Calculated columns are not compressed and thus consume more memory and result in reduced query performance. They can also reduce processing/refresh performance if applied on large fact tables and can make a model more difficult to maintain/support, given that the calculated column is not present in the source system.

 

 


Q43. How is data security implemented in Power BI ?

Power BI can apply Row Level Security roles to models.

  • A DAX expression is applied on a table filtering its rows at query time.
  • Dynamic security involves the use of USERNAME functions in security role definitions.
  • Typically a table is created in the model that relates users to specific dimensions and a role.

 

Q44. What are many-to-many relationships and how can they be addressed in Power BI ?

Many-to-many relationships involve a bridge or junction table reflecting the combinations of two dimensions (e.g. doctors and patients), either all possible combinations or only those combinations that have occurred.

  • Bi-Directional Crossfiltering relationships can be used in PBIX.
  • CROSSFILTER function can be used in Power Pivot for Excel.
  • DAX can be used per metric to check and optionally modify the filter context.

 

 

Q45. Why might you have a table in the model without any relationships to other tables?

There are mainly 2 reasons why we would have tables without relations in our model:

  • A disconnected table might be used to present the user with parameter values to be exposed and selected in slicers (e.g. growth assumption.)
    • DAX metrics could retrieve this selection and use it with other calculations/metrics.
  • A disconnected table may also be used as a placeholder for metrics in the user interface.
    • It may not contain any rows of data and its columns could be hidden but all metrics are visible.

 


Q46. What is the Power BI Publisher for Excel?

You can use Power BI publisher for Excel to pin ranges, pivot tables and charts to Power BI.

  • The user can manage the tiles – refresh them, remove them, in Excel.
  • Pinned items must be removed from the dashboard in the service (removing in Excel only deletes the connection).
  • The Power BI Publisher for Excel can also be used to connect from Excel to datasets that are hosted in the Power BI Service.
  • An Excel pivot table is generated with a connection (ODC file) to the data in Azure.

 

 

Q47. What are the differences between a Power BI Dataset, a Report, and a Dashboard?

Dataset: The source used to create reports and visuals/tiles.

  • A data model (local to PBIX or XLSX) or model in an Analysis Services Server
  • Data could be inside of model (imported) or a Direct Query connection to a source.

Report: An individual Power BI Desktop file (PBIX) containing one or more report pages.

  • Built for deep, interactive analysis experience for a given dataset (filters, formatting).
  • Each report is connected to at least one dataset.
  • Each page contains one or more visuals or tiles.

Dashboard: a collection of visuals or tiles pinned from different reports and, optionally, other pinned items (for example, Excel ranges).

  • Built to aggregate primary visuals and metrics from multiple datasets.

 

 


Q48. What are the three Edit Interactions options of a visual tile in Power BI Desktop?

The 3 edit interaction options are  Filter, Highlight, and None.

Filter: It completely filters a visual/tile based on the filter selection of another visual/tile.

Highlight: It highlights only the related elements on the visual/tile and grays out the non-related items.

None: It ignores the filter selection from another tile/visual.

 

Q49. What are some of the differences in report authoring capabilities between using a live or direct query connection such as to an Analysis Services model, relative to working with a data model local to the Power BI Desktop file?

With a data model local to the PBIX file (or Power Pivot workbook), the author has full control over the queries, the modeling/relationships, the metadata and the metrics.

With a live connection to an Analysis Services database (cube) the user cannot create new metrics, import new data, change the formatting of the metrics, etc – the user can only use the visualization, analytics, and formatting available on the report canvas.

With a direct query model in Power BI to SQL Server, for example, the author has access to the same features (and limitations) available to SSAS  Direct Query mode.

  • Only one data source (one database on one server) may be used, certain DAX functions are not optimized, and the user cannot use Query Editor functions that cannot be translated into SQL statements.

 


Q50. How does SSRS integrate with Power BI?

Below are some of the way through which SSRS can be integrated with Power BI:

  • Certain SSRS Report items such as charts can be pinned to Power BI dashboards.
  • Clicking the tile in Power BI dashboards will bring the user to the SSRS report.
  • A subscription is created to keep the dashboard tile refreshed.
  • Power BI reports will soon be able to be published to SSRS portal

The Data Monk services

We are well known for our interview books and have 70+ e-books across Amazon and The Data Monk e-shop page. Following are the best-seller combo packs and services that we are providing as of now:

  1. YouTube channel covering all the interview-related important topics in SQL, Python, MS Excel, Machine Learning Algorithm, Statistics, and Direct Interview Questions
    Link – The Data Monk Youtube Channel
  2. Website – ~2000 completely solved interview questions in SQL, Python, ML, and Case Study
    Link – The Data Monk website
  3. E-book shop – We have 70+ e-books available on our website and 3 bundles covering 2000+ solved interview questions. Do check it out
    Link – The Data E-shop Page
  4. Instagram Page – It covers only Most asked Questions and concepts (100+ posts). We have 100+ most asked interview topics explained in simple terms
    Link – The Data Monk Instagram page
  5. Mock Interviews/Career Guidance/Mentorship/Resume Making
    Book a slot on Top Mate

The Data Monk e-books

We know that each domain requires a different type of preparation, so we have divided our books in the same way:

1. 2200 Interview Questions to become Full Stack Analytics Professional – 2200 Most Asked Interview Questions
2. Data Scientist and Machine Learning Engineer -> 23 e-books covering all the ML Algorithms Interview Questions
3. 30 Days Analytics Course – Most Asked Interview Questions from 30 crucial topics

You can check out all the other e-books on our e-shop page – Do not miss it


For any information related to courses or e-books, please send an email to nitinkamal132@gmail.com

Statistics Interview Questions

Q1. What is a Sample?

A. A data sample is a set of data collected and/or selected from a statistical population by a defined procedure. The elements of a sample are known as sample points, sampling units, or observations.

Q2. Define Population.

A. In statistics, population refers to the total set of observations that can be made. For example, if we are studying the weight of adult women, the population is the set of weights of all the women in the world

Q3. What is a Data Point?

A. In statistics, a data point (or observation) is a set of one or more measurements on a single member of a statistical population.

Q4. Explain Data Sets.

A. Data sets usually come from actual observations obtained by sampling a statistical population, and each row corresponds to the observations on one element of that population. Data sets may further be generated by algorithms for the purpose of testing certain kinds of software.

Q5. What is meant by the term Inferential Statistics?

A. Inferential statistics use a random sample of data taken from a population to describe and make inferences about the population. Inferential statistics are valuable when examination of each member of an entire population is not convenient or possible.

Q6. Give an example of Inferential Statistics

  A. You asked five of your classmates about their height. On the basis of this information, you stated that the average height of all students in your university or college is 67 inches.

Q7. What is Descriptive Statistics?

A. Descriptive statistics are brief descriptive coefficients that summarize a given data set, which can be either a representation of the entire or a sample of a population. Descriptive statistics are broken down into measures of central tendency and measures of variability (spread).

Q8. What is the range of data?

A. It tells us how spread out the data in a set is. In other words, it is defined as the difference between the highest and the lowest value present in the set.

X=[2 3 4 4 3 7 9]

Range(x)%return (9-2)=7

Q9. Define Measurement.

A. Data can be classified as being on one of four scales: 

  • nominal
  • ordinal
  • interval
  • ratio

Each level of measurement has some important properties that are useful to know. For example, only the ratio scale has meaningful zeros.

Q10. What is a Nominal Scale?

A. Nominal variables (also called categorical variables) can be placed into categories. They don’t have a numeric value and so cannot be added, subtracted, divided or multiplied. They also have no order; if they appear to have an order then you probably have ordinal variables instead

Q11. What is an Ordinal Scale?

A. The ordinal scale contains things that you can place in order. For example, hottest to coldest, lightest to heaviest, richest to poorest. Basically, if you can rank data by 1st, 2nd, 3rd place (and so on), then you have data that’s on an ordinal scale.

Q12. What is an Interval Scale?

A. An interval scale has ordered numbers with meaningful divisions. Temperature is on the interval scale: a difference of 10 degrees between 90 and 100 means the same as 10 degrees between 150 and 160. Compare that to high school ranking (which is ordinal), where the difference between 1st and 2nd might be .01 and between 10th and 11th .5. If you have meaningful divisions, you have something on the interval scale.

Q13. Explain Ratio Scale.

A. The ratio scale is exactly the same as the interval scale with one major difference: zero is meaningful. For example, a height of zero is meaningful (it means you don’t exist). Compare that to a temperature of zero, which while it exists, it doesn’t mean anything in particular.

Q14. What do you mean by Bayesian?

A. Bayesians condition on the data observed and considered the probability distribution on the hypothesis. Bayesian statistics provides us with mathematical tools to rationally update our subjective beliefs in light of new data or evidence.

Q15. What is Frequentist?

A. Frequentists condition on a hypothesis of choice and consider the probability distribution on the data, whether observed or not. Frequentist statistics uses rigid frameworks, the type of frameworks that you learn in basic statistics, like hypothesis testing, p-values, and confidence intervals.

Q16. What is P-Value??

A. In statistical significance testing, it is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true.

Q17. What is a Confidence Interval?

A. A confidence interval, in statistics, refers to the probability that a population parameter will fall between two set values for a certain proportion of times. Confidence intervals measure the degree of uncertainty or certainty in a sampling method.
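
As an illustration, here is a rough Python sketch of a 95% confidence interval for a mean using the normal approximation; the sample values are hypothetical.

import numpy as np

sample = np.array([67, 64, 70, 66, 68, 71, 65, 69, 67, 66])
mean = sample.mean()
se = sample.std(ddof=1) / np.sqrt(len(sample))   # standard error of the mean

lower, upper = mean - 1.96 * se, mean + 1.96 * se
print("95% CI for the mean: ({:.2f}, {:.2f})".format(lower, upper))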

Q18. Explain Hypothesis Testing.

A. Hypothesis testing is an act in statistics whereby an analyst tests an assumption regarding a population parameter. The methodology employed by the analyst depends on the nature of the data used and the reason for the analysis. Hypothesis testing is used to infer the result of a hypothesis performed on sample data from a larger population.

Q19. What is likelihood?

A. The probability of some observed outcomes given a set of parameter values is regarded as the likelihood of the set of parameter values given the observed outcomes.

Q20. What is sampling?

A. Sampling is that part of statistical practice concerned with the selection of an unbiased or random subset of individual observations within a population of individuals intended to yield some knowledge about the population of concern.

Q21. What are Sampling Methods?

A. There are 4 sampling methods:

  • Simple Random
  • Systematic
  • Cluster
  • Stratified

Q22. What is Mode?

A. The mode of a data sample is the element that occurs the most number of times in the data collection.

       X=[1 2 4 4 4 4 5 5]

       Mode(x) % returns 4

Q23. What is Median?

A. It is described as the numeric value that separates the lower half of a sample (or of a probability distribution) from the upper half. It can be easily calculated by arranging all the samples from highest to lowest (or vice versa) and picking the middle one.

      X=[2 4 1 3 4 4 3]

      X=[1 2 3 3 4 4 4]

      Median(x)% return 3

Q24. What is meant by Quartile?

A. It is a type of quantile that divides the data points into four more or less equal parts (quarters). Each quartile contains 25% of the total observations. Generally, the data is arranged from smallest to largest.

Q25. What is Moment?

A. It is the quantitative measure of the shape of a set of points. It comprises a set of statistical parameters to measure a distribution. Four moments are commonly used (a short Python sketch computing them follows this list):

  • Mean
  • Skewness
  • Variance
  • Kurtosis
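
A small Python sketch computing the four moments on a hypothetical sample, assuming NumPy and SciPy are available.

import numpy as np
from scipy.stats import skew, kurtosis

x = np.array([1, 2, 3, 3, 6, 8, 10])   # hypothetical sample

print("Mean:    ", np.mean(x))      # 1st moment: central tendency
print("Variance:", np.var(x))       # 2nd central moment: spread
print("Skewness:", skew(x))         # 3rd standardised moment: asymmetry
print("Kurtosis:", kurtosis(x))     # 4th standardised moment: tail heaviness (excess kurtosis)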

Q26. What is the Mean of data?

A. The statistical mean refers to the mean or average that is used to derive the central tendency of the data in question. It is determined by adding all the data points in a population and then dividing the total by the number of points.

X=[1 2 3 3  6]

Sum=1+2+3+3+6=15

Mean(x)%return (sum/5)=3

Q27. Define Skewness.

A. Skewness is a measure of the asymmetry of the data around the sample mean. If it is negative, the data are spread out more to the left of the mean than to the right, and vice versa.

Q28. What is Variance?

A. It describes how far the value lies from the Mean. A small variance indicates that the data points tend to be very close to the mean, and to each other. A high variance indicates that the data points are very spread out from the mean, and from one another. Variance is the average of the squared distances from each point to the mean.

Q29. Define Standard Deviation.

A. In statistics, the standard deviation is a measure of the amount of variation or dispersion of a set of values. A low standard deviation indicates that the values tend to be close to the mean of the set, while a high standard deviation indicates that the values are spread out over a wider range.

Q30. What is Kurtosis?

A. Kurtosis is a measure of how outlier-prone a distribution is. In other words, kurtosis identifies whether the tails of a given distribution contain extreme values.

Q31. What is meant by Covariance?

A. Covariance measures the directional relationship between the returns on two assets. A positive covariance means that asset returns move together while a negative covariance means they move inversely. Covariance is calculated by analyzing at-return surprises (standard deviations from the expected return) or by multiplying the correlation between the two variables by the standard deviation of each variable. It gives the measure of how much two variable change together.

Q32. What is Alternative Hypothesis?

A. The Alternative hypothesis (denoted by H1 ) is the statement that must be true if the null hypothesis is false.

Q33. Explain Significance Level.

A. The probability of rejecting the null hypothesis when it is true is called the significance level α, and very common choices are α = 0.05 and α = 0.01.

Q34. Do you know what is Binary search?

A. For binary search, the array should be arranged in ascending or descending order. In each step, the algorithm compares the search key value with the key value of the middle element of the array. If the keys match, then a matching element has been found and its index, or position, is returned. Otherwise, if the search key is less than the middle element’s key, then the algorithm repeats its action on the sub-array to the left of the middle element or, if the search key is greater, on the sub-array to the right.
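
A minimal Python sketch of the algorithm described above; the input list is hypothetical and must be sorted in ascending order.

def binary_search(arr, key):
    low, high = 0, len(arr) - 1
    while low <= high:
        mid = (low + high) // 2
        if arr[mid] == key:
            return mid              # keys match: element found, return its index
        elif key < arr[mid]:
            high = mid - 1          # repeat on the sub-array to the left of the middle
        else:
            low = mid + 1           # repeat on the sub-array to the right of the middle
    return -1                       # key not present

print(binary_search([2, 5, 8, 12, 16, 23, 38], 16))   # prints 4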

Q35. Explain Hash Table.

A. A hash table is a data structure used to implement an associative array, a structure that can map keys to values. A hash table uses a hash function to compute an index into an array of buckets or slots, from which the correct value can be found.
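
A toy Python sketch of the idea: a hash function maps each key to a bucket index, and colliding keys share a bucket (chaining). This is only an illustration; Python's built-in dict already implements a hash table.

class SimpleHashTable:
    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]

    def _index(self, key):
        return hash(key) % len(self.buckets)     # compute the bucket index from the key

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                          # key already present: overwrite the value
                bucket[i] = (key, value)
                return
        bucket.append((key, value))               # collisions are chained inside the bucket

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        raise KeyError(key)

table = SimpleHashTable()
table.put("apple", 10)
print(table.get("apple"))   # prints 10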

Q36. What is Null Hypothesis?

A. The null hypothesis (denoted by H0) is a statement about the value of a population parameter (such as the mean), and it must contain the condition of equality and must be written with the symbol =, ≤, or ≥.

Q37. When You Are Creating A Statistical Model How Do You Prevent Over-fitting?

A.  It can be prevented by cross-validation

Q38. What do you mean by Cross-validation?

A. Cross-validation is a model validation technique for assessing how the results of a statistical analysis (model) will generalize to an independent data set. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice.
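
A short Python sketch of k-fold cross-validation, assuming scikit-learn is available; the data set here is synthetic.

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)

# 5-fold cross-validation: each fold is held out once and scored on unseen data,
# which flags models that over-fit the training folds.
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print("R^2 per fold:", scores)
print("Mean R^2:", scores.mean())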

Q39. What is Linear regression?

A. A linear regression is a good tool for quick predictive analysis: for example, the price of a house depends on a myriad of factors, such as its size or its location. In order to see the relationship between these variables, we need to build a linear regression, which predicts the line of best fit between them and can help conclude whether or not these two factors have a positive or negative relationship.

Q40. What are the assumptions required for linear regression?

A. There are four major assumptions:

  1. There is a linear relationship between the dependent variables and the regressors, meaning the model you are creating actually fits the data
  2.  The errors or residuals of the data are normally distributed and independent from each other
  3.  There is minimal multicollinearity between the explanatory variables
  4.  Homoscedasticity. This means the variance around the regression line is the same for all values of the predictor variable.

Q41. What is Multiple Regression?

A. Multiple regression generally explains the relationship between multiple independent or predictor variables and one dependent or criterion variable.  A dependent variable is modeled as a function of several independent variables with corresponding coefficients, along with the constant term.  Multiple regression requires two or more predictor variables, and this is why it is called multiple regression.
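
A minimal Python sketch of a multiple regression with two predictors, assuming scikit-learn; the house-price numbers are made up purely for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1000, 10], [1500, 8], [2000, 12], [2500, 5], [3000, 3]])   # [size_sqft, distance_km]
y = np.array([50, 72, 90, 130, 160])                                      # price (hypothetical units)

model = LinearRegression().fit(X, y)
print("Coefficients:", model.coef_)       # one coefficient per predictor
print("Intercept:", model.intercept_)
print("Prediction for 1800 sqft, 7 km:", model.predict([[1800, 7]]))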

Q42. What is a Statistical Interaction?

A. Basically, an interaction is when the effect of one factor (input variable) on the dependent variable (output variable) differs among levels of another factor.

Q43. What is an example of a data set with a non-Gaussian distribution?

A. The Gaussian distribution is part of the Exponential family of distributions, but there are a lot more of them, with the same sort of ease of use in many cases, and if the person doing the machine learning has a solid grounding in statistics, they can be utilized where appropriate. Common examples of non-Gaussian data sets are household incomes or website session durations, which are typically right-skewed (closer to log-normal or exponential) rather than Gaussian.

Q44. Define Correlation.

A. Correlation is a statistical technique that can show whether and how strongly pairs of variables are related.

For example: height and weight are related; taller people tend to be heavier than shorter people. The relationship isn’t perfect. People of the same height vary in weight, and you can easily think of two people you know where the shorter one is heavier than the taller one. Nonetheless, the average weight of people 5’5” is less than the average weight of people 5’6”, and their average weight is less than that of people 5’7”, etc.

Correlation can tell you just how much of the variation in peoples’ weights is related to their heights.

Q45. What is primary goal of A/B Testing?

A. A/B testing refers to a statistical hypothesis with two variables A and B. The primary goal of A/B testing is the identification of any changes to the web page for maximizing or increasing the outcome of interest. A/B testing is a fantastic method for finding the most suitable online promotional and marketing strategies for the business.
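
As a rough illustration, here is a Python sketch comparing two variants with a two-sample t-test (SciPy); the conversion data are simulated, and in practice a proportions test or a dedicated A/B framework may be preferred.

import numpy as np
from scipy.stats import ttest_ind

np.random.seed(0)
variant_a = np.random.binomial(1, 0.10, size=1000)   # 10% baseline conversion rate
variant_b = np.random.binomial(1, 0.12, size=1000)   # 12% conversion with the page change

t_stat, p_value = ttest_ind(variant_a, variant_b)
print("p-value:", p_value)   # a small p-value suggests the difference is unlikely to be chance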

Q46. What is meaning of Statistical Power of Sensitivity?

A. The statistical power of sensitivity refers to the validation of the accuracy of a classifier, which can be Logistic, SVM, Random Forest, etc. Sensitivity is basically True Positives / Total Actual Positives, i.e. correctly predicted events divided by total events.

Q47. Explain Over-fitting.

A. In the case of over-fitting, the model is highly complex, for example having too many parameters relative to the number of observations. An overfit model has poor predictive performance because it overreacts to minor fluctuations in the training data.

Q48. Explain Under-fitting

A. In the case of under-fitting, the underlying trend of the data cannot be captured by the statistical model or the machine learning algorithm. Such a model also has poor predictive performance.

Q49. What is Long Format Data?

A. In the long format, each row represents one time point per subject, so each subject's data is spread across multiple rows, with one column identifying the variable or time point.

Q50. What is Wide Format Data?

A. In the wide format, the repeated responses of the subject will fall in a single row, and each response will go in a separate column.
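
A short pandas sketch of converting between the two formats (the subject/visit data is invented for illustration):

import pandas as pd

wide = pd.DataFrame({"subject": [1, 2],
                     "visit_1": [5.1, 4.8],
                     "visit_2": [5.4, 5.0]})                                  # wide: one row per subject
long = wide.melt(id_vars="subject", var_name="visit", value_name="score")     # long: one row per subject per visit
back_to_wide = long.pivot(index="subject", columns="visit", values="score")   # and back again
print(long)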

Complete path to master SQL before interview

We have interviewed a lot of candidates and found that SQL is still something very under-explored by people who want to get deep into this domain.

Remember – Data Science is not all about SQL, but it is the bread and butter for most jobs irrespective of your profile.

This is a small post which will cater around the ways to master your SQL skills.

I am assuming that you are a complete noob in SQL, skip according to your expertise

1. Start with either w3school or tutorials point.
It should not take more than 8-10 hours for you to complete the tutorial(irrespective of your Engineering branch/ current domain)

2. Go for SQLZoo. Solve all the questions.
If you get stuck, then try this link which has all the solved questions. It should take you around 15 hours

3. Once this is done, create an account on HackerRank and try their SQL track.
Try all the easy questions first and then slowly move to the medium-level questions. It should not take you more than 20 hours; earn at least a 4-star rating before moving ahead, and do follow the discussion panel

If you are good with the above 3, then do try our four pages(This is not a self promotion, but we have hand picked some of the important questions which you should definitely solve before your interview)

https://thedatamonk.com/day-3-basic-queries-to-get-you-started/
https://thedatamonk.com/day-4-sql-intermediate-questions/
https://thedatamonk.com/day-5-sql-advance-concepts/
https://thedatamonk.com/day-6-less-asked-sql-questions/

You are already interview ready, send me a mail at nitinkamal132@gmail.com or contact@thedatamonk.com to get a free copy of our ebook or purchase it on Amazon

https://www.amazon.in/Questions-Crack-Business-Analyst-Interview-ebook/dp/B01K16FLC4
https://www.amazon.in/Write-better-queries-interview-Questions-ebook/dp/B076BXFGW1

You are not done yet; complete the HackerRank hard questions as well.

This will give you the knowledge you need to crack any Data Science SQL interview round.

For Python, R, and statistics we will have separate posts.

Keep Learning 🙂
The Data Monk

Linear Regression Part 2 – Implementation of LR

We already know the assumptions of the Linear Regression. We will quickly go through the implementation part of the Linear Regression. We will be using R for this article(Read Sometimes, I am more comfortable in R 😛 )

Remember – The easiest part in any modeling project is to implement the model. The major pain point is cleaning the data, understanding important variables, and checking the performance of the model.

We will quickly go through the packages which you need, the functions and terminologies you need to understand in order to run a Linear Regression Model.

If you want to learn more about Linear Regression and ARIMA forecasting in R with 100 questions then you can try our book on Amazon ‘100 Linear Regression and ARIMA forecasting questions in R’

Packages in R
You don’t need any specific package to run the lm() function in R which is used to create a Linear Regression model.

Step 1 – Get your train and test dataset in a variable.
Remember – The names of the columns and the number of columns in both the datasets should be the same.

# Use forward slashes (or doubled backslashes) in file paths in R
pizza <- read.csv("C:/Users/User/Desktop/TDM Book/PaulPizza.csv")
pizzaTest <- read.csv("C:/Users/User/Desktop/TDM Book/PaulPizzaTest.csv")

pizza contains the training dataset and pizzaTest contains the testing dataset

Top rows of the training dataset
Top rows of the testing dataset

You can also do a multi-fold validation to randomly decide the training and test datasets, but we will keep it straight and simple and take the training and testing datasets manually. We would not encourage you to do this in a real project. Anyway, we now have the training and testing datasets with us.

LinearModel <- lm(NumberOfPizzas ~ Spring+WorkingDays,
data = pizza)

You are creating an LR model with the name LinearModel. The function lm() takes the dependent variable and two independent variables, i.e. Spring and WorkingDays. The dataset is pizza, i.e. the training dataset of the problem statement.

Let’s look into the summary to analyze the model

summary(LinearModel)

What are the residuals?
In regression analysis, the difference between the observed value of the dependent variable (y) and the predicted value (ŷ) is called the residual (e). Each data point has one residual. Both the sum and the mean of the residuals are equal to zero.

What is coefficient-estimate?
It is the expected value of the number of pizzas which will be sold in a coming month. Since the accuracy of the model is quite poor, the number is off; it shows that, on average, 156691 pizzas are predicted to be sold. The second row shows the impact of each variable on this estimate; this is called the slope term. For example, the slope of Summer is 169.29, which suggests the effect that Summer has on the estimated value.

What is coefficient-standard error?
It suggests the average amount by which the coefficient estimate varies from the actual value. The standard error can be used to compute an estimate of the expected difference, for example a confidence interval for the coefficient.

What is coefficient-t value?
The t-value tells you how many standard errors the coefficient estimate is away from 0. The farther it is from zero, the easier it is to reject the null hypothesis, i.e. we can declare a relationship between Y and x1, x2, etc. In the above case only the WorkingDays coefficient's t-value is relatively far from 0, which suggests a strong relationship between the number of pizzas sold and WorkingDays. This also helps us in understanding the p-value of that variable.

What should be the value of a good variable which we should include in our model?
A good variable for the model will have the following attributes:-
i.  A high coefficient t-value
ii. A very low Pr(>|t|) value

What according to you is the most important variable in the model summary given above?
Looking at the t value and Pr value, Working Days seems to be the most relevant variable for the model

Now you need to use this model to predict the values for the testing dataset, which is stored in pizzaTest

Function

pred_model <- predict(LinearModel,pizzaTest)

You can check the accuracy of the values by comparing it with the actual data for these three months.

There are more ways to check the performance of a Linear Regression model which we will discuss in the next article.

Linear Regression Part 1 – Assumptions and basics of LR
Linear Regression Part 3 – Evaluation of the model

Keep Learning 🙂

The Data Monk

Assumptions Of Linear Regression – Part 1

Here we will talk about the assumptions of Linear Regression which will let you understand LR and will help you tackle questions in your interview. I personally know a handful of candidates who have boasted of being masters of Linear Regression but have failed to answer basic questions (Read – Me :P)

Read the complete article and answer all the 10 questions which are there in the link below
https://thedatamonk.com/linear-regression-interview-questions-day-19/

Assumptions Of Linear Regression

That’s why we have tried to consolidate all the learnings from my peer group.

If you want to learn more about Linear Regression and ARIMA forecasting in R with 100 questions then you can try our book on Amazon ‘100 Linear Regression and ARIMA forecasting questions in R’

To understand a Data Science algorithm you need to cover at least these three things:-
1. Assumptions
2. Code and Mathematical Formulas
3. Evaluating Performance of your model


Regression is a method of modeling a dependent value based on independent variables. This model is used extensively to solve a basic forecasting problem.

Do you want to know the next question asked by the panel (Lowe’s Technical Interview Round)?

What is the difference between forecasting and prediction?
These two do look the same, but there is a striking difference between them. Forecasting is based on historical data; it is mostly about projecting a line of future values by extending the historical trend.

Prediction is a judgment. It takes into account changes which are taking place in the future.

When you say that it will rain tomorrow by looking at the historical data, that’s forecasting; but reading a palm and telling someone’s future is an example of prediction. So, be ready for all types of questions.

Let’s get back to Linear regression. In a layman term, we can say that Regression is a method to predict the outcome of a variable in the best possible way given the past data and its outcomes.

Example – Can you forecast the salary of the age group 40-45 using the given data?

Age        Salary
20-25      $2,000
25-30      $3,000
30-35      $3,700
35-40      $4,000

You can guess that the Salary would be somewhere in the range of $4,000 – $4,500 looking at the already given outcomes. This is how a basic Linear Regression works.

The core of Linear Regression is to understand how the outcome variable is dependent on the independent variables.

What are the assumptions of Linear Regression?

This is one of the most asked questions in an interview which revolves around Linear Regression. There are five major assumptions of a Linear Regression:-
1. Linear Relationship
2. Low or No Multicollinearity
3. No autocorrelation
4. Multivariate Normality
5. Homoscedasticity


You don’t have to learn these points, you need to understand each of these before diving in the implementation part.

1. Linear Relationship – A linear regression assumes that there is a straight-line relationship between X and Y in the equation given below

Y = B0 + B1X + Epsilon

Y = B0 + B1X is nothing but the equation of a straight line

To do any statistical inference, we need to make some assumptions about the error term, which is represented as Epsilon. The first assumption comes into the picture here, where we assume three things about this random error term:-

a. Mean of the error is 0
b. Error is normally distributed
c. Error has a constant variance

Read the following line at least thrice to understand it

“Error Term ~ N(0, Variance) = This shows that every error term is normally distributed and has a mean of 0 and a constant variance”

Remember the equation Y = Bo+B1X+Epsilon ??

Now we can re-write the equation as Y ~ N(Bo+B1X,Variance)

This means that the Y is normally distributed with a mean of Bo+B1X and a constant variance.

So the first assumption goes like this “The dependent variable is a linear combination of the independent variable and the error term”

You can check the relationship between the dependent and independent variable by a scatter plot

(Scatter plot: a low or little linearity present in the dataset)

2. Multivariate Normality

The second assumption states that every Linear combination of Y with the independent variables needs to have a univariate Normal distribution. Multivariate normality is not a fancy term, it is just a generalization of the one-dimensional normal distribution to higher dimensions

3. Low or No Multicollinearity

This one is easy to understand. Suppose you want to find out the selling price of a car and your independent variables are the age of the car, kilometers driven, health of the engine, etc.

Now, we know that the number of Kilometers and age of the car will have high correlation(generally). The number of kilometers traveled by the car will increase with the age of the car.

Linear Regression states that using two variables with high correlation will complicate the model and will be of no use. So you need to chuck one of these. We will talk about two ways in which you can decide on removing one of the variables from age and kilometer

1. Variance Inflation Factor (VIF)
A VIF above 10 (some practitioners use 5 as the cut-off) is a direct indication of high multicollinearity. In layman’s terms, remove the variable with the high VIF value

2. Correlation Matrix – Plot a correlation matrix to understand the strength of correlation between the variables. Take out one of the two highly correlated variables at a time and check the performance of the model
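
A minimal sketch of computing VIF with statsmodels (the small car dataset below is made up; age and kms are deliberately correlated):

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

cars = pd.DataFrame({"age": [1, 2, 3, 4, 5, 6],
                     "kms": [12, 26, 33, 48, 55, 69],
                     "engine_health": [7, 9, 6, 8, 5, 9]})
X = sm.add_constant(cars)
vif = {col: variance_inflation_factor(X.values, i) for i, col in enumerate(X.columns)}
print(vif)   # ignore the constant term; age and kms show much higher VIFs than engine_health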

4. Low or No Autocorrelation in the data

In the last point we talked about the multicollinearity between the independent variables. Now, we want to check if there is a correlation between the data itself.

In simple language, if the value of f(x+1) is dependent on f(x), then the data is autocorrelated. A classic example is a share price, where today’s price depends on the previous value. You can check for autocorrelation with either a scatter plot or a Durbin-Watson test. The null hypothesis of the Durbin-Watson test is that “the residuals are not linearly correlated”

If 1.5 < d < 2.5, then the values are not auto-correlated.

See, if the data itself is autocorrelated, then it is hard to isolate the impact of the other variables on Y. So it is assumed that there is little or no autocorrelation in the data
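
A hedged sketch of running the Durbin-Watson test on regression residuals with statsmodels (the data is simulated, so little autocorrelation is expected):

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(2)
x = rng.normal(size=100)
y = 1 + 2 * x + rng.normal(size=100)         # independent errors, so little autocorrelation expected
model = sm.OLS(y, sm.add_constant(x)).fit()
print(durbin_watson(model.resid))            # should land roughly between 1.5 and 2.5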

5. Homoscedasticity

If the residuals have roughly the same variance across the regression line, the data is homoscedastic. If the spread of the residuals changes along the regression line (for example, the errors grow as the predicted value grows), the data is heteroscedastic and this assumption is violated.


We will look into the implementation part in the next article which will be followed by the evaluation of performance.

Linear Regression Part 2 – Implementation of LR
Linear Regression Part 3 – Evaluation of the model

Keep Learning 🙂

The Data Monk

The measure of Spread in layman terms

Data Science is a combination of Statistics and Technology. In this article, we will try to understand some basic terminologies in Layman’s language.

Suppose I run a chain of Pizza outlets across Bangalore and have around 500 delivery boys. We have assured “less than 30 minutes delivery time” to our customers, but while going through the feedback forms, we can see that the delivery executives seem to be taking more than the promised time.

NULL hypothesis – The delivery time is less than (or equal to) 30 minutes. It is represented by H0

Alternate Hypothesis – The delivery time is more than 30 minutes. It is represented as Ha.

We mostly try to test the Null hypothesis and see whether it’s true.

Population – Your total population is 500, which is the number of delivery boys

Sample – It’s not feasible to test the delivery time of each delivery boy, so we randomly take a small fragment of the population which is termed as Sample

You must have heard the term that a ‘p-value of 0.05 is good’, but what does that actually mean?

p-value helps you in determining the significance of your result in a hypothesis test. So, when you say that the p-value of the test is less than 0.05, you are saying “there is strong evidence against the Null Hypothesis and you can reject it”

Similarly, when the p-value is well above 0.05, we fail to reject the Null Hypothesis because the evidence against it is weak.

In layman’s terms, if the hypothesis testing for the case mentioned above results in a p-value less than 0.05, then we will reject the null hypothesis and say that the average amount of time to deliver a pizza is more than 30 minutes.
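
A minimal sketch of such a test with scipy (the delivery times below are made-up numbers; the one-sided alternative needs scipy 1.6 or newer):

from scipy import stats

delivery_times = [28, 33, 35, 31, 29, 36, 34, 32, 30, 37]   # hypothetical sample of delivery times in minutes
t_stat, p_value = stats.ttest_1samp(delivery_times, 30, alternative="greater")
print(t_stat, p_value)   # p < 0.05 would lead us to reject the claim that delivery takes at most 30 minutes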

You must have got a fair bit of idea about population, sample, null hypothesis, alternate hypothesis, and p-value.

Let’s get back to sampling. There are four methods to get a segment out of a population i.e. sampling of a population:-

a. Random Sampling – Completely random selection
b. Systematic Sampling – A systematic way to sample a population, like taking every kth record from the population
c. Stratified Sampling – Suppose the complete population is divided into multiple groups (strata); stratified sampling takes a sample from each group. This reduces the bias of the sample.
If we have a data set of people of different age groups, then a random sample might be biased towards a particular group, but stratified sampling takes care of this
d. Cluster Sampling – The population is divided into clusters (often geographically), a few clusters are picked at random, and the units within the chosen clusters are sampled
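
A small pandas sketch of the first three sampling methods (the delivery-boy data is invented for illustration):

import pandas as pd

df = pd.DataFrame({"delivery_boy_id": range(1, 501),
                   "zone": ["North", "South", "East", "West", "Central"] * 100,
                   "delivery_time": [25 + (i % 15) for i in range(500)]})

random_sample = df.sample(n=50, random_state=1)        # a. random sampling
systematic_sample = df.iloc[::10]                      # b. systematic sampling: every 10th record
stratified_sample = (df.groupby("zone", group_keys=False)
                       .apply(lambda g: g.sample(frac=0.1, random_state=1)))   # c. stratified by zone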

We have the data set i.e. the population and we have taken a sample from it.
Now, we need to know the spread of the sample or the population.

I assume that you already know about mean, median, mode, etc.

The measure of spread describes how similar or varied a set of observed values are for a variable. The measure of spread includes:-

a. Standard Deviation
b. Variance
c. Range
d. Inter Quartile Range
e. Quartile


You can easily find a copy-book definition on the internet. Let’s try to understand it in simple terms.

Mean gives you an idea of average of the data.
Suppose the average salary of the 5,000 employees at Walmart is $100,000.

The variance will give you an idea about the spread of the salary i.e. how far is your data point from the mean. We calculate Variance on either the complete population or the sample population.

Both formulas are almost the same; the only difference is the denominator. The population variance is the sum of squared deviations from the mean divided by N, while the sample variance divides by N-1. If you just want to memorize the formulas, that is fine too, but to understand the denominator you need to go through the concept of degrees of freedom. Let’s try

Degree of freedom is nothing but the number of observations in the data that are free to vary when estimating a parameter. In simple words, if you know the mean or average of 5 numbers, then all you need to know is 4 numbers and you can easily get the 5th number. Here the degree of freedom is n-1 i.e. 4. This is an example of the degree of freedom.

Now, why does the Population variance have N and the Sample variance N-1 as the denominator?

When we have a population of size 1000 and we have to calculate the Population variance, we already know the true mean of the Population, so we divide by N.

Read this out loud – When we only know the mean of the sample, we divide by N-1 to “compensate for the fact that we don’t have concrete information about the population; dividing by N-1 keeps the overall value slightly larger” (this correction is known as Bessel’s correction).
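
A quick numpy check of the two denominators (the salaries below are made up, in thousands):

import numpy as np

salaries = [90, 95, 100, 105, 110]
print(np.var(salaries))            # population variance: divides by N, gives 50.0
print(np.var(salaries, ddof=1))    # sample variance: divides by N-1, gives 62.5 (slightly larger)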


Quartile is a number and not a group of values. It’s more like a cut-off value. You have 3 quartiles in statistics

Quartile 1 – Also called 25th percentile. This is the middle number between the smallest number and the median of the data set.

Quartile 2 – Also called the median and the 50th percentile

Quartile 3 – Also called the 75th percentile and upper quartile. This is the middle value between the median and the highest value of the data set

To find the quartiles of this data set, use the following steps:

  1. Order the data from least to greatest.
  2. Find the median of the data set and divide the data set into two halves.
  3. Find the median of the two halves.

Interquartile Range = Q3-Q1
It is the midspread, i.e. the range of the middle 50 percent of the dataset. It’s also a value and not a group of numbers

Bonus knowledge

How to identify an outlier?

A basic rule is that whenever a data point lies more than 1.5 times the IQR above the third quartile (above Q3 + 1.5*IQR) or more than 1.5 times the IQR below the first quartile (below Q1 - 1.5*IQR), it is termed an outlier.

Let’s look at an example; it will make things clear.

Numbers – 2,4,5,7,9,11,13

Median = 4th term as we have 7 terms and the numbers are arranged in ascending order. Thus median(Q2) is 7

Quartile 1(Q1) = Median of the dataset containing the lower half of the data i.e. calculate the median of 2,4,5. Thus Q1 will be the 2nd term i.e. 4

Quartile 3(Q3) = Median of the upper half of the data i.e. median of 9,11,13. Thus Q3 is 11

(2,4,5),7,(9,11,13) ~ (Q1),Q2,(Q3)

Inter Quartile Range = Q3-Q1 = 11-4 = 7
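
The same numbers can be reproduced in Python with the median-of-halves approach used above (standard library only):

from statistics import median

data = sorted([2, 4, 5, 7, 9, 11, 13])
n = len(data)
q2 = median(data)                  # 7
q1 = median(data[:n // 2])         # median of the lower half (2, 4, 5) -> 4
q3 = median(data[(n + 1) // 2:])   # median of the upper half (9, 11, 13) -> 11
iqr = q3 - q1                      # 7
lower_fence = q1 - 1.5 * iqr       # -6.5
upper_fence = q3 + 1.5 * iqr       # 21.5; anything outside the fences would be an outlier
print(q1, q2, q3, iqr)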

Keep Learning

The Data Monk

100 Natural Language Processing Questions in Python

  1. What is NLP?
    NLP stands for Natural Language Processing and it is a branch of data science that consists of systematic processes for analyzing, understanding, and deriving information from the text data in a smart and efficient manner.

  2. What are the uses of NLP?
    Natural Language Processing is useful in various domains like Chat bots, Extracting insights from feedback and surveys, text-classification, etc.

  3. What are the different algorithms in NLP?
    NLP is used to analyze text, allowing machines to understand how humans speak.
    This human-computer interaction enables real-world applications like
    a. automatic text summarization
    b. sentiment analysis
    c. topic extraction
    d. named entity recognition
    e. parts-of-speech tagging
    f. relationship extraction
    g. stemming, and more.
    NLP is commonly used for text mining, machine translation, and automated question answering.
  4. What problems can NLP solve?
    NLP can solve many problems like, automatic summarization, machine translation, named entity recognition, relationship extraction, sentiment analysis, speech recognition, and topic segmentation etc.
  5. What is Regular Expression?
    A regular expression (sometimes called a rational expression) is a sequence of characters that define a search pattern, mainly for use in pattern matching with strings, or string matching, i.e. “find and replace”-like operations.
    Regular expressions are a generalized way to match patterns with sequences of characters.
  6. What are the different applications of Regular Expression in Data Science?
    a. Search engines like Google, Yahoo, etc. Google search engine understands that you are a tech guy so it shows you results related to you.
    b. Social websites feed like the Facebook news feed. The news feed algorithm understands your interests using natural language processing and shows you related Ads and posts more likely than other posts.
    c. Speech engines like Apple Siri.
    d. Spam filters like Google spam filters. It’s not just about the usual spam filtering, now spam filters understand what’s inside the email content and see if it’s a spam or not.
  7. What are the packages in Python that help with Regular Expressions?
    The package which we commonly use for regular expressions is re. We can import the package using the following command

    import re
  8. What is match function?
    import re
    re.match('ni', 'nitin')


    <re.Match object; span=(0, 2), match='ni'>
  9. What are the common patterns used in regular expression?
     \w+ -> word
    \d -> digit
    \s -> space
    . -> wildcard (matches any single character)
    + or * -> greedy match
    \S -> anti space i.e. it matches anything which is not a space
    [A-Z] -> matches all the characters in the range of capital A to capital Z
  10. What are the important functions to use in Regular Expression?
    findall() – It finds all the patterns in a string
    search() – It searches for a pattern anywhere in the string
    match() – It matches a pattern only at the beginning of a string
    split() – It splits a string on a regular expression pattern. It returns a list object
  11. What is the difference between match and search function?
    Match tries to match the string from beginning whereas search matches it wherever it finds the pattern. The below example will help you understand better

    import re
    print(re.match('kam', 'kamal'))
    print(re.match('kam', 'nitin kamal'))
    print(re.search('kam', 'kamal'))
    print(re.search('kam', 'nitin kamal'))
    <re.Match object; span=(0, 3), match='kam'>
    None
    <re.Match object; span=(0, 3), match='kam'>
    <re.Match object; span=(6, 9), match='kam'>

  12. Guess the output of the following
    import re
    re.split(r'\s', 'The Data Monk is cool')


    ['The', 'Data', 'Monk', 'is', 'cool']
  13. Work out the output of the following
    regx = r"\w+"
    strx = "This isn't my pen"
    re.findall(regx, strx)

    ['This', 'isn', 't', 'my', 'pen']
  14. How to write a regular expression to match a specific set of characters in a string?
    special_char = r"[?/}{';]"
    The above regular expression will match any single character listed inside the square brackets []

  15. Write a regular expression to split a paragraph every time it finds an exclamation mark

    import re
    exclamation = r"[!]"
    strr = "Data Science comprises of innumerable topics! The aim of this 100 Days series is to get you started assuming ! that you have no prior! knowledge of any of these topics. "
    excla = re.split(exclamation, strr)
    print(excla)


    ['Data Science comprises of innumerable topics', ' The aim of this 100 Days series is to get you started assuming ', ' that you have no prior', ' knowledge of any of these topics. ']
  16. Get all the words starting with a capital letter

    capital = r"[A-Z]\w+"
    print(re.findall(capital, strr))


    ['Data', 'Science', 'The', 'Days']

  17. Find the output of the following code?
    digit = "12 34 98"
    find_digit = r"\d+"
    print(re.findall(find_digit, digit))

    ['12', '34', '98']

  18. What is tokenization?
    Tokenization is one of the most important parts of NLP. It simply means breaking down a string into smaller chunks. It breaks a paragraph into words, sentences, etc.
  19. What is NLTK?
    NLTK stands for Natural Language Toolkit and it is a package in Python which is very commonly used for tokenization.

    from nltk.tokenize import word_tokenize
    word_tokenize("This is awesome!")


    ['This', 'is', 'awesome', '!']

  20. What are the important nltk tokenizers?

    sent_tokenize – Tokenizes a document into sentences
    tweet_tokenize – This one is exclusively for tweets, which can come in handy if you are trying to do sentiment analysis by looking at a particular hashtag or tweets
    regexp_tokenize – Tokenizes a string or document based on a regular expression pattern

  21. What is the use of the function set() ?
    The data type set is a collection. It contains an unordered collection of unique and immutable objects. So when you extract a set of words from a novel, then it will get you the distinct words from the complete novel. It is a very important function and it will continue to come in the book as you go ahead.
  22. Tokenize the paragraph given below into sentences.
    para = "This is the story about Piyush,29, Senior Data Scientist at Imagine Incorporation and myself, Pihu,24, Junior Data Scientist at the same organization. This is about the journey of Piyush once he retired from his job, after being unsatisfied with the way his career was moving ahead. Be with Piyush and Pihu to understand Data Science and Machine Learning."

    from nltk.tokenize import sent_tokenize
    from nltk.tokenize import word_tokenize

    sent = sent_tokenize(para)
    print(sent)


      ['This is the story about Piyush,29, Senior Data Scientist at Imagine Incorporation and myself, Pihu,24, Junior Data Scientist at the same organization.', 'This is about the journey of Piyush once he retired from his job, after being unsatisfied with the way his career was moving ahead.', 'Be with Piyush and Pihu to understand Data Science and Machine Learning.']
  23. Now get all the words from the above paragraph

    word = word_tokenize(para)

    [‘This’, ‘is’, ‘the’, ‘story’, ‘about’, ‘Piyush,29’, ‘,’, ‘Senior’, ‘Data’, ‘Scientist’, ‘at’, ‘Imagine’, ‘Incorporation’, ‘and’, ‘myself’, ‘,’, ‘Pihu,24’, ‘,’, ‘Junior’, ‘Data’, ‘Scientist’, ‘at’, ‘the’, ‘same’, ‘organization’, ‘.’, ‘This’, ‘is’, ‘about’, ‘the’, ‘journey’, ‘of’, ‘Piyush’, ‘once’, ‘he’, ‘retired’, ‘from’, ‘his’, ‘job’, ‘,’, ‘after’, ‘being’, ‘unsatisfied’, ‘with’, ‘the’, ‘way’, ‘his’, ‘career’, ‘was’, ‘moving’, ‘ahead’, ‘.’, ‘Be’, ‘with’, ‘Piyush’, ‘and’, ‘Pihu’, ‘to’, ‘understand’, ‘Data’, ‘Science’, ‘and’, ‘Machine’, ‘Learning’, ‘.’]

  24. Now get the unique words from the above paragraph
    word=set(word_tokenize(para))
    print(word)


    {‘retired’, ‘ahead’, ‘the’, ‘about’, ‘with’, ‘Piyush,29’, ‘Senior’, ‘Piyush’, ‘being’, ‘Science’, ‘was’, ‘Imagine’, ‘at’, ‘journey’, ‘way’, ‘same’, ‘and’, ‘Pihu’, ‘Pihu,24’, ‘Learning’, ‘from’, ‘story’, ‘he’, ‘Be’, ‘Machine’, ‘once’, ‘to’, ‘unsatisfied’, ‘Junior’, ‘of’, ‘career’, ‘Data’, ‘moving’, ‘is’, ‘understand’, ‘.’, ‘myself’, ‘after’, ‘job’, ‘,’, ‘Incorporation’, ‘Scientist’, ‘organization’, ‘This’, ‘his’}

  25. What is the use of .start() and .end() function?

    Basically .start() and .end() helps you find the starting and ending index of a search. Below is an example:

    x = re.search("Piyush", para)
    print(x.start(), x.end())


    24 30
  26. What is the OR method?
    OR method, as the name suggests is used to give condition to the regular expression. See the example below:-

    x = r"\d+|\w+"

    The above regex will get you all the words and numbers, but it will ignore other characters like punctuation, ampersands, etc. (Note: do not put spaces around the pipe, otherwise the spaces become part of the pattern.)

  27. What are the advance tokenization techniques?
    Take for example [A-Za-z]+, this will get you all the alphabets regardless of upper or lowercase alphabets

  28. How to write a regex to match spaces or commas?
    (\s+|,) – The \s+ will get you one or more spaces, and the pipe acts as an OR operator to take the comma into consideration

  29. How to include special characters in a regex?
    If you have any experience with regular expression or SQL queries, then this syntax will look familiar. You need to give a backward slash before any special character like below

    (\,\.\?) – This will consider comma, full stop and question mark in the text
  30. What is the difference between (a-z) and [A-Z]?
    This is a very important concept, when you specify (a-z), it will only match the string “a-z”. But when you specify [A-Z] then it covers all the alphabet between upper case A and Z.
  31. Once again go through the difference between search() and match() function
    Search() will find your desired regex pattern anywhere in the string, but match() always looks from the beginning of the string; if the pattern does not match right at the start, match() simply returns None. Be very particular while selecting one of these functions.
  32. What is topic modeling?
    In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract “topics” that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for discovery of hidden semantic structures in a text body.
  33. What is bag-of-words?
    Bag-of-words is a process to identify topics in a text. It basically counts the frequency of the token in a text. Example below to help you understand the simple concept of bag-of-words

    para = "The game of cricket is complicated. Cricket is more complicated than Football"

    The – 1
    game – 1
    of – 1
    cricket – 1
    is – 2
    complicated – 2
    Cricket – 1
    more – 1
    than – 1
    Football – 1

    As you can see, cricket and Cricket are counted as two separate tokens because bag-of-words is case sensitive.

  34. How to counter the case sensitive nature of bag-of-words?
    It’s a logical question; just convert every word to lower or upper case and then count the words. Look at question 37 to see every word converted to lower case using a loop.
  35. What is counter?
    A counter is a container that keeps count of number of times equivalent values are added. It looks similar to dictionary in Python. Counter supports three forms of initialization. Its constructor can be called with a sequence of items, a dictionary containing keys and counts, or using keyword arguments mapping string names to counts.

  36. How to import Counter in Python?
    Counter is present in the Collection package, you can use it directly by importing it like below:

    from collections import Counter

  37. Use the same paragraph used above and print the top 3 most common words
    The code is self explanatory and is given below:

    word2 = word_tokenize(para)
    lower_case = [t.lower() for t in word2]
    bag_of_words = Counter(lower_case)
    print(bag_of_words.most_common(3))

    [(‘the’, 4), (‘,’, 4), (‘data’, 3)]
  38. What is text preprocessing?
    Text preprocessing is the complete process of making text ready for analysis by removing stop words, common punctuation, spelling mistakes, etc. Before any analysis you are supposed to preprocess the text.
  39. What are the commonly used methods of text preprocessing?
    Converting the complete text in either lower or upper case
    Tokenization
    Lemmatization/Stemming
    Removing stop words
  40. How to tokenize only words from a paragraph while ignoring the numbers and other special character?

    x = "Here is your text. Your 1 text is here"
    from nltk.tokenize import word_tokenize
    only_alphabet = [w for w in word_tokenize(x.lower())
                     if w.isalpha()]
    print(only_alphabet)


    w.isalpha() function will check if the word has only text in it and will remove the numbers

    Output
    [‘here’, ‘is’, ‘your’, ‘text’, ‘your’, ‘text’, ‘is’, ‘here’]
  41. What are stop words?
    Stop words are common occurring words in a text which have high frequency but less importance. Words like the, are, is, also, he, she, etc. are some of the examples of English stop words.

  42. How to remove stop words from my text?
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize
    para = "Your text here. Here is your text"
    tokens = [w for w in word_tokenize(para.lower())
                      if w.isalpha()]
    stoppy = [t for t in tokens
                      if t not in stopwords.words('english')]

  43. What is Lemmatization?
    Lemmatization is a technique to reduce words to their base form or dictionary form (the lemma). An example will help you understand better

    The lemma of better will be good.
    The word “walk” is the base form of the word “Walking”
  44. Give an example of Lemmatization in Python
    import nltk
    nltk.download('wordnet')
    from nltk.stem import WordNetLemmatizer
    lem = WordNetLemmatizer()
    x = "running"
    lem.lemmatize(x, "v")

    Output
    'run'
  45. How to lemmatize the texts in your paragraph?
    Use the module WordNetLemmatizer from nltk.stem

    from nltk.stem import WordNetLemmatizer
    from nltk.corpus import stopwords
    lower_tokens = word_tokenize(para)
    lower_case = [t.lower() for t in lower_tokens]
    only_alphabet = [t for t in lower_case if t.isalpha()]
    without_stops = [x for x in only_alphabet if x not in stopwords.words('english')]
    lemma = WordNetLemmatizer()
    lemmatized = [lemma.lemmatize(t) for t in without_stops]


  46. What is gensim?
    Gensim is a very popular open-source NLP library. It is used to perform complex tasks like:-
    a. Building document or word vectors
    b. Topic identification
  47. What is a word vector?
    Word vectors are representations of words which help us in observing relationships between words and documents. Based on how the words are used in text, word vectors help us get the meaning and context of the words. For example, a word vector will connect Bangalore to Karnataka and Patna to Bihar, where Bangalore and Patna are the capitals of the Indian states Karnataka and Bihar respectively.

    These are multi-dimensional mathematical representation of words created using deep learning method. They give us insight into relationships between words in a corpus.

  48. What is LDA?
    LDA is used for topic analysis and modeling. It is used to extract the main topics from a dataset. LDA stands for Latent Dirichlet Allocation. Topic Modelling is the task of using unsupervised learning to extract the main topics (represented as a set of words) that occur in a collection of documents.
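
    A hedged sketch of topic modelling with gensim's LdaModel; the two tiny tokenized documents below are invented purely for illustration:

    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    docs = [['cricket', 'bat', 'ball', 'cricket', 'stadium'],
            ['python', 'code', 'data', 'python', 'model']]
    dictionary = Dictionary(docs)
    corpus = [dictionary.doc2bow(doc) for doc in docs]
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10, random_state=1)
    print(lda.print_topics())   # the top weighted words for each discovered topic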

  49. What is gensim corpus?
    A gensim corpus converts the tokens into bags of words. It gives the result as a list of (token id, token count) pairs. The gensim dictionary can be updated and reused

  50. What is stemming?
    Stemming is the process of reducing a word to its word stem by chopping off prefixes and suffixes; unlike a lemma, the stem is not necessarily a valid dictionary word. Stemming is important in natural language understanding (NLU) and natural language processing (NLP), and it is also used in search-engine query processing.
  51. Give an example of stemming in Python
    from nltk.stem.porter import PorterStemmer
    stem = PorterStemmer()
    x = "running"
    stem.stem(x)

    Output
    'run'
  52. What is tf-idf?
    Term frequency and inverse document frequency. It down-weights words that are common across all documents (beyond stop words) and up-weights words that are frequent within a particular document, so the weighting is document specific. The weight of term i in document j is tf(i,j) * log(N/df(i)), where N is the total number of documents and df(i) is the number of documents containing term i.
  • The weight will be low in two cases:-
    a. When the term frequency is low, i.e. the number of occurrences of a word in the document is low
    b. When N is equal to df(i), because log(N/df(i)) = log(1) = 0

    So, using (b), if a word occurs in all the documents, then the log value (and hence the weight) will be zero

    If the word “abacus” is present 5 times in a document containing 100 words. The corpus has 200 documents, with 20 documents mentioning the word “abacus”. The formula for tf-idf will be :-

    (5/100)*log(200/20)
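
    A quick check of that arithmetic in Python (assuming the log is taken base 10, which is a common convention for tf-idf):

    import math

    tf = 5 / 100                  # term frequency of 'abacus' in the document
    idf = math.log10(200 / 20)    # inverse document frequency = log10(10) = 1
    print(tf * idf)               # 0.05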

    53. How to create a tf-idf model using gensim?

    from gensim.models.tfidfmodel import TfidfModel
    tfidf = TfidfModel(corpus)         # corpus is a gensim bag-of-words corpus
    tfidf_weights = tfidf[doc]         # doc is a single bag-of-words document

    # Sort the weights from highest to lowest: sorted_tfidf_weights
    sorted_tfidf_weights = sorted(tfidf_weights, key=lambda w: w[1], reverse=True)
    # Print the top 5 weighted words
    for term_id, weight in sorted_tfidf_weights[:5]:
        print(dictionary.get(term_id), weight)


    54. What is Named Entity Recognition?
    It is the process of identifying important named entities in a document, e.g. organizations, person names, locations, works of art, etc.
    It is present in the ne_chunk_sents() function in nltk package. It can be used as below:-

    chunk_Sent = nltk.ne_chunk_sents(Part_Of_Speech_sentence_token, binary = True)

    55. What is POS?
    Part of Speech tag in Natural Language Processing is used to tag a word according to its use in the sentence. It tags the word as a part of speech.
    It is present as pos_tag() in nltk package. You can feed the tokenized word in a loop to get the POS tag for each word like below:-

    pos = [nltk.pos_tag(x) for x in tokenized_word_variable]

    56. What is the difference between lemmatization and stemming?
    Lemmatization gets to the base of the word whereas stemming just chops the tail of the word to get the base form. Below example will serve you better:

    See is the lemma of saw, but if you try to get the stem of saw, then it will return ‘s’ as the stem.
    See is the lemma of seeing, stemming seeing will get you see.

    54. What is spacy package?
    Spacy is a very efficient package present in Python which helps in easy pipeline creation and finding entities in tweets and chat messages.

    55. How to initiate the English module in spacy?
    import spacy
    x = spacy.load('en', tagger=False, parser=False, matcher=False)


    56. Why should one prefer spacy over nltk for named entity recognition?
    Spacy provides some extra categories, other than the one provided by nltk.

    These categories are:-
    -NORP
    -Cardinal
    -money
    -Work of art
    -Language
    -Event

    So, you can try spacy for NER according to your need

    57. What are the different packages which uses word vectors?
    Spacy and gensim are the two packages which we have covered so far that uses word vectors.

    58. What if your text is in various different languages? Which package can help you with Named Entity Recognition for most of the largely spoken languages?
    Polyglot is one of the packages which supports more than 100 languages and uses word vectors for Named Entity Recognition

    59.What is supervised learning?
    Supervised learning is a form of Machine Learning where your model is trained by looking at a given output for all the inputs. The model is trained on this input-output combination and then the learning of the model is tested on the test dataset. Linear Regression and Classification are two examples of supervised learning.

    60. How can you use Supervised Learning in NLP?
    Suppose you have chat data and, by looking at keywords, you have labelled the sentiment of each customer. Now you have a dataset which contains complete chats and the sentiment associated with each chat. You can use supervised learning to train a model on this dataset and then use it during a live chat to identify the ongoing sentiment of the customer.

    61. What is Naïve-Bayes model?
    Naive Bayes classifiers are linear classifiers that are known for being simple yet very efficient. The probabilistic model of naive Bayes classifiers is based on Bayes’ theorem, and the adjective naive comes from the assumption that the features in a dataset are mutually independent.

    62.What is the flow of creating a Naïve Bayes model?
    from sklearn import metrics
    from sklearn.naive_bayes import MultinomialNB
    # Instantiate a Multinomial Naive Bayes classifier: nb_classifier
    nb_classifier = MultinomialNB()
    # Fit the classifier to the training data
    nb_classifier.fit(count_train, y_train)
    # Create the predicted tags: pred
    pred = nb_classifier.predict(count_test)
    # Calculate the accuracy score: score
    score = metrics.accuracy_score(y_test, pred)
    print(score)
    # Calculate the confusion matrix: cm
    cm = metrics.confusion_matrix(y_test, pred, labels=['FAKE', 'REAL'])
    print(cm)


    Let’s take some sample text and try to implement basic algorithms first

    63. What is POS?
    POS stands for Parts of Speech tagging and it is used to tag the words in your document according to Parts of Speech. So, noun, pronoun, verb, etc. will be tagged accordingly and then you can filter what you need from the dataset. If I am just looking for names of people mentioned in the comment box then I will look for mainly Nouns. This is a basic but very important algorithm to work with.

    64. Take an example to take a sentence and break it into tokens i.e. each word
    text = "The Data Monk will help you learn and understand Data Science"
    tokens = word_tokenize(text)
    print(tokens)
    ['The', 'Data', 'Monk', 'will', 'help', 'you', 'learn', 'and', 'understand', 'Data', 'Science']


    65. Take the same sentence and get the POS tags

    from nltk import word_tokenize, pos_tag
    text = "The Data Monk will help you learn and understand Data Science"
    tokens = word_tokenize(text)
    print(pos_tag(tokens))


    [('The', 'DT'), ('Data', 'NNP'), ('Monk', 'NNP'), ('will', 'MD'), ('help', 'VB'), ('you', 'PRP'), ('learn', 'VB'), ('and', 'CC'), ('understand', 'VB'), ('Data', 'NNP'), ('Science', 'NN')]

    66. Take the following line and break it into tokens and tag POS using function
    data = "The Data Monk was started in Bangalore in 2018. Till now it has more than 30 books on Data Science on Amazon"


    data = "The Data Monk was started in Bangalore in 2018. Till now it has more than 30 books on Data Science on Amazon"

    #Tokenize the words and apply POS
    def token_POS(token):
        token = nltk.word_tokenize(token)
        token = nltk.pos_tag(token)
        return token
    token = token_POS(data)

Output

    67. What is NER?
    NER stands for Named Entity Recognition and the work of this algorithm is to extract specific chunk of data from your text data. Suppose you want to get all the Nouns from the dataset . It is a subtask of information extraction that seeks to locate and classify named entity mentions in unstructured text into pre-defined categories such as the person names, organizations, locations, medical codes. Etc.

    68. What are some of the common tags in POS. You need to know the meaning of the tags to use it in your regular expression
    DT – Determiner
    FW – Foreign word
    JJ – Adjective
    JJR – Comparative Adjective
    NN – Singular Noun
    NNS – Plural Noun
    RB – Adverb
    RBS – Superlative Adverb
    VB – Verb

    You can get the complete list on the internet.

    69. Implement NER on the tokenized and POS tagged sentence used above.
    nltk.download('maxent_ne_chunker')
    nltk.download('words')
    ne_chunked_sents = nltk.ne_chunk(token)
    named_entities = []
    for tagged_tree in ne_chunked_sents:
        if hasattr(tagged_tree, 'label'):
            entity_name = ' '.join(c[0] for c in tagged_tree.leaves())
            entity_type = tagged_tree.label()  # get NE category
            named_entities.append((entity_name, entity_type))
    print(named_entities)

    [('Data Monk', 'ORGANIZATION'), ('Bangalore', 'GPE'), ('Data Science', 'PERSON'), ('Amazon', 'ORGANIZATION')]

    Code Explanation
    nltk.download will import maxent_ne_chunker which is used to break the sentence into named entity chunks and nltk.download(‘words’) will download the dictionary

    We already have a variable token which contains POS tagged tokens. nltk.ne_chunk(token) will tag the tokens to Named entity chunks.

    The function hasattr() is used to check if an object has the given named attribute; it returns True if present, else False.

    .leaves() function is used to get the leaves of the node and label() will get you the NER label

    70.What are n-grams?
    A combination of N words together are called N-Grams. N grams (N > 1) are generally more informative as compared to words (Unigrams) as features. Also, bigrams (N = 2) are considered as the most important features of all the others. The following code generates bigram of a text.

    71. Create a 3-gram of the sentence below
    "The Data Monk was started in Bangalore in 2018"


    def ngrams(text, n):
        token = text.split()
        final = []
        for i in range(len(token)-n+1):
            final.append(token[i:i+n])
        return final
    ngrams("The Data Monk was started in Bangalore in 2018", 3)

    Output
    [['The', 'Data', 'Monk'], ['Data', 'Monk', 'was'], ['Monk', 'was', 'started'], ['was', 'started', 'in'], ['started', 'in', 'Bangalore'], ['in', 'Bangalore', 'in'], ['Bangalore', 'in', '2018']]

72. What is the right order of components for a text classification model?

Text cleaning
Text annotation
Text to predictors
Gradient descent
Model tuning

73. What is CountVectorizer?
CountVectorizer is a class from sklearn.feature_extraction.text. It converts a collection of text documents to a matrix of token counts.

———————————————————

Let’s take up a project and try to solve it using NLP. Here we will create the dataset ourselves and use NLP together with a Random Forest to train a model that identifies the sentiment of a review.

The objective of the project is to predict the correct tag, i.e. whether people liked the food or not, using NLP and Random Forest.


74. How to create a dataset? What to write in it?
Open an excel file and save it as Reviews (in the csv format). Now make two columns in the sheet like the one given below

Review Liked
This restaurant is awesome 1
Food not good 0
Ambience was wow 1
The menu is good 1
Base was not good 0
Very bad 0
Wasted all the food 0
Delicious 1
Great atmosphere 1
Not impressed with the food 0
Nice 1
Bad taste 0
Great presentation 1
Lovely flavor 1
Polite staff 1
Bad management 0


Basically you can write the review of anything like Movies, food, restaurant, etc. Just make sure to keep the format like this. Thus your dataset is ready.

75. What all packages do I need to import for this project?
It’s always good to start with importing all the necessary packages which you might use in the project

import re
import pandas as pd
import numpy as np
import nltk
nltk.download(‘stopwords’)
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix


We will discuss each of these as we tackle the problem

76. How to import a csv file in Python?
Importing csv file in python requires importing pandas library and using read_csv function

review = pd.read_csv('C://Users//User//Downloads//Restaurant_Reviews.csv')

77. Let’s view the top and bottom 5 lines of the file to make sure we are good to go with the analysis
Use the commands given below
review.head() and review.tail()

78. Now we will clean the dataset. Will start with removing numbers and punctuations. Write a regular expression for removing special characters and numbers

review is the name of the data set and Review is the name of the column

final = []
for i in range(0,16):
    x = re.sub('[^a-zA-Z]', ' ', review['Review'][i])

79. What is sub() method?
The re.sub() function in the re module can be used to replace substrings.

The syntax for re.sub() is re.sub(pattern,repl,string).

That will replace the matches in string with repl.
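
A tiny illustration of re.sub() with the same pattern used in this project (the sample string is made up):

import re

print(re.sub('[^a-zA-Z]', ' ', 'Food not good!! 123'))
# every character that is not a letter is replaced by a space: 'Food not good      '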

80. Convert all the text into lower case and split the words
final = []
for i in range(0,16):
    x = re.sub('[^a-zA-Z]', ' ', review['Review'][i])
    x = x.lower()
    x = x.split()

81. Now we want to stem the words. Do you remember the definition of stemming?
 Stemming is the process of reducing a word to its word stem by chopping off prefixes and suffixes; unlike a lemma, the stem is not necessarily a valid dictionary word. Stemming is important in natural language understanding (NLU) and natural language processing (NLP), and it is also used in search-engine query processing.

final = []
for i in range(0,16):
    x = re.sub('[^a-zA-Z]', ' ', review['Review'][i])
    x = x.lower()
    x = x.split()
    port = PorterStemmer()
    x = [port.stem(words) for words in x
         if not words in set(stopwords.words('english'))]

82. What does the above snippet do?
port = PorterStemmer() allocates the stemming function to the variable port
port.stem(words) for words in x – 
It takes all the words individually. Also remove the words which are stopwords.

x = [port.stem(words) for words in x
         if not words in set(stopwords.words('english'))]

The above loop will get all the non-stop words and stem the words

83. Create the final dataset with only stemmed words.
final = []
for i in range(0,16):
    x = re.sub('[^a-zA-Z]', ' ', review['Review'][i])
    x = x.lower()
    x = x.split()
    port = PorterStemmer()
    x = [port.stem(words) for words in x
         if not words in set(stopwords.words('english'))]
    x = ' '.join(x)
    final.append(x)

Let’s see how the final dataset looks after removing the stop words and stemming the text

84. How to use the CountVectorizer() function? Explain using an example
from sklearn.feature_extraction.text import CountVectorizer
corpus = ['The Data Monk helps in providing resource to the users',
         'It is useful for people making a career in Data Science',
         'You can also take the 100 days Challenge of TDM']
counter = CountVectorizer()
X = counter.fit_transform(corpus)
print(counter.get_feature_names())
print(X.toarray())


get_feature_names() will take all the words from the above dataset and arrange them in alphabetical order
fit_transform() will transform each line of the dataset against the result of get_feature_names()
toarray() will change the datatype to an array

Let’s understand the output

[‘100’, ‘also’, ‘can’, ‘career’, ‘challenge’, ‘data’, ‘days’, ‘for’, ‘helps’, ‘in’, ‘is’, ‘it’, ‘making’, ‘monk’, ‘of’, ‘people’, ‘providing’, ‘resource’, ‘science’, ‘take’, ‘tdm’, ‘the’, ‘to’, ‘useful’, ‘users’, ‘you’]

         [[0 0 0 0 0 1 0 0 1 1 0 0 0 1 0 0 1 1 0 0 0 2 1 0 1 0]
         [0 0 0 1 0 1 0 1 0 1 1 1 1 0 0 1 0 0 1 0 0 0 0 1 0 0]
         [1 1 1 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 1 1 0 0 0 1]]

The first output is the 26 unique words from the 3 lines of document arranged in alphabetical order.
The next three contains the presence of the above words in the document. 0 present in the 1,2,3, and 4th place of the first row suggests that the words 100, also, can, and career are not present in the first line of the input.
Similarly 2 present on the 22nd position shows that the word “the” is present twice in the first row of input
The first row of input is “The Data Monk helps in providing resource to the users”

85. Now let’s apply CountVectorizer on our dataset
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 1500)
X = cv.fit_transform(final).toarray()

max_features = 1500 will make sure that at most 1,500 words are kept in the master array. In case you are planning to apply this to a huge dataset, do increase the max_features value.
X will have the same array of occurrences across all the features as we saw in the above example

86. How to separate the dependent variable?
As we know we want to see whether the review was positive or not. So the dependent variable here is the second column and we have put the value of the second column in a different variable i.e. y

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 1500)
X = cv.fit_transform(final).toarray()
y = review.iloc[:,1].values


So, X has the array containing an array of occurrence of different words across all the words and y has the binary value where 1 denotes like and 0 denotes did not like

87. Now we need to split the complete data set into train and test

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)


You already know about X and y, the test_size will divide the train and test dataset in 75:25 ratio respectively
Now you will have to train the model on X_train and y_train.

88. Random forest is one of the best model to work on supervised learning. By the way, what is Random forest?
Before we start explaining a forest, we need to know what a tree is. A random forest is made of decision trees. To illustrate the concept, we’ll use an everyday example: predicting tomorrow’s maximum temperature for our city. To keep things straight, I’ll use Seattle, Washington, but feel free to pick your own city.
In order to answer the single max temperature question, we actually need to work through an entire series of queries. We start by forming an initial reasonable range given our domain knowledge, which for this problem might be 30–70 degrees (Fahrenheit) if we do not know the time of year before we begin. Gradually, through a set of questions and answers we reduce this range until we are confident enough to make a single prediction.

Since temperature is highly dependent on time of year, a decent place to start would be: what is the season? In this case, the season is winter, and so we can limit the prediction range to 30–50 degrees because we have an idea of what the general max temperatures are in the Pacific Northwest during the winter. This first question was a great choice because it has already cut our range in half. If we had asked something non-relevant, such as the day of the week, then we could not have reduced the extent of predictions at all and we would be back where we started. Nonetheless, this single question isn’t quite enough to narrow down our estimate so we need to find out more information. A good follow-up question is: what is the historical average max temperature on this day? For Seattle on December 27, the answer is 46 degrees. This allows us to further restrict our range of consideration to 40–50 degrees. Again, this was a high-value question because it greatly reduced the scope of our estimate.

We need to have similar questions and once we put everything in a flow we will get a decision tree.
So, to arrive at an estimate, we used a series of questions, with each question narrowing our possible values until we were confident enough to make a single prediction. We repeat this decision process over and over again in our daily lives with only the questions and answers changing.

89. What is Random Forest?
Every person comes to the problem with different background knowledge and may interpret the exact same answer to a question entirely differently. In technical terms, the predictions have variance because they will be widely spread around the right answer. Now, what if we take predictions from hundreds or thousands of individuals, some of which are high and some of which are low, and decided to average them together? Well, congratulations, we have created a random forest! The fundamental idea behind a random forest is to combine many decision trees into a single model.


You can find plenty of articles on Medium.com explaining Decision Trees and Random Forests in layman's terms.
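To make the "average many opinions" idea concrete, here is a small sketch that trains several trees on bootstrap samples of the toy data from the previous sketch and averages their predictions by hand, which is roughly what a random forest does internally (it also randomises the features considered at each split):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

preds = []
for seed in range(10):
    # each tree sees a different bootstrap sample of the rows,
    # so each "opinion" differs a little
    rng = np.random.RandomState(seed)
    idx = rng.choice(len(X_toy), size = len(X_toy), replace = True)
    t = DecisionTreeRegressor(random_state = seed)
    t.fit(np.array(X_toy)[idx], np.array(y_toy)[idx])
    preds.append(t.predict([[0, 46]])[0])

print(np.mean(preds))                 # the averaged prediction of our hand-made forest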

90. Let’s create our Random forest model here
model = RandomForestClassifier(n_estimators = 10,
                               criterion = 'entropy')

model.fit(X_train, y_train)

91. Define n_estimators
n_estimators is simply the number of trees you want to grow in your forest. Try varying it.
In general, the more trees you use, the better the results. However, the improvement shrinks as the number of trees increases, i.e. at a certain point the gain in prediction performance from learning more trees becomes lower than the cost in computation time of learning those additional trees.

Random forests are ensemble methods: you average over many trees. Similarly, if you want to estimate the average of a real-valued random variable (e.g. the average height of a citizen in your country) you can take a sample. The standard error of that estimate decreases with the square root of the sample size, and at a certain point the cost of collecting a larger sample outweighs the gain in accuracy from it.
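A quick way to see this trade-off yourself (a sketch, assuming the X_train/X_test split from the earlier step) is to train the same forest with different numbers of trees and compare the test accuracy:

from sklearn.ensemble import RandomForestClassifier

for n in (10, 100, 500):
    rf = RandomForestClassifier(n_estimators = n, criterion = 'entropy', random_state = 0)
    rf.fit(X_train, y_train)
    print(n, rf.score(X_test, y_test))    # accuracy usually flattens out as n grows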

92. Define criterion. Why did you use entropy and not gini?
criterion is the function used to measure the quality of a split. The commonly quoted differences between the two are:
Gini is intended for continuous attributes, while entropy suits attributes that occur in classes.
Gini aims to minimise misclassification.
Entropy is used more for exploratory analysis.
Entropy is a little slower to compute because it involves a logarithm.
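In practice the two criteria usually produce very similar forests; here is a quick sketch (same hypothetical split as before) to compare them:

from sklearn.ensemble import RandomForestClassifier

for crit in ('gini', 'entropy'):
    rf = RandomForestClassifier(n_estimators = 100, criterion = crit, random_state = 0)
    rf.fit(X_train, y_train)
    print(crit, rf.score(X_test, y_test))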

93. What is model.fit()?
model.fit() trains your model. Its two parameters are the training dataset, i.e. X_train and y_train. It takes the review features and their labels and builds many decision trees that map the inputs to the output. The learned rules are then applied to your test dataset to get predictions.

94. Let’s predict the output for the testing dataset
y_pred = model.predict(X_test)

You have just trained the model on X_train and y_train. Now you need to predict the output for X_test. We already know the true labels for these rows (y_test), but we want our model to predict them so that we can compare its answers with the actual output.

95. Now let’s check the confusion matrix to see how many of our outputs were correct
from sklearn.metrics import confusion_matrix  

cm = confusion_matrix(y_test, y_pred)

96. Lastly, what is a confusion matrix and how to know the accuracy of the model?
A confusion matrix is a table that is often used to describe the performance of a classification model (or “classifier”) on a set of test data for which the true values are known.

Let's take an example of a confusion matrix for a binary classifier, with the actual values in the rows and the predicted values in the columns:

                      Predicted Positive    Predicted Negative
Actual Positive               50                    10
Actual Negative                5                   100

The 50 and 100 on the diagonal show where the predicted and actual values matched. The 10 and 5 off the diagonal show the predictions that were wrong. Explore precision, recall, etc. on top of these counts.

As far as accuracy is concerned, the formula is simple: (50+100)/(50+10+5+100),
i.e. total correct predictions divided by all predictions.
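You can compute the same number straight from the matrix returned by sklearn; as a sketch, sklearn also provides accuracy_score and classification_report for precision and recall:

from sklearn.metrics import accuracy_score, classification_report

accuracy = cm.diagonal().sum() / cm.sum()     # correct predictions / all predictions
print(accuracy)
print(accuracy_score(y_test, y_pred))         # same value, computed directly
print(classification_report(y_test, y_pred))  # precision, recall and f1 per class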

Our model had a very small dataset. The confusion matrix came out as:

                      Predicted Positive    Predicted Negative
Actual Positive                1                     0
Actual Negative                0                     3

Therefore accuracy = (1+3)/(1+0+0+3) = 100% accuracy
Yeahhhh..we are perfect

Complete code

import re
import pandas as pd
import numpy as np
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

# Load the restaurant reviews dataset
review = pd.read_csv('C://Users//User//Downloads//Restaurant_Reviews.csv')
review.tail()
final = []

# Clean each review: keep letters only, lowercase, drop stopwords, stem the rest
for i in range(0, 16):
    x = re.sub('[^a-zA-Z]', ' ', review['Review'][i])
    x = x.lower()
    x = x.split()
    port = PorterStemmer()
    x = [port.stem(words) for words in x
         if not words in set(stopwords.words('english'))]
    x = ' '.join(x)
    final.append(x)

# Bag-of-words features (X) and labels (y)
cv = CountVectorizer(max_features = 1500)
X = cv.fit_transform(final).toarray()
y = review.iloc[:, 1].values
print(X)

# 75:25 train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)

# Random forest with 501 trees, using entropy as the split criterion
model = RandomForestClassifier(n_estimators = 501,
                               criterion = 'entropy')
model.fit(X_train, y_train)

# Predict on the test set and build the confusion matrix
y_pred = model.predict(X_test)
y_pred
cm = confusion_matrix(y_test, y_pred)
cm

Damn !! I got out in the nervous 90’s 😛

This is all you need to hop on to a real-life problem or a hackathon. Do comment if you find any flaw in the code.

Keep Learning 🙂

The Data Monk