10 Must-Have SQL Questions

1. What are the different Analytic functions available in SQL Server?

  • FIRST_VALUE(): Returns the first value in an ordered set of values. If a PARTITION BY clause is specified, it returns the first value in each partition after ordering the partition by the ORDER BY clause.
  • LAST_VALUE(): Returns the last value in an ordered set of values. If a PARTITION BY clause is specified, it returns the last value in each partition after ordering the partition by the ORDER BY clause.
  • LAG(): Provides access to a row at a given physical offset that comes before the current row. Use this function in a SELECT statement to compare values in the current row with values in a previous row, as specified by the offset. The default offset is 1 if not specified. If a PARTITION BY clause is specified, the offset is applied within each partition after ordering by the ORDER BY clause.
  • LEAD(): Provides access to a row at a given physical offset that comes after the current row. Use this function in a SELECT statement to compare values in the current row with values in a subsequent row, as specified by the offset. The default offset is 1 if not specified. As with LAG(), the offset is applied within each partition when PARTITION BY is used (see the sketch after this list).
  • PERCENT_RANK(): Evaluates the relative standing of a value within a query result set or partition. The values returned range from 0 to 1 inclusive; the first row in any ordered set always has a PERCENT_RANK of 0.
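
As a quick illustration, here is a minimal sketch against a hypothetical Sales(Region, SaleDate, Amount) table (the table and column names are made up for the example):

 -- Compare each sale with its neighbours within the same region.
 SELECT
     Region,
     SaleDate,
     Amount,
     FIRST_VALUE(Amount) OVER (PARTITION BY Region ORDER BY SaleDate) AS FirstAmount,
     LAG(Amount, 1)      OVER (PARTITION BY Region ORDER BY SaleDate) AS PrevAmount,
     LEAD(Amount, 1)     OVER (PARTITION BY Region ORDER BY SaleDate) AS NextAmount
 FROM Sales;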

2. What are the different Ranking Window functions in SQL Server?

SYNTAX:

 ROW_NUMBER() | RANK() | DENSE_RANK() | NTILE(n)
     OVER ( [PARTITION BY <column(s)>] ORDER BY <column(s)> )

The PARTITION BY clause is optional. If it is not used, the entire result set is treated as a single partition and ranked as one group.

  • ROW_NUMBER(): Always generates unique, sequential values with no gaps, even when there are ties.
  • RANK(): Ranks each row in the result set. Tied values get the same rank, and the sequence can then have gaps.
  • DENSE_RANK(): Also returns the same rank for ties, but leaves no gaps in the sequence.
  • NTILE(): Divides the rows into roughly equal-sized buckets. Suppose you have 20 rows and you specify NTILE(2): you get 2 buckets with 10 rows each. With NTILE(3), you get 2 buckets with 7 rows and 1 bucket with 6 rows. The sketch below puts the four functions side by side.
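
A minimal sketch on a hypothetical Scores(StudentID, Marks) table (names are assumptions) that makes the tie handling visible:

 SELECT
     StudentID,
     Marks,
     ROW_NUMBER() OVER (ORDER BY Marks DESC) AS RowNum,   -- 1,2,3,4 even for ties
     RANK()       OVER (ORDER BY Marks DESC) AS Rnk,      -- 1,2,2,4 (gap after a tie)
     DENSE_RANK() OVER (ORDER BY Marks DESC) AS DenseRnk, -- 1,2,2,3 (no gap)
     NTILE(2)     OVER (ORDER BY Marks DESC) AS Bucket    -- splits the rows into 2 buckets
 FROM Scores;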

3. Which operator is used for Pattern Matching?

The LIKE operator is used for pattern matching. It supports the wildcards below (illustrated in the sketch after the list).
  • % : Matches any string of zero or more characters.
  • _ : Matches any single character.
  • [] : Matches any single character within the specified range ([a-f]) or set ([abcdef]).
  • [^] : Matches any single character not within the specified range ([^a-f]) or set ([^abcdef]).
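
A few hedged examples against a hypothetical Employees(Name) column:

 SELECT Name FROM Employees WHERE Name LIKE 'A%';      -- starts with A
 SELECT Name FROM Employees WHERE Name LIKE '_a%';     -- second character is 'a'
 SELECT Name FROM Employees WHERE Name LIKE '[a-f]%';  -- first character between a and f
 SELECT Name FROM Employees WHERE Name LIKE '[^a-f]%'; -- first character not between a and f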

4. What is Partitioning and its benefits?

SQL Server supports table and index partitioning. Partitioning is a way to divide a large table into smaller, more manageable parts without having to create separate tables for each part. Data in a partitioned table is physically stored in groups of rows called partitions, and each partition can be accessed and maintained separately. Partitioning is not visible to end users; a partitioned table behaves like one logical table when queried.
Benefits:

  • You can transfer or access subsets of data quickly and efficiently, while maintaining the integrity of the data collection.
  • You can perform maintenance operations on one or more partitions more quickly. The operations are more efficient because they target only these data subsets instead of the whole table.

Partitioning is mostly intended to aid maintenance on larger tables and to offer fast ways to load and remove large amounts of data from a table. It can enhance query performance, but there is no guarantee. A minimal setup is sketched below.
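
A minimal sketch, assuming a hypothetical Orders table range-partitioned by year (all names and boundary dates are made up):

 -- Partition function: boundary values split rows into year-based ranges.
 CREATE PARTITION FUNCTION pfOrderYear (date)
     AS RANGE RIGHT FOR VALUES ('2023-01-01', '2024-01-01');

 -- Partition scheme: maps each partition to a filegroup (all to PRIMARY here).
 CREATE PARTITION SCHEME psOrderYear
     AS PARTITION pfOrderYear ALL TO ([PRIMARY]);

 -- The table is created on the scheme, partitioned by OrderDate.
 CREATE TABLE dbo.Orders (
     OrderID   int  NOT NULL,
     OrderDate date NOT NULL
 ) ON psOrderYear (OrderDate);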

5. What is Index in SQL?

Indexes speed up the querying process by providing swift access to rows in data tables, similarly to the way a book's index helps you find information quickly within that book. When SQL Server has no index to use for searching, the result is similar to a reader who looks at every page in a book to find a word: the SQL engine needs to visit every row in a table. In database terminology, we call this behavior a table scan, or just a scan. Indexes can be created on most columns, except large object data types such as image, text and varchar(max).
However, varchar(max), varbinary(max), and xml data types can participate in a non-clustered index as non-key (included) columns.
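
A minimal sketch of creating an index, with hypothetical table and column names:

 -- Non-clustered index to speed up lookups by LastName.
 CREATE NONCLUSTERED INDEX IX_Employees_LastName
     ON dbo.Employees (LastName);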

6. What are the different types of Joins in SQL?

  • INNER JOIN: Returns rows only when there is a match in both tables.
  • LEFT OUTER JOIN: Returns all rows from the left table even if there isn't a match in the right table; right-table columns are filled with NULL where there is no match.
  • RIGHT OUTER JOIN: The opposite of LEFT OUTER JOIN. Returns all rows from the right table even if there isn't a match in the left table; left-table columns are filled with NULL where there is no match.
  • FULL OUTER JOIN: A combination of LEFT and RIGHT JOIN. Returns all rows from both tables, with NULL on whichever side has no match.
  • CROSS JOIN: Returns the Cartesian product of both tables, i.e. M*N rows (see the sketch after this list).
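
A hedged sketch with hypothetical Customers and Orders tables:

 -- All customers, with order data where it exists (NULLs otherwise).
 SELECT c.CustomerID, c.Name, o.OrderID
 FROM Customers AS c
 LEFT OUTER JOIN Orders AS o
     ON o.CustomerID = c.CustomerID;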

7. What is Normalization? How many Normalization forms are there?

Normalization is used to organize data so that redundancy is minimized and insert, update and delete anomalies are avoided.
There are five commonly cited normal forms:
First Normal Form (1NF): Eliminates repeating groups and duplicate columns; each column holds a single atomic value, and each row is uniquely identifiable.
Second Normal Form (2NF): Follows 1NF; places subsets of data in separate tables and links them through primary keys, so that every non-key column depends on the whole primary key (no partial dependencies).
Third Normal Form (3NF): Follows 2NF and removes columns that are not directly dependent on the primary key (transitive dependencies).
Boyce-Codd Normal Form (BCNF): A stricter version of 3NF in which every determinant is a candidate key. (Note: BCNF is distinct from 4NF, although the two are often confused.)
Fourth Normal Form (4NF): Follows BCNF and removes multi-valued dependencies. A small 2NF/3NF-style split is sketched after this list.
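
As a quick illustration, consider a hypothetical denormalized table in which customer details are repeated on every order row; splitting it in the spirit of 2NF/3NF gives (all names are assumptions):

 -- Customer attributes now live once, keyed by CustomerID,
 -- instead of being repeated on every order row.
 CREATE TABLE Customers (
     CustomerID int PRIMARY KEY,
     Name       varchar(100),
     City       varchar(100)
 );

 CREATE TABLE Orders (
     OrderID    int PRIMARY KEY,
     CustomerID int REFERENCES Customers(CustomerID),
     OrderDate  date
 );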

8. What are the different types of Indexes?

  • Clustered Index: Clustered indexes sort and store the data rows in the table or view based on their key values. Key values are the column (or columns) included in the index definition. There can be only one clustered index per table, because the data rows themselves can be stored in only one order.
  • Non-Clustered Index: Non-clustered indexes do not sort the data rows themselves; the index structure is stored separately from the data, and you can create more than one non-clustered index per table or view. The index in the back of a book is a good analogy: a non-clustered index holds the indexed columns plus a pointer or bookmark to the actual row (in the book example, a page number). A search done on Google or another search engine is a similar example: the results page contains links to the original web pages.
    In addition to being clustered or non-clustered, an index can be configured in other ways:
    a. Composite index: An index that contains more than one column. Both clustered and non-clustered indexes can be composite, subject to the key-size limits (for example, a clustered index key is limited to 16 columns and 900 bytes).
    b. Unique index: An index that ensures the uniqueness of each value in the indexed column. If the index is composite, the uniqueness is enforced across the columns as a whole, not on the individual columns. For example, with a unique index on the FirstName and LastName columns of a table, the name pairs must be unique, but the individual names can be duplicated.
    c. Covering index: When all of the columns required by a query's SELECT list are part of the index, it is called a covering index. It is created using the INCLUDE clause, which is available only on a non-clustered index. Including non-key columns in the index can significantly improve query performance, because the query optimizer can locate all the column values within the index; the table or clustered index data is not accessed, resulting in fewer disk I/O operations (see the sketch after this list).
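
A minimal covering-index sketch (hypothetical names): the query's SELECT list and predicate are answered entirely from the index.

 -- Covers: SELECT LastName, FirstName, Phone FROM dbo.Employees WHERE LastName = ...
 CREATE NONCLUSTERED INDEX IX_Employees_LastName_Covering
     ON dbo.Employees (LastName)
     INCLUDE (FirstName, Phone);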

9. What are the disadvantages of using an Index?

  • Disk Space: Indexes are stored on disk, and the amount of space required depends on the size of the table and the number and types of columns used in the index. Disk space is generally cheap enough to trade for application performance, particularly when a database serves a large number of users.
  • Data Modification: Any time a query modifies the data in a table (INSERT, UPDATE, or DELETE), the database needs to update all of the indexes where data has changed. Hence, having too many indexes to update can actually hurt the performance of data modifications.

10. What are the factors to be considered while creating an Index?

Well-designed indexes can reduce disk I/O operations and consume fewer system resources, thereby improving query performance. Consider the following guidelines when you design an index.

  • Understand the characteristics of the columns used in the queries. For example, an index is ideal for columns that have an integer data type and are also unique or non-null.
  • Create non-clustered indexes on the columns that are frequently used in predicates (WHERE, HAVING) and join conditions. However, avoid adding unnecessary columns: too many index columns can adversely affect disk space and index maintenance performance.
  • Avoid over-indexing heavily updated tables, and keep indexes narrow, that is, with as few columns as possible.
  • Use many indexes to improve query performance on tables with low update requirements but large volumes of data. Large numbers of indexes can help the performance of queries that do not modify data, such as SELECT statements.
  • Indexes on views can provide significant performance gains when the view contains aggregations, table joins, or a combination of the two.
  • Determine which index options might enhance performance when the index is created or maintained. For example, creating a clustered index on an existing large table would benefit from the ONLINE index option, which allows concurrent activity on the underlying data to continue while the index is being created or rebuilt (see the sketch after this list).
  • Determine the optimal storage location for the index. A non-clustered index can be stored in the same filegroup as the underlying table, or on a different filegroup. The storage location of indexes can improve query performance by increasing disk I/O throughput; for example, storing a non-clustered index on a filegroup that is on a different disk than the table filegroup can improve performance because multiple disks can be read at the same time. Alternatively, clustered and non-clustered indexes can use a partition scheme across multiple filegroups. Partitioning makes large tables or indexes more manageable by letting you access or manage subsets of data quickly and efficiently, while maintaining the integrity of the overall collection.
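
A hedged example of the ONLINE option mentioned above (table and index names are made up; note that online index operations are an Enterprise-edition feature):

 -- Build the index while keeping the table available for concurrent queries.
 CREATE CLUSTERED INDEX IX_Orders_OrderDate
     ON dbo.Orders (OrderDate)
     WITH (ONLINE = ON);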

Snippet from an article by Ombir Rathee

Keep Learning 🙂

The Data Monk



Top 5 Regression Techniques

How many regression techniques do you know?
Linear Regression?
Logistic Regression?

These two are the building blocks of your Data Science career. There are N number of regression models out there, but we will try to cover the top 5 models, which can come in handy for you.

First of all, what on earth is regression?
In layman's terms, regression is a way to extrapolate from your dataset to predict values on the basis of independent variables.

If you know the attendance, class test marks, and last year's records of a student, then you can predict his/her performance in the current semester.

If you know the weight, BMI, sugar level, etc. of a patient, then you can predict if he/she will suffer from Diabetes in the future.

In the same way, you can predict a lot of things. Using regression you can observe the relationship between the independent and dependent variables of your model, and you can also check the correlation between the variables.

This simple method will help you evaluate important variables and then use them directly in your model.

1. Linear Regression
Linear Regression is the king of all regression models for predicting continuous values.
Okay, so Linear Regression predicts continuous values and Logistic Regression predicts probability values. So, Linear is used to predict some value (like the cost of an apartment, the number of upcoming tickets, etc.) and Logistic is used to classify things (like whether a person will suffer from some disease, or the probability of a client buying insurance, etc.).

Linear Regression is a simple Y = mX + C line, where we predict the values of Y on the basis of various X's and a constant C.

Assumptions of Linear Regression (Interview Question):
1. A linear relationship between the independent and dependent variables
2. Little or no multicollinearity
3. No auto-correlation – autocorrelation occurs when the residuals are not independent of each other, i.e. when the value of y(x+1) is not independent of the value of y(x)
4. Homoscedasticity

Linear regression also has a normality assumption: strictly speaking, it is the residuals (rather than every variable) that should be approximately normally distributed. A quick fit-and-check sketch in R follows.
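
A minimal sketch in R, assuming a hypothetical data frame df with an outcome Y and predictors X1, X2:

 # Fit a linear model and inspect coefficients, R-squared and p-values.
 fit <- lm(Y ~ X1 + X2, data = df)
 summary(fit)

 # Eyeball the assumptions: residuals vs fitted (homoscedasticity),
 # Q-Q plot (normality of residuals), and so on.
 par(mfrow = c(2, 2))
 plot(fit)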

I have the freedom of speech 😛

Look, in Linear Regression you have to predict a value using the equation Y = mX + C; here X covers all the independent variables, e.g. m1X1, m2X2, etc., and C is just a constant. This equation gives you a line, and extrapolating that line gives you the future values. These X1, X2 should not be correlated with each other, meaning it should not be the case that X1 and X2 both have the same impact on Y. R square, Adjusted R square, Accuracy, etc. tell you how good the model turned out. There is a whole article on this on the website, go dig it out and read it.

2. Logistic Regression

Let's start with a simple definition – Logistic Regression is mostly used for classification, and that too mostly binary classification. Suppose you want to classify whether an advertisement will receive a hit or not. You will have a lot of independent variables; you take a handful of relevant ones and build your Logistic Regression model, which then gives you a probability between 0 and 1 for each advertisement. In a typical setup, if the probability is between 0 and 0.5 you map it to 0, else to 1.

– Logistic regression doesn't require a linear relationship between the dependent and independent variables. It can handle various types of relationships because it applies a non-linear log transformation to the predicted odds ratio.
– Like linear regression, it requires little or no multicollinearity among the independent variables.
– It is a multinomial logistic regression if the dependent variable is multiclass rather than binary, i.e. you are predicting more than two classes.
– It requires large sample sizes, because maximum likelihood estimates are less powerful at low sample sizes than ordinary least squares.

How to check the performance of a Logistic Regression model?
1. Confusion matrix – a model that is 99% accurate can still be of no use to the system. Why? Think about it, dumbo 😛 (hint: imbalanced classes).
2. Likelihood Test – A logistic regression is said to provide a better fit to the data if it demonstrates an improvement over a model with fewer predictors. This is performed using the likelihood ratio test, which compares the likelihood of the data under the full model against the likelihood of the data under a model with fewer predictors.
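
A minimal sketch in R, assuming a hypothetical data frame ads with a binary column clicked and a couple of predictors:

 # Fit a binary logistic regression.
 fit <- glm(clicked ~ impressions + position, data = ads, family = binomial)
 summary(fit)

 # Predicted probabilities, then the 0.5 cutoff described above.
 p <- predict(fit, type = "response")
 pred <- ifelse(p > 0.5, 1, 0)
 table(Predicted = pred, Actual = ads$clicked)  # confusion matrix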

3. Ridge Regression

Now that you are done with Linear and Logistic Regression, it's time to check out some other regression techniques.
Ridge regression is used to analyze multiple regression data that suffer from multicollinearity.

Let's suppose that our regression equation is
Y = XB + e
where Y is the dependent variable, X is the matrix of independent variables, B holds the regression coefficients to be estimated, and e is nothing but the errors/residuals.

In ridge regression, we first subtract the mean from each variable and then divide by its standard deviation; this way the variables are standardized.

Ridge regression is a method that seeks to reduce the MSE by adding some bias and, at the same time, reducing the variance.
From an equation standpoint, you can think of ordinary least squares as a method that finds the coefficients minimizing the sum of the squares of the residuals. Ridge regression adds one more term to be minimized: along with the sum of the squared residuals, it adds a constraint on the sum of the squares of the regression coefficients. This second term, the sum of the squares of the regression coefficients, is how the bias is introduced into the model.

Freedom of speech
The Linear Regression equation is Y = mX + C, but one more component comes into it, i.e. error, so the equation becomes Y = mX + C + error. This error is not simple either; it is called prediction error, and it arises because of bias and variance. Bias is nothing but the assumptions made at the time the model is fitted in order to make an accurate prediction, and variance is simply the distance of all the data points from the predicted line. Ridge regression works on exactly these two parts. It's not that hard; read it twice and you'll get it, and if you still don't, take it easy and move on, you don't have to understand everything 😛
To handle bias and variance together, the ridge objective combines the least-squares sum with the lambda penalty on the betas: minimize Σ(yᵢ − ŷᵢ)² + λ Σ βⱼ². A sketch follows.
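
A minimal sketch using the glmnet package (alpha = 0 gives the ridge penalty); the data frame df and its columns are assumptions:

 library(glmnet)

 # glmnet wants a numeric predictor matrix and a response vector;
 # it standardizes the predictors by default (standardize = TRUE).
 x <- as.matrix(df[, c("X1", "X2", "X3")])
 y <- df$Y

 # alpha = 0 -> ridge; cv.glmnet picks lambda by cross-validation.
 cv  <- cv.glmnet(x, y, alpha = 0)
 fit <- glmnet(x, y, alpha = 0, lambda = cv$lambda.min)
 coef(fit)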

4. Stepwise Regression

Stepwise Regression is fairly simple to understand. Basically, there are two types of stepwise regression:
1. Forward Stepwise Regression – The model starts with zero predictor variables and gradually keeps adding the most useful variable at each step, checking the performance of the model, until no further addition improves it.

2. Backward Stepwise Regression – Starts with all the variables and keeps dropping the least useful one at each step until no further removal improves the model.

R code

 # Forward stepwise needs a null starting model plus a scope of candidate terms.
 step(lm(Y ~ 1, data = FlowerData), scope = ~ X1 + X2 + X3 + X4, direction = "forward")
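
Backward elimination is the mirror image; a hedged sketch with the same hypothetical FlowerData:

 # Start from the full model and let step() drop variables by AIC.
 step(lm(Y ~ X1 + X2 + X3 + X4, data = FlowerData), direction = "backward")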

5. Lasso Regression

Lasso is useful because it helps prevent overfitting on the training dataset. The key idea is that it penalizes the L1 norm of the weights found by the regression line. The L1 norm of a vector is the sum of the magnitudes of its components in a coordinate space; it is a natural way of measuring the distance between vectors, namely the sum of the absolute differences of their components. In this norm, all the components of the vector are weighted equally.

One approach is to add λ ∑_{i=1}^{M} |w_i| to the regression error.
The second approach is to minimize the regression error subject to the constraint ∑_{i=1}^{M} |w_i| ≤ η.
Both approaches can be shown to be equivalent using Lagrange multipliers.

Lasso is great for reducing the feature space, because when λ is sufficiently large, many of the weights (w_i) are driven exactly to zero.

One disadvantage is that using Lasso doesn’t give closed-form analytic solutions and requires quadratic programming to solve.
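
In practice you rarely solve the quadratic program by hand; a minimal sketch with glmnet (alpha = 1 gives the lasso penalty), reusing the assumed x and y from the ridge sketch:

 library(glmnet)

 # alpha = 1 -> lasso; cross-validation picks lambda.
 cv  <- cv.glmnet(x, y, alpha = 1)
 fit <- glmnet(x, y, alpha = 1, lambda = cv$lambda.min)

 # With a sufficiently large lambda, many coefficients are exactly zero.
 coef(fit)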

Supporting Link – https://www.quora.com/What-is-the-LASSO-technique-for-regression

In the upcoming articles we will have detailed coding examples in R.

Try applying Logistic, Linear, and Stepwise Regression on your own 🙂

Keep Learning 🙂

The Data Monk