Home SQL 10 Days Series SQL Query Optimization and Data Modeling Interview Questions
SQL 10 Days Series

SQL Query Optimization and Data Modeling Interview Questions

SQL Series – This series aim to make sure you are able to learn and understand at least 100 of the most asked concepts in SQL interviews.
Today is Day 3 of the SQL Series – SQL Query Optimization and Data Modeling Interview Questions
Bookmark this page, and make sure to be consistent in it. Also, get a notebook and make notes of each question/concept put in here

We have created this series to make you practice and learn ~120 Most Asked SQL interview Questions

Day 1 – Basic SQL Questions + Most Asked SQL Interview Questions
Day 2 – JSON, Date and Time SQL Interview Questions
Day 3 – SQL Query Optimization and Data Modeling Interview Questions
Day 4 – Transactions and Concurrency
Day 5 – SQL Theory + SQL Hard Interview Questions
Day 6 – Advanced SQL Aggregation + Advanced Joins
Day 7 – Error Handling & Debugging + Stored Procedures & CTE
Day 8 – Complex Data Handling
Day 9 – Data Modeling
Day 10 – SQL Most Asked Query Interview Questions

The Data Monk Best Seller Books and Services

1. Become a Full Stack Analytics Professional in 2026 – 2200 Interview Questions covering 23 topics
2. Book mock interviews/Career Guidance Session on Top Mate to practice


And below is our master class

1 Month Complete 1:1 Mentorship along with 3 e-books, 4 mock interviews, 4 career guidance sessions, and a complete resume review

You can also go through our The Data Monk’s Youtube Channel with 160+ videos across all the top interview topics

Table of Contents

  1. Introduction
  2. What is SQL Query Optimization
  3. SQL Query Optimization Techniques
  4. Data Modeling Fundamentals
  5. FAQ
  6. Conclusion

Introduction

In modern data-driven organizations, the ability to write efficient SQL queries and design optimal database schemas is crucial for performance, scalability, and cost-efficiency. Whether you’re a data analyst, database engineer, or full-stack developer, mastering SQL query optimization and data modeling can significantly improve your system’s performance.
SQL Query Optimization and Data Modeling Interview Questions

This comprehensive guide covers:

  • SQL query optimization techniques with real-world examples
  • Data modeling strategies to structure your databases effectively
  • Common mistakes to avoid
  • Best practices used by top companies

What is SQL Query Optimization?

SQL query optimization is the process of improving the performance of SQL queries by reducing execution time, minimizing resource consumption (CPU, memory, disk I/O), and decreasing the overall load on the database server.

Why SQL Query Optimization Matters

  • Faster application response times – Users experience better performance
  • Reduced infrastructure costs – Less CPU and memory usage
  • Better scalability – Handle more concurrent users
  • Improved user experience – Reduce page load times
  • Cost savings – Especially important for cloud databases that charge per query

SQL Query Optimization Techniques: Detailed Solutions

Question 1: What is Query Execution Plan and How to Use It for Optimization?

Question

In a large e-commerce database with millions of orders, your inventory query is taking 45 seconds to return results. How would you use the query execution plan to identify the bottleneck and optimize the query?

Detailed Answer

A query execution plan (also called execution plan or query plan) is a visual representation of the steps the database engine takes to execute a SQL query. It shows:

  • Which tables are being scanned
  • Whether indexes are being used
  • Join operations and their efficiency
  • Cost estimates for each operation

Step-by-Step Approach:

Step 1: Get the Execution Plan

For SQL Server:

sql

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT o.order_id, o.order_date, p.product_name, inv.quantity
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
JOIN inventory inv ON p.product_id = inv.product_id
WHERE o.order_date >= '2024-01-01';

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;

For PostgreSQL:

sql

EXPLAIN ANALYZE
SELECT o.order_id, o.order_date, p.product_name, inv.quantity
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
JOIN inventory inv ON p.product_id = inv.product_id
WHERE o.order_date >= '2024-01-01';

Step 2: Analyze the Plan – Look For:

  • Seq Scan (Sequential Scan) – Red flag! The database is scanning entire tables
  • Index Scan vs Index Seek – Seek is better (more efficient)
  • Cost estimates – Higher numbers indicate inefficiency
  • Loop joins vs Hash joins – Hash joins are faster for large datasets
  • Nested loops – Can be expensive with large result sets

Step 3: Common Bottlenecks

ProblemCauseSolution
High I/O reading millions of rowsNo index on WHERE clause columnsAdd index on order_date
Nested loop join taking 50%+ of timeJoining large tables without indexAdd index on join columns
Multiple table scansNo indexes at allAnalyze and add strategic indexes
Network latencyRetrieving too many columnsUse column projection (SELECT specific columns)

Step 4: Optimized Query

sql

-- Original problematic query
SELECT o.order_id, o.order_date, p.product_name, inv.quantity
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
JOIN inventory inv ON p.product_id = inv.product_id
WHERE o.order_date >= '2024-01-01';

-- Optimized version with indexes
-- First, create strategic indexes:
CREATE INDEX idx_orders_date ON orders(order_date);
CREATE INDEX idx_order_items_order ON order_items(order_id);
CREATE INDEX idx_order_items_product ON order_items(product_id);
CREATE INDEX idx_products_id ON products(product_id);
CREATE INDEX idx_inventory_product ON inventory(product_id);

-- Then run the same query (execution plan will now show Index Seek instead of Scan)
SELECT o.order_id, o.order_date, p.product_name, inv.quantity
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
JOIN inventory inv ON p.product_id = inv.product_id
WHERE o.order_date >= '2024-01-01';

Expected Results:

  • Query time: 45 seconds → 2-3 seconds (93% improvement!)
  • Logical reads: 500,000+ → 50,000 (90% reduction!)

Key Takeaway: Always analyze the execution plan before optimizing. It reveals exactly where the database is spending time.


Question 2: How to Use Indexes Effectively Without Slowing Down Inserts?

Question

You have a table that receives 10,000 INSERT operations per hour. You need to optimize SELECT queries that take 30 seconds, but adding indexes slows down inserts by 50%. How would you balance read and write performance?

Detailed Answer

The Challenge: Indexes improve SELECT performance but slow down INSERT, UPDATE, and DELETE operations because the database must update every index when data changes.

Solution: Strategic Indexing Approach

Step 1: Understand Index Trade-offs

OperationWithout IndexWith Index
SELECT (WHERE clause)Table Scan (Slow)Index Seek (Fast)
INSERTAdd 1 rowAdd 1 row + update index
UPDATEUpdate rowUpdate row + update index
DELETEDelete rowDelete row + remove from index

Step 2: Choose the Right Type of Index

sql

-- CLUSTERED INDEX (most impactful)
-- Only 1 per table, reorganizes entire table data
CREATE CLUSTERED INDEX idx_orders_pk 
ON orders(order_id);

-- NON-CLUSTERED INDEX (flexible)
-- Multiple per table, separate structure
CREATE NONCLUSTERED INDEX idx_orders_date 
ON orders(order_date)
INCLUDE (customer_id, total_amount);  -- INCLUDE columns you frequently SELECT

-- COMPOSITE INDEX (combines multiple columns)
CREATE NONCLUSTERED INDEX idx_orders_composite 
ON orders(order_date, customer_id)
WHERE order_status = 'completed';  -- Filtered index only indexes relevant rows

Step 3: Use Filtered Indexes to Reduce Index Size

sql

-- BAD: Index on all 10 million rows
CREATE INDEX idx_orders_date ON orders(order_date);

-- GOOD: Index only on active orders (2 million rows)
CREATE INDEX idx_orders_date_active 
ON orders(order_date) 
WHERE order_status IN ('pending', 'in_transit');

-- Benefit: Smaller index = faster inserts, faster queries

Step 4: Batch Inserts to Reduce Index Overhead

sql

-- BAD: Individual inserts (slow with indexes)
INSERT INTO orders VALUES (...);
INSERT INTO orders VALUES (...);
INSERT INTO orders VALUES (...);

-- GOOD: Batch insert (much faster)
INSERT INTO orders (order_id, customer_id, order_date, total_amount)
VALUES 
  (1001, 101, '2024-01-15', 500),
  (1002, 102, '2024-01-15', 750),
  (1003, 103, '2024-01-15', 1200),
  (1004, 104, '2024-01-15', 350);

Step 5: Monitor and Maintain Indexes

sql

-- Find fragmented indexes (become less efficient over time)
SELECT 
  OBJECT_NAME(ps.object_id) AS TableName,
  i.name AS IndexName,
  ps.avg_fragmentation_in_percent AS Fragmentation
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') ps
JOIN sys.indexes i ON ps.object_id = i.object_id AND ps.index_id = i.index_id
WHERE ps.avg_fragmentation_in_percent > 10;

-- Rebuild highly fragmented indexes (> 30%)
ALTER INDEX idx_orders_date ON orders REBUILD;

-- Reorganize moderately fragmented indexes (10-30%)
ALTER INDEX idx_orders_date ON orders REORGANIZE;

Real-World Example: 10,000 Inserts/Hour

sql

-- BALANCED APPROACH:

-- 1. Clustered index on primary key (almost always needed)
CREATE CLUSTERED INDEX PK_orders ON orders(order_id);

-- 2. Few critical non-clustered indexes for frequent queries
CREATE NONCLUSTERED INDEX idx_orders_date ON orders(order_date);
CREATE NONCLUSTERED INDEX idx_orders_customer ON orders(customer_id);

-- 3. Use filtered indexes for hot queries
CREATE NONCLUSTERED INDEX idx_pending_orders 
ON orders(order_date) 
WHERE order_status = 'pending';

-- 4. Batch inserts (application level)
-- Instead of 10,000 individual inserts, do 100 batches of 100
BEGIN TRANSACTION;
INSERT INTO orders (...) VALUES (...), (...), ..., (...);  -- 100 rows
COMMIT;
-- Repeat 100 times per hour

Expected Performance:

  • SELECT query time: 30 seconds → 1-2 seconds ✅
  • INSERT performance impact: 50% slowdown → 5-10% slowdown ✅

Key Takeaway: Use filtered indexes and batch inserts to balance read and write performance. Not every column needs an index.


Question 3: How to Optimize Joins – When to Use Different Join Types?

Question

You have a query joining 5 tables with 10 million+ rows each. The query takes 2 minutes. How would you optimize it by choosing the right join types and order?

Detailed Answer

Join Types and Their Performance Characteristics:

sql

-- 1. INNER JOIN (fastest, excludes unmatched rows)
SELECT o.order_id, c.customer_name
FROM orders o
INNER JOIN customers c ON o.customer_id = c.customer_id;
-- Use when: You only need matching records

-- 2. LEFT JOIN (slower, includes unmatched rows from left table)
SELECT o.order_id, c.customer_name
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.customer_id;
-- Use when: You need all orders even if customer is null

-- 3. CROSS JOIN (very slow, Cartesian product)
SELECT * FROM orders CROSS JOIN customers;
-- Use when: You need all combinations (rarely!)

-- 4. HASH JOIN vs LOOP JOIN vs MERGE JOIN
-- Database chooses automatically, but you can guide it

Step 1: Identify the Optimal Join Order

sql

-- SLOW: Joining 5 tables inefficiently
-- Starts with largest table (orders 10M rows)
SELECT 
  o.order_id, 
  c.customer_name, 
  p.product_name, 
  s.store_name, 
  inv.quantity
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
JOIN stores s ON o.store_id = s.store_id
JOIN inventory inv ON p.product_id = inv.product_id
WHERE o.order_date >= '2024-01-01'
  AND s.country = 'USA'
  AND p.category = 'Electronics';

-- OPTIMIZED: Use smaller tables first
-- Join strategy: Filter first, then join
WITH filtered_orders AS (
  SELECT o.order_id, o.customer_id, o.store_id
  FROM orders o
  WHERE o.order_date >= '2024-01-01'
),
filtered_stores AS (
  SELECT store_id, store_name
  FROM stores
  WHERE country = 'USA'
),
filtered_products AS (
  SELECT product_id, product_name, category
  FROM products
  WHERE category = 'Electronics'
)
SELECT 
  fo.order_id, 
  c.customer_name, 
  fp.product_name, 
  fs.store_name, 
  inv.quantity
FROM filtered_orders fo
JOIN customers c ON fo.customer_id = c.customer_id
JOIN filtered_stores fs ON fo.store_id = fs.store_id
JOIN order_items oi ON fo.order_id = oi.order_id
JOIN filtered_products fp ON oi.product_id = fp.product_id
JOIN inventory inv ON fp.product_id = inv.product_id;

Step 2: Reduce Data Early with WHERE Clauses

sql

-- SLOW: Joins all rows, then filters
-- 10M + 10M + 10M + 10M + 10M = 50M row joins before filtering
SELECT *
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN products p ON ...
JOIN stores s ON ...
JOIN inventory inv ON ...
WHERE o.order_date >= '2024-01-01'
  AND s.country = 'USA'
  AND p.category = 'Electronics';

-- FAST: Filter before joining
-- Apply WHERE on base tables first
SELECT *
FROM orders o
WHERE o.order_date >= '2024-01-01'
JOIN customers c ON o.customer_id = c.customer_id
JOIN (SELECT * FROM products WHERE category = 'Electronics') p ON ...
JOIN (SELECT * FROM stores WHERE country = 'USA') s ON ...
JOIN inventory inv ON ...;

Step 3: Ensure Join Columns Are Indexed

sql

-- CRITICAL: Ensure join columns have indexes
CREATE NONCLUSTERED INDEX idx_orders_customer_id ON orders(customer_id);
CREATE NONCLUSTERED INDEX idx_orders_store_id ON orders(store_id);
CREATE NONCLUSTERED INDEX idx_order_items_order_id ON order_items(order_id);
CREATE NONCLUSTERED INDEX idx_order_items_product_id ON order_items(product_id);
CREATE NONCLUSTERED INDEX idx_inventory_product_id ON inventory(product_id);
CREATE NONCLUSTERED INDEX idx_customers_id ON customers(customer_id);
CREATE NONCLUSTERED INDEX idx_products_id ON products(product_id);
CREATE NONCLUSTERED INDEX idx_stores_id ON stores(store_id);

-- Without these indexes, the database does full table scans for each join

Performance Impact:

  • Original query: 2 minutes (120 seconds)
  • Optimized query: 3-5 seconds
  • Improvement: 96% faster! ✅

Key Takeaway: The join order matters! Start with smaller, filtered tables and ensure all join columns are indexed.


Question 4: How to Optimize Subqueries and CTEs?

Question

You have a query with multiple nested subqueries that calculates top customers, their purchases, and refunds. The query takes 45 seconds. How would you optimize it using CTEs and materialization?

Detailed Answer

Subqueries vs CTEs vs Temporary Tables:

sql

-- 1. SUBQUERY (runs multiple times, slow)
SELECT 
  customer_id,
  customer_name,
  (SELECT COUNT(*) FROM orders o WHERE o.customer_id = c.customer_id) as order_count,
  (SELECT SUM(total_amount) FROM orders o WHERE o.customer_id = c.customer_id) as total_spent
FROM customers c
WHERE customer_id IN (
  SELECT customer_id 
  FROM orders 
  GROUP BY customer_id 
  HAVING COUNT(*) > 10
);

-- Problem: Subqueries in SELECT clause run for EVERY row (slow!)

-- 2. CTE (Common Table Expression - reads data once, faster)
WITH top_customers AS (
  SELECT 
    customer_id,
    COUNT(*) as order_count,
    SUM(total_amount) as total_spent
  FROM orders
  GROUP BY customer_id
  HAVING COUNT(*) > 10
)
SELECT 
  c.customer_id,
  c.customer_name,
  tc.order_count,
  tc.total_spent
FROM customers c
JOIN top_customers tc ON c.customer_id = tc.customer_id;

-- Better! CTE reads data once, then joins

-- 3. TEMPORARY TABLE (physical storage, even faster for complex operations)
CREATE TEMP TABLE top_customers_temp AS
SELECT 
  customer_id,
  COUNT(*) as order_count,
  SUM(total_amount) as total_spent
FROM orders
GROUP BY customer_id
HAVING COUNT(*) > 10;

SELECT 
  c.customer_id,
  c.customer_name,
  tc.order_count,
  tc.total_spent
FROM customers c
JOIN top_customers_temp tc ON c.customer_id = tc.customer_id;

DROP TABLE top_customers_temp;

Step 1: Convert Nested Subqueries to CTEs

sql

-- SLOW: Nested subqueries
SELECT 
  o.order_id,
  o.customer_id,
  (SELECT AVG(total_amount) FROM orders 
   WHERE customer_id = o.customer_id) as customer_avg_order_value,
  (SELECT COUNT(*) FROM refunds r 
   WHERE r.order_id = o.order_id) as refund_count,
  (SELECT SUM(amount) FROM refunds r 
   WHERE r.order_id = o.order_id) as refund_amount
FROM orders o
WHERE o.order_date >= '2024-01-01';

-- FAST: Use CTEs
WITH customer_metrics AS (
  SELECT 
    customer_id,
    AVG(total_amount) as customer_avg_order_value
  FROM orders
  GROUP BY customer_id
),
refund_metrics AS (
  SELECT 
    order_id,
    COUNT(*) as refund_count,
    SUM(amount) as refund_amount
  FROM refunds
  GROUP BY order_id
)
SELECT 
  o.order_id,
  o.customer_id,
  cm.customer_avg_order_value,
  COALESCE(rm.refund_count, 0) as refund_count,
  COALESCE(rm.refund_amount, 0) as refund_amount
FROM orders o
JOIN customer_metrics cm ON o.customer_id = cm.customer_id
LEFT JOIN refund_metrics rm ON o.order_id = rm.order_id
WHERE o.order_date >= '2024-01-01';

Step 2: Use Multiple CTEs Instead of Nested Subqueries

sql

-- Clear, maintainable, and optimized
WITH 
orders_filtered AS (
  SELECT order_id, customer_id, total_amount, order_date
  FROM orders
  WHERE order_date >= '2024-01-01'
),
customer_stats AS (
  SELECT 
    customer_id,
    COUNT(*) as lifetime_orders,
    AVG(total_amount) as avg_order_value,
    MAX(order_date) as last_order_date
  FROM orders
  GROUP BY customer_id
),
refund_summary AS (
  SELECT 
    order_id,
    COUNT(*) as refund_count,
    SUM(amount) as total_refunded
  FROM refunds
  GROUP BY order_id
)
SELECT 
  of.order_id,
  of.customer_id,
  cs.lifetime_orders,
  cs.avg_order_value,
  cs.last_order_date,
  COALESCE(rs.refund_count, 0) as refunds,
  COALESCE(rs.total_refunded, 0) as refund_amount
FROM orders_filtered of
JOIN customer_stats cs ON of.customer_id = cs.customer_id
LEFT JOIN refund_summary rs ON of.order_id = rs.order_id;

Performance Comparison:

  • Nested subqueries: 45 seconds
  • CTEs: 3-5 seconds
  • Improvement: 90% faster! ✅

Key Takeaway: CTEs are more readable and performant than nested subqueries. Always use CTEs for complex logic.


Question 5: How to Optimize GROUP BY and Aggregation Queries?

Question

Your daily sales report query that groups 100 million rows of transaction data takes 3 minutes to run. What techniques would you use to optimize this aggregation query?

Detailed Answer

Challenge: GROUP BY operations on large datasets require:

  1. Sorting all rows
  2. Aggregating by groups
  3. This can be very resource-intensive

Solution: Multi-Layered Optimization

Step 1: Ensure Columns are Indexed

sql

-- SLOW: No indexes on GROUP BY columns
SELECT 
  DATE(transaction_date) as date,
  product_category,
  payment_method,
  COUNT(*) as transaction_count,
  SUM(amount) as total_sales,
  AVG(amount) as avg_transaction
FROM transactions
WHERE transaction_date >= '2024-01-01'
GROUP BY DATE(transaction_date), product_category, payment_method;

-- FAST: Create strategic indexes
CREATE NONCLUSTERED INDEX idx_transactions_date_category 
ON transactions(transaction_date, product_category, payment_method)
INCLUDE (amount);

-- Now same query runs much faster
SELECT 
  DATE(transaction_date) as date,
  product_category,
  payment_method,
  COUNT(*) as transaction_count,
  SUM(amount) as total_sales,
  AVG(amount) as avg_transaction
FROM transactions
WHERE transaction_date >= '2024-01-01'
GROUP BY DATE(transaction_date), product_category, payment_method;

Step 2: Filter Data Early

sql

-- SLOW: Aggregate all data, then filter
SELECT 
  DATE(transaction_date) as date,
  product_category,
  COUNT(*) as count
FROM transactions
GROUP BY DATE(transaction_date), product_category
HAVING COUNT(*) > 1000;
-- This processes ALL 100M rows before filtering

-- FAST: Filter WHERE, then aggregate
SELECT 
  DATE(transaction_date) as date,
  product_category,
  COUNT(*) as count
FROM transactions
WHERE transaction_date >= '2024-01-01' -- Filters to 10M rows
GROUP BY DATE(transaction_date), product_category
HAVING COUNT(*) > 1000;
-- Now processes only 10M rows

Step 3: Pre-aggregate or Materialize Results

sql

-- For frequently run reports, pre-calculate and store results

-- 1. Create a summary table
CREATE TABLE daily_sales_summary (
  date DATE,
  product_category VARCHAR(100),
  payment_method VARCHAR(100),
  transaction_count INT,
  total_sales DECIMAL(15,2),
  avg_transaction DECIMAL(10,2),
  PRIMARY KEY (date, product_category, payment_method)
);

-- 2. Populate it once daily (off-peak hours)
INSERT INTO daily_sales_summary
SELECT 
  DATE(transaction_date),
  product_category,
  payment_method,
  COUNT(*) as transaction_count,
  SUM(amount) as total_sales,
  AVG(amount) as avg_transaction
FROM transactions
WHERE DATE(transaction_date) = CURRENT_DATE - 1
GROUP BY DATE(transaction_date), product_category, payment_method;

-- 3. Query the summary instead (instant!)
SELECT * 
FROM daily_sales_summary
WHERE date >= '2024-01-01'
ORDER BY date DESC;

-- Result: Query time 3 minutes → < 100ms ✅

Step 4: Use Approximate Aggregations for Large Datasets

sql

-- When exact numbers aren't critical, use APPROXIMATE COUNT DISTINCT
-- (Available in SQL Server, PostgreSQL, etc.)

-- SLOW: Exact count (uses temp storage)
SELECT 
  COUNT(DISTINCT customer_id) as unique_customers
FROM transactions
WHERE transaction_date >= '2024-01-01';

-- FAST: Approximate count (uses HyperLogLog algorithm)
SELECT 
  APPROX_COUNT_DISTINCT(customer_id) as approx_unique_customers
FROM transactions
WHERE transaction_date >= '2024-01-01';

-- Tradeoff: May be off by 1-2%, but 10x faster

Step 5: Optimize GROUP BY Order

sql

-- GROUP BY order matters! Put low-cardinality columns first
-- (columns with fewer unique values)

-- SLOW: High-cardinality column first
SELECT 
  transaction_id,    -- 100M unique values (HIGH cardinality)
  DATE(transaction_date),
  COUNT(*) as count
FROM transactions
GROUP BY transaction_id, DATE(transaction_date);

-- FAST: Low-cardinality column first
SELECT 
  DATE(transaction_date),  -- ~1000 unique values (LOW cardinality)
  payment_method,          -- ~10 unique values (LOWEST cardinality)
  COUNT(*) as count
FROM transactions
GROUP BY DATE(transaction_date), payment_method;

Real-World Optimization Summary:

TechniqueImpactEffort
Add indexes50-70% fasterLow
Filter WHERE early30-50% fasterLow
Materialize results95%+ fasterMedium
Approximate aggregations80-90% fasterMedium
Optimize GROUP BY order10-20% fasterLow

Final Result:

  • Original query: 3 minutes (180 seconds)
  • With all optimizations: 1-2 seconds
  • Improvement: 99% faster! ✅

Key Takeaway: For large aggregation queries, materialize results into summary tables instead of recalculating from raw data every time.


Data Modeling Fundamentals

Question 1: What is Normalization vs Denormalization? When to Use Each?

Question

You’re designing a database for an e-commerce platform. Should you normalize the database to minimize redundancy, or denormalize for query performance? What are the tradeoffs?

Detailed Answer

Normalization: Organize Data to Reduce Redundancy

Normalization means breaking data into multiple tables to eliminate duplication.

sql

-- NOT NORMALIZED (Denormalized) - Customer info repeated in every order
CREATE TABLE orders_denormalized (
  order_id INT,
  customer_id INT,
  customer_name VARCHAR(100),      -- ❌ Repeated for each order
  customer_email VARCHAR(100),     -- ❌ Repeated for each order
  customer_phone VARCHAR(20),      -- ❌ Repeated for each order
  customer_address VARCHAR(255),   -- ❌ Repeated for each order
  order_date DATE,
  total_amount DECIMAL(10,2)
);
-- Problem: If customer email changes, must update ALL their orders

-- NORMALIZED (3rd Normal Form) - Data split across tables
CREATE TABLE customers (
  customer_id INT PRIMARY KEY,
  customer_name VARCHAR(100),
  customer_email VARCHAR(100),
  customer_phone VARCHAR(20),
  customer_address VARCHAR(255)
);

CREATE TABLE orders (
  order_id INT PRIMARY KEY,
  customer_id INT,
  order_date DATE,
  total_amount DECIMAL(10,2),
  FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
);

-- Benefit: Update customer email once, reflects in all queries

Normalization Levels (0-3NF):

LevelDescriptionExample
UnnormalizedData in one table, duplicates everywhereAll customer info in every order row
1NF (First)Remove repeating groupsProduct list is separate from order
2NF (Second)Remove partial dependenciesOrder details in separate table
3NF (Third)Remove transitive dependenciesCustomer phone in customer table, not order

Denormalization: Optimize for Query Performance

Denormalization means keeping some redundant data to avoid joins.

sql

-- DENORMALIZED for speed
SELECT order_id, customer_name, total_amount
FROM orders_denormalized
WHERE order_date >= '2024-01-01';
-- ✅ No joins needed, very fast!

-- NORMALIZED (requires join)
SELECT o.order_id, c.customer_name, o.total_amount
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_date >= '2024-01-01';
-- ❌ Requires join, slower

Decision Framework: Normalize vs Denormalize?

Use NORMALIZATION (3NF) When:

  • ✅ Data is frequently written (INSERT, UPDATE, DELETE)
  • ✅ Consistency is critical
  • ✅ Storage space is limited
  • ✅ Database is small to medium size
  • ✅ Real-time accuracy is important

Use DENORMALIZATION When:

  • ✅ Data is frequently read (SELECT)
  • ✅ Write operations are rare
  • ✅ Query performance is critical
  • ✅ You have plenty of storage
  • ✅ Analytics or reporting database
  • ✅ You have a Data Warehouse or Data Lake

Practical E-Commerce Example:

sql

-- OPERATIONAL DATABASE (Normalized - for transactions)
CREATE TABLE customers (
  customer_id INT PRIMARY KEY,
  name VARCHAR(100),
  email VARCHAR(100),
  phone VARCHAR(20)
);

CREATE TABLE orders (
  order_id INT PRIMARY KEY,
  customer_id INT,
  order_date DATE,
  FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
);

CREATE TABLE order_items (
  order_item_id INT PRIMARY KEY,
  order_id INT,
  product_id INT,
  quantity INT,
  FOREIGN KEY (order_id) REFERENCES orders(order_id)
);

-- ANALYTICS DATABASE (Denormalized - for reporting)
CREATE TABLE order_summary (
  order_id INT,
  customer_id INT,
  customer_name VARCHAR(100),        -- Denormalized from customers
  customer_email VARCHAR(100),       -- Denormalized from customers
  order_date DATE,
  total_items INT,                   -- Pre-calculated from order_items
  total_amount DECIMAL(10,2),        -- Pre-calculated
  PRIMARY KEY (order_id)
);

Key Takeaway: Use normalized databases for operational systems (OLTP) and denormalized databases for analytics (OLAP).


Question 2: How to Design an Effective Database Schema Using Star Schema and Snowflake Schema?

Question

You’re building a data warehouse for analytics that needs to support millions of queries on sales, customers, products, and time dimensions. Should you use a Star Schema or Snowflake Schema? What are the differences and when to use each?

Detailed Answer

Star Schema: Simple and Fast

Star schema has one fact table (center) and multiple dimension tables (surrounding it).

sql

-- FACT TABLE: Contains metrics and foreign keys to dimensions
CREATE TABLE fact_sales (
  sale_id INT PRIMARY KEY,
  time_id INT,
  customer_id INT,
  product_id INT,
  store_id INT,
  quantity INT,
  amount DECIMAL(10,2),
  discount DECIMAL(5,2),
  FOREIGN KEY (time_id) REFERENCES dim_time(time_id),
  FOREIGN KEY (customer_id) REFERENCES dim_customer(customer_id),
  FOREIGN KEY (product_id) REFERENCES dim_product(product_id),
  FOREIGN KEY (store_id) REFERENCES dim_store(store_id)
);

-- DIMENSION TABLES: Descriptive attributes (denormalized)
CREATE TABLE dim_time (
  time_id INT PRIMARY KEY,
  date DATE,
  year INT,
  month INT,
  day INT,
  quarter INT,
  day_of_week VARCHAR(10),
  is_weekend BOOLEAN,
  holiday_name VARCHAR(50)
);

CREATE TABLE dim_customer (
  customer_id INT PRIMARY KEY,
  name VARCHAR(100),
  email VARCHAR(100),
  city VARCHAR(50),
  state VARCHAR(50),
  country VARCHAR(50),
  customer_segment VARCHAR(50),  -- Denormalized (could be separate table)
  lifetime_value DECIMAL(12,2)   -- Denormalized (calculated once)
);

CREATE TABLE dim_product (
  product_id INT PRIMARY KEY,
  product_name VARCHAR(100),
  category VARCHAR(50),
  subcategory VARCHAR(50),
  brand VARCHAR(50),
  price DECIMAL(10,2),
  supplier_id INT  -- Denormalized for speed
);

CREATE TABLE dim_store (
  store_id INT PRIMARY KEY,
  store_name VARCHAR(100),
  city VARCHAR(50),
  state VARCHAR(50),
  country VARCHAR(50),
  region VARCHAR(50)
);

-- ANALYTICAL QUERY (Simple!)
SELECT 
  t.year,
  t.month,
  c.country,
  p.category,
  COUNT(*) as transaction_count,
  SUM(fs.amount) as total_sales,
  AVG(fs.amount) as avg_sale
FROM fact_sales fs
JOIN dim_time t ON fs.time_id = t.time_id
JOIN dim_customer c ON fs.customer_id = c.customer_id
JOIN dim_product p ON fs.product_id = p.product_id
WHERE t.year = 2024
GROUP BY t.year, t.month, c.country, p.category;

-- Result: Simple joins, fast queries!

Snowflake Schema: Normalized Dimensions

Snowflake extends star schema by normalizing dimension tables.

sql

-- SNOWFLAKE: Same fact table, but dimensions are normalized
CREATE TABLE dim_product (
  product_id INT PRIMARY KEY,
  product_name VARCHAR(100),
  category_id INT,
  brand_id INT,
  price DECIMAL(10,2),
  FOREIGN KEY (category_id) REFERENCES dim_category(category_id),
  FOREIGN KEY (brand_id) REFERENCES dim_brand(brand_id)
);

CREATE TABLE dim_category (
  category_id INT PRIMARY KEY,
  category_name VARCHAR(50),
  department VARCHAR(50)
);

CREATE TABLE dim_brand (
  brand_id INT PRIMARY KEY,
  brand_name VARCHAR(50),
  country VARCHAR(50)
);

-- SNOWFLAKE QUERY (Requires more joins)
SELECT 
  t.year,
  t.month,
  c.country,
  cat.category_name,
  b.brand_name,
  COUNT(*) as transaction_count,
  SUM(fs.amount) as total_sales
FROM fact_sales fs
JOIN dim_time t ON fs.time_id = t.time_id
JOIN dim_product p ON fs.product_id = p.product_id
JOIN dim_category cat ON p.category_id = cat.category_id
JOIN dim_brand b ON p.brand_id = b.brand_id
JOIN dim_customer c ON fs.customer_id = c.customer_id
WHERE t.year = 2024
GROUP BY t.year, t.month, c.country, cat.category_name, b.brand_name;

-- Result: More joins, but saves storage space

Star vs Snowflake Comparison:

FactorStar SchemaSnowflake Schema
ComplexitySimple (4 dimensions)Complex (8+ dimensions)
Query PerformanceFast (fewer joins)Slower (more joins)
Storage10GB for data7GB for data
Update EaseUpdate once per dimensionUpdate in multiple tables
Query Speed⭐⭐⭐⭐⭐ Very Fast⭐⭐⭐ Moderate
Best ForOLAP, Analytics, BILarge enterprises, strict normalization

When to Use Star Schema:

  • ✅ Analytics and BI reporting (most common)
  • ✅ When query speed is critical
  • ✅ Smaller to medium data warehouses
  • ✅ Real-time dashboards

When to Use Snowflake Schema:

  • ✅ Very large data warehouses (100TB+)
  • ✅ When storage space is expensive
  • ✅ When data integrity is stricter priority than speed
  • ✅ Enterprise data warehouses

Practical Recommendation for E-Commerce:

sql

-- START WITH STAR SCHEMA (simpler, faster)
-- Fact table with 5-10 dimensions

-- As warehouse grows and queries become complex,
-- Consider slowly evolving to Snowflake for specific dimensions

-- Example: Keep product dimension denormalized (star),
-- but normalize location data (snowflake) since many tables reference it

Key Takeaway: Star schema is faster for analytics. Use it by default unless you have specific reasons for normalization (data integrity needs, massive scale, etc.).


Frequently Asked Questions

Q: What’s the difference between indexing and materialization?

Indexing: Creates a lookup structure on specific columns to speed up WHERE clauses and JOINs.

  • Index on customer_id makes filtering by customer fast
  • Useful for: WHERE, JOIN, ORDER BY conditions

Materialization: Pre-calculates and stores results of expensive queries.

  • Create a summary table with daily sales by category
  • Store the result physically
  • Useful for: Expensive aggregations, frequently used results

Q: How often should I reorganize indexes?

For high-write tables: Monthly or quarterly For low-write tables: Quarterly or annually Monitor fragmentation using sys.dm_db_index_physical_stats and reorganize when > 10%

Q: What’s the most impactful SQL optimization technique?

Adding the right indexes typically gives the biggest performance gain (50-70% faster), followed by filtering data early in the query (WHERE clauses before JOINs).

Q: Can denormalization cause problems?

Yes! Risks include:

  • Data inconsistency (customer info stored in two places, one gets updated)
  • Higher storage costs
  • Complex INSERT/UPDATE/DELETE logic
  • Use carefully in analytical databases only

Q: What’s the ideal number of indexes per table?

There’s no magic number, but:

  • Too few: Queries are slow
  • Too many: Inserts/updates become slow
  • Sweet spot: 1-5 indexes per table on high-write tables
  • Analytical tables can have 10+ indexes since they’re rarely written to

Conclusion

SQL query optimization and data modeling are essential skills for building high-performance database systems. The key takeaways are:

  1. Always start with execution plans – Understand where the database is spending time
  2. Index strategically – Focus on WHERE clause columns and JOIN conditions
  3. Filter early, join later – Apply WHERE filters before JOINs
  4. Use CTEs over nested subqueries – More readable and performant
  5. Normalize transactional databases, denormalize analytical databases – Different systems have different needs
  6. Use Star Schema for analytics – Simple, fast, and effective
  7. Monitor and maintain – Fragmentation degrades performance over time

With these techniques, you can improve query performance by 90-95% without expensive hardware upgrades.


Want 2,200+ More Questions Like This?

The Data Monk has compiled the most comprehensive analytics interview question bank — SQL, Python, ML, Statistics & Case Studies. Built by 50+ experts from Amazon, Flipkart & OYO.

Scroll to Top