Preparing for the Zomato Data Analyst interview in 2026? This guide covers the most commonly asked Zomato Data Analyst interview questions across all three rounds — SQL and data analysis, Python and pandas, and business case study. Every question includes a complete model answer, what the interviewer is actually testing, and targeted follow-up questions that come up in real Zomato interviews. Whether you are a fresher or have 1–3 years of experience, this is the most complete prep resource for the Zomato DA role.
Python
Pandas
Case Study
Cohort Analysis
Tableau / BI
- Restaurant health scoring
- Hyperlocal demand estimation
- Cohort retention analysis
- Cuisine preference analytics
- Data cleaning pipelines
- Cohort retention in Python
- ML feature engineering
- Rating prediction model
- Cloud kitchen expansion
- Declining order frequency
- Hypothesis-driven thinking
- Metric definition
Recommended Resources
- 2,200+ questions · 23 topics
- SQL, Python, product analytics, case studies
- Company-wise question banks included
- 220+ questions · SQL advanced, data models, DBMS
- Window functions, CTEs, performance tuning
- Zomato-style restaurant data SQL problems
Round 1 — SQL & Data Analysis
| Questions | 5 |
| Duration | 40 minutes |
| Difficulty | Medium–Hard |
| Key Topics |
CTEs Window Functions Cohort Analysis Conditional Aggregation Geospatial Logic Subscription Analytics |
The supply quality team wants a composite health score for every restaurant to identify at-risk partners early. Build a restaurant health score combining: order acceptance rate (40%), average rating (30%), and on-time preparation rate (30%) where prep_time_mins < 20 is considered on time. Rank restaurants by health score within each city.
Table:
- orders(order_id, restaurant_id, user_id, order_date, status, rating, prep_time_mins) — status ∈ {placed, delivered, cancelled}
- restaurants(restaurant_id, name, city)
delivered / placed (not delivered / all)? Many candidates mix up the denominator. They also test whether you use NULLIF for division-by-zero safety and whether you can compose a weighted score using ROUND and window functions for ranking within city partitions.WITH restaurant_stats AS (
SELECT
o.restaurant_id,
r.city,
COUNT(CASE WHEN o.status = 'placed' THEN 1 END) AS total_placed,
COUNT(CASE WHEN o.status = 'delivered' THEN 1 END) AS total_delivered,
AVG(CASE WHEN o.status = 'delivered' THEN o.rating END) AS avg_rating,
COUNT(CASE WHEN o.status = 'delivered'
AND o.prep_time_mins < 20 THEN 1 END) AS on_time_count
FROM orders o
JOIN restaurants r ON o.restaurant_id = r.restaurant_id
GROUP BY o.restaurant_id, r.city
),
health_scores AS (
SELECT
restaurant_id,
city,
total_delivered,
avg_rating,
ROUND(
0.40 * (100.0 * total_delivered / NULLIF(total_placed, 0))
+ 0.30 * (COALESCE(avg_rating, 0) * 20)
+ 0.30 * (100.0 * on_time_count / NULLIF(total_delivered, 0)),
2) AS health_score
FROM restaurant_stats
)
SELECT
restaurant_id,
city,
total_delivered,
ROUND(avg_rating, 2) AS avg_rating,
health_score,
RANK() OVER (PARTITION BY city ORDER BY health_score DESC) AS city_rank
FROM health_scores
ORDER BY city, city_rank;
COALESCE(avg_rating, 0) for new restaurants with no delivered orders yet.- How would you add a time-decay factor so recent orders matter more than orders from 6 months ago?
- Which restaurants are in the bottom 10% of health score in each city? Write that query.
- If a restaurant has only 5 total orders, should its score be treated the same as one with 500 orders?
Zomato divides cities into 500m × 500m hexagons for supply planning. Using the orders table, calculate: (a) orders per hexagon per hour-of-week slot (168 slots = 24 hours × 7 days), (b) identify the top 20 hexagons by evening peak demand (6–9 PM), (c) find hexagons with high demand but zero restaurant coverage — defined as no restaurant within 1.5 km using a simplified flat-earth distance formula.
Tables:
- orders(order_id, user_id, lat, long, restaurant_id, order_datetime, gmv)
- restaurants(restaurant_id, name, lat, long)
FLOOR(lat / 0.0045) for ~500m), can compute 168 hour-of-week slots, and can write a cross-join distance filter. Very few candidates get part (c) without a hint.-- Part (a): Orders per hexagon per hour-of-week slot
WITH hexagon_demand AS (
SELECT
FLOOR(lat / 0.0045) AS hex_lat,
FLOOR(long / 0.0045) AS hex_lon,
EXTRACT(DOW FROM order_datetime) AS day_of_week, -- 0=Sun
EXTRACT(HOUR FROM order_datetime) AS hour_of_day,
(EXTRACT(DOW FROM order_datetime) * 24
+ EXTRACT(HOUR FROM order_datetime))::INT AS hour_of_week_slot,
COUNT(order_id) AS order_count,
SUM(gmv) AS total_gmv
FROM orders
GROUP BY hex_lat, hex_lon, day_of_week, hour_of_day, hour_of_week_slot
),
-- Part (b): Top 20 hexagons by evening peak demand (18:00–20:59)
evening_peak AS (
SELECT
hex_lat,
hex_lon,
SUM(order_count) AS evening_orders
FROM hexagon_demand
WHERE hour_of_day BETWEEN 18 AND 20
GROUP BY hex_lat, hex_lon
ORDER BY evening_orders DESC
LIMIT 20
),
-- Part (c): High-demand hexagons with no restaurant within 1.5 km
-- 1 degree lat ~ 111 km, 1 degree lon ~ 111 km * cos(lat)
-- 1.5 km ~ 0.0135 degrees
hex_centers AS (
SELECT
hex_lat,
hex_lon,
(hex_lat + 0.5) * 0.0045 AS center_lat,
(hex_lon + 0.5) * 0.0045 AS center_lon,
SUM(order_count) AS total_orders
FROM hexagon_demand
GROUP BY hex_lat, hex_lon
),
hex_with_restaurant AS (
SELECT DISTINCT h.hex_lat, h.hex_lon
FROM hex_centers h
JOIN restaurants r
ON ABS(h.center_lat - r.lat) < 0.0135
AND ABS(h.center_lon - r.long) < 0.0135
)
SELECT
h.hex_lat,
h.hex_lon,
h.total_orders,
h.center_lat,
h.center_lon
FROM hex_centers h
LEFT JOIN hex_with_restaurant hwr
ON h.hex_lat = hwr.hex_lat AND h.hex_lon = hwr.hex_lon
WHERE hwr.hex_lat IS NULL
AND h.total_orders > (SELECT PERCENTILE_CONT(0.75)
WITHIN GROUP (ORDER BY total_orders)
FROM hex_centers)
ORDER BY h.total_orders DESC;
FLOOR(lat / 0.0045) bins coordinates into ~500m grid cells (0.0045 degrees ≈ 500m). The simplified distance check ABS(lat1 - lat2) < 0.0135 approximates 1.5 km without the expensive SQRT haversine formula — acceptable for SQL interviews. The “high demand” threshold uses P75 to filter noise from low-volume hexagons.- How would you make this query work with actual hexagonal H3 grid IDs if Zomato stores them in the table?
- How would demand patterns differ between weekday and weekend for the same hexagon?
- If you had to recommend 5 new cloud kitchen locations, how would you use this analysis?
Build a 3-month rolling cohort retention table. For each monthly cohort (users whose first order was in that month), calculate what percentage are still ordering in months 1, 2, and 3 after their cohort month. Use CTEs and conditional aggregation.
Table:
- orders(order_id, user_id, order_date)
WITH first_order AS (
SELECT
user_id,
DATE_TRUNC('month', MIN(order_date)) AS cohort_month
FROM orders
GROUP BY user_id
),
user_activity AS (
SELECT
o.user_id,
f.cohort_month,
DATE_TRUNC('month', o.order_date) AS activity_month,
DATEDIFF('month', f.cohort_month,
DATE_TRUNC('month', o.order_date)) AS months_after
FROM orders o
JOIN first_order f ON o.user_id = f.user_id
WHERE DATEDIFF('month', f.cohort_month,
DATE_TRUNC('month', o.order_date)) BETWEEN 0 AND 3
),
cohort_sizes AS (
SELECT cohort_month, COUNT(DISTINCT user_id) AS cohort_size
FROM first_order
GROUP BY cohort_month
),
retention_counts AS (
SELECT
cohort_month,
COUNT(DISTINCT CASE WHEN months_after = 0 THEN user_id END) AS m0,
COUNT(DISTINCT CASE WHEN months_after = 1 THEN user_id END) AS m1,
COUNT(DISTINCT CASE WHEN months_after = 2 THEN user_id END) AS m2,
COUNT(DISTINCT CASE WHEN months_after = 3 THEN user_id END) AS m3
FROM user_activity
GROUP BY cohort_month
)
SELECT
r.cohort_month,
cs.cohort_size,
ROUND(100.0 * r.m0 / NULLIF(cs.cohort_size, 0), 1) AS month_0_pct,
ROUND(100.0 * r.m1 / NULLIF(cs.cohort_size, 0), 1) AS month_1_pct,
ROUND(100.0 * r.m2 / NULLIF(cs.cohort_size, 0), 1) AS month_2_pct,
ROUND(100.0 * r.m3 / NULLIF(cs.cohort_size, 0), 1) AS month_3_pct
FROM retention_counts r
JOIN cohort_sizes cs ON r.cohort_month = cs.cohort_month
ORDER BY r.cohort_month;
COUNT(DISTINCT user_id) inside CASE WHEN is the correct way to build a retention pivot in SQL — not a subquery for each month. Dividing by cohort_size (not by M0 count) gives absolute retention. M0 should always be 100% if the cohort definition is correct — if it’s not, there is a data quality issue.- How would you adjust this if you only want to count users who placed at least 2 orders in the activity month?
- Which cohort month shows the steepest drop-off from M0 to M1? How would you find it?
- How does this retention query change if the company defines “active” as ordering in any of the last 30 days rather than the calendar month?
Find: (a) Top 3 cuisines by order count per city, (b) Users who have ordered from 5 or more different cuisine types in the last 6 months (power users), (c) Cuisine pairs most often ordered together by the same user on the same day.
Tables:
- orders(order_id, user_id, restaurant_id, order_date)
- restaurants(restaurant_id, cuisine_type, city)
RANK() vs ROW_NUMBER() — you need RANK() or DENSE_RANK() to handle ties in top-3. Part (b) is straightforward COUNT(DISTINCT). Part (c) is a self-join on the same user + same day — a pattern that catches most candidates off guard. Producing cuisine pairs requires joining the table to itself.-- Part (a): Top 3 cuisines by order count per city
WITH city_cuisine_orders AS (
SELECT
r.city,
r.cuisine_type,
COUNT(o.order_id) AS order_count,
RANK() OVER (PARTITION BY r.city ORDER BY COUNT(o.order_id) DESC) AS rnk
FROM orders o
JOIN restaurants r ON o.restaurant_id = r.restaurant_id
GROUP BY r.city, r.cuisine_type
)
SELECT city, cuisine_type, order_count, rnk
FROM city_cuisine_orders
WHERE rnk <= 3
ORDER BY city, rnk;
-- Part (b): Power users — 5+ distinct cuisine types in last 6 months
SELECT
o.user_id,
COUNT(DISTINCT r.cuisine_type) AS cuisine_variety
FROM orders o
JOIN restaurants r ON o.restaurant_id = r.restaurant_id
WHERE o.order_date >= CURRENT_DATE - INTERVAL '6 months'
GROUP BY o.user_id
HAVING COUNT(DISTINCT r.cuisine_type) >= 5
ORDER BY cuisine_variety DESC;
-- Part (c): Most common cuisine pairs ordered by same user on same day
WITH user_day_cuisines AS (
SELECT
o.user_id,
o.order_date,
r.cuisine_type
FROM orders o
JOIN restaurants r ON o.restaurant_id = r.restaurant_id
),
cuisine_pairs AS (
SELECT
a.user_id,
a.cuisine_type AS cuisine_a,
b.cuisine_type AS cuisine_b
FROM user_day_cuisines a
JOIN user_day_cuisines b
ON a.user_id = b.user_id
AND a.order_date = b.order_date
AND a.cuisine_type < b.cuisine_type -- avoid duplicates, enforce ordering
)
SELECT
cuisine_a,
cuisine_b,
COUNT(*) AS pair_count
FROM cuisine_pairs
GROUP BY cuisine_a, cuisine_b
ORDER BY pair_count DESC
LIMIT 20;
cuisine_a < cuisine_b condition in the self-join is critical — it ensures each pair appears only once (e.g., “Biryani+Pizza” not also “Pizza+Biryani”) and eliminates rows where a cuisine is paired with itself. This is the standard pattern for market basket analysis in SQL without needing a pivot or array operations.- How would you find the cuisine that has the highest growth rate month-over-month in the last quarter?
- If Zomato wants to send personalized cuisine recommendations, how would you use this data to build a simple rule-based recommendation?
- What if a restaurant offers multiple cuisines — how would you update the schema and query?
Zomato wants to evaluate whether the Gold subscription program is driving incremental value. Compare Gold vs non-Gold users on: average orders per month, average GMV per order, discount utilization rate, and churn rate (no order in the last 60 days). Use window functions for monthly order calculations.
Tables:
- users(user_id, subscription_type, subscription_start, subscription_end)
- orders(order_id, user_id, order_date, gmv, discount)
WITH order_segment AS (
-- Tag each order with whether the user was Gold at time of order
SELECT
o.order_id,
o.user_id,
o.order_date,
o.gmv,
o.discount,
CASE
WHEN u.subscription_type = 'Gold'
AND o.order_date BETWEEN u.subscription_start
AND COALESCE(u.subscription_end, CURRENT_DATE)
THEN 'Gold'
ELSE 'Non-Gold'
END AS segment
FROM orders o
JOIN users u ON o.user_id = u.user_id
),
monthly_orders AS (
SELECT
user_id,
segment,
DATE_TRUNC('month', order_date) AS order_month,
COUNT(order_id) AS monthly_order_count
FROM order_segment
GROUP BY user_id, segment, order_month
),
avg_monthly AS (
SELECT
segment,
AVG(monthly_order_count) AS avg_orders_per_month
FROM monthly_orders
GROUP BY segment
),
overall_metrics AS (
SELECT
segment,
AVG(gmv) AS avg_gmv_per_order,
AVG(CASE WHEN discount > 0 THEN 1.0 ELSE 0 END) AS discount_utilization_rate
FROM order_segment
GROUP BY segment
),
churn AS (
SELECT
u.user_id,
CASE
WHEN u.subscription_type = 'Gold'
THEN 'Gold' ELSE 'Non-Gold'
END AS segment,
CASE
WHEN MAX(o.order_date) < CURRENT_DATE - INTERVAL '60 days'
OR MAX(o.order_date) IS NULL
THEN 1 ELSE 0
END AS is_churned
FROM users u
LEFT JOIN orders o ON u.user_id = o.user_id
GROUP BY u.user_id, segment
),
churn_rate AS (
SELECT segment, AVG(is_churned) AS churn_rate
FROM churn
GROUP BY segment
)
SELECT
am.segment,
ROUND(am.avg_orders_per_month, 2) AS avg_orders_per_month,
ROUND(om.avg_gmv_per_order, 2) AS avg_gmv_per_order,
ROUND(om.discount_utilization_rate * 100, 1) AS discount_util_pct,
ROUND(cr.churn_rate * 100, 1) AS churn_rate_pct
FROM avg_monthly am
JOIN overall_metrics om ON am.segment = om.segment
JOIN churn_rate cr ON am.segment = cr.segment;
order_date BETWEEN subscription_start AND subscription_end. This correctly classifies each order at the time it was placed. Using COALESCE(subscription_end, CURRENT_DATE) handles currently active subscribers who have no end date. Discount utilization is a rate (% of orders using discount), not total discount amount — define your metric before writing the query.- How would you calculate incremental GMV — the additional GMV Gold users generate versus a comparable non-Gold cohort?
- If Gold users also get free delivery, how would you adjust the discount utilization metric?
- How would you test whether the Gold program causes higher order frequency, or whether high-frequency users self-select into Gold?
Round 2 — Python & Pandas
| Questions | 3 |
| Duration | 30 minutes |
| Difficulty | Medium–Hard |
| Key Topics |
Data Cleaning Cohort Analysis Pandas Pivot Seaborn Heatmap Random Forest Feature Importance |
You receive a CSV of restaurant reviews with the following quality issues: duplicate review_id, rating outside [1, 5], review_text containing only emojis or single words, reviewer_id is NaN for 12% of rows, and created_at in mixed date formats (ISO 8601 and DD/MM/YYYY). Write a complete Python cleaning pipeline. Output a clean DataFrame and a quality report dict with counts of issues found and fixed.
pd.to_datetime(errors='coerce') to handle mixed formats, whether you use regex to detect emoji-only or single-word text, and whether you track issue counts in a structured way for auditability.import pandas as pd
import re
def clean_reviews(filepath: str) -> tuple[pd.DataFrame, dict]:
"""Load, audit, and clean restaurant reviews CSV."""
df = pd.read_csv(filepath)
original_len = len(df)
report = {}
# ── 1. Duplicate review_id ──────────────────────────
dupes = df.duplicated(subset='review_id', keep='first').sum()
df = df.drop_duplicates(subset='review_id', keep='first')
report['duplicate_review_ids_removed'] = int(dupes)
# ── 2. Rating out of range ──────────────────────────
invalid_rating = (~df['rating'].between(1, 5)).sum()
df = df[df['rating'].between(1, 5)].copy()
report['invalid_ratings_removed'] = int(invalid_rating)
# ── 3. Low-quality review text ──────────────────────
# Emoji-only: no ASCII letters; single-word: no whitespace after strip
EMOJI_PATTERN = re.compile(
r'^[\U00010000-\U0010ffff\U00002600-\U000027BF\U0001F300-\U0001F9FF\s]+$'
)
def is_low_quality(text) -> bool:
if pd.isna(text):
return True
text = str(text).strip()
if len(text.split()) <= 1:
return True
if EMOJI_PATTERN.match(text):
return True
return False
low_quality_mask = df['review_text'].apply(is_low_quality)
low_quality_count = low_quality_mask.sum()
df = df[~low_quality_mask].copy()
report['low_quality_reviews_removed'] = int(low_quality_count)
# ── 4. Missing reviewer_id ──────────────────────────
null_reviewer = df['reviewer_id'].isna().sum()
df = df.dropna(subset=['reviewer_id']).copy()
report['null_reviewer_ids_removed'] = int(null_reviewer)
# ── 5. Mixed date formats ────────────────────────────
def parse_date(val):
if pd.isna(val):
return pd.NaT
# Try ISO first, then DD/MM/YYYY
for fmt in ('%Y-%m-%dT%H:%M:%S', '%Y-%m-%d', '%d/%m/%Y'):
try:
return pd.to_datetime(val, format=fmt)
except (ValueError, TypeError):
pass
return pd.NaT
df['created_at'] = df['created_at'].apply(parse_date)
unparseable_dates = df['created_at'].isna().sum()
df = df.dropna(subset=['created_at']).copy()
report['unparseable_dates_removed'] = int(unparseable_dates)
report['original_rows'] = original_len
report['clean_rows'] = len(df)
report['total_rows_removed'] = original_len - len(df)
df = df.reset_index(drop=True)
return df, report
# Usage
clean_df, quality_report = clean_reviews('restaurant_reviews.csv')
print(quality_report)
dayfirst=True which can misparse ISO dates. The emoji regex targets Unicode blocks; do not rely on a simple isascii() check because many emoji fall in supplementary planes not captured by ASCII range alone.- How would you handle reviewer_id nulls by imputing from other reviews by the same device_id if that column exists?
- How would you make this pipeline idempotent so running it twice gives the same result?
- How would you unit-test the
is_low_qualityfunction with pytest?
Given a DataFrame df with columns [user_id, order_date, gmv], write Python code to: (a) assign cohort month (user’s first order month), (b) calculate cohort age (months since cohort month), (c) build a pivot table — rows = cohort_month, cols = cohort_age, values = retention_rate, and (d) plot a heatmap where color intensity represents retention rate. Use only pandas and matplotlib/seaborn.
groupby + transform('min') to assign cohort months efficiently (not a slow apply loop), can compute period differences using pd.Period arithmetic, and build a correct pivot table. The heatmap is a bonus — most candidates forget to annotate cells with retention percentages.import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Ensure order_date is datetime
df['order_date'] = pd.to_datetime(df['order_date'])
# ── (a) Cohort month: user's first order month ──────────
df['order_period'] = df['order_date'].dt.to_period('M')
df['cohort_month'] = (
df.groupby('user_id')['order_period']
.transform('min')
)
# ── (b) Cohort age: months since cohort month ───────────
df['cohort_age'] = (df['order_period'] - df['cohort_month']).apply(lambda x: x.n)
# ── (c) Cohort sizes and retention pivot ────────────────
cohort_sizes = (
df[df['cohort_age'] == 0]
.groupby('cohort_month')['user_id']
.nunique()
.rename('cohort_size')
)
retention_counts = (
df.groupby(['cohort_month', 'cohort_age'])['user_id']
.nunique()
.reset_index()
.rename(columns={'user_id': 'active_users'})
)
retention_counts = retention_counts.merge(cohort_sizes, on='cohort_month')
retention_counts['retention_rate'] = (
retention_counts['active_users'] / retention_counts['cohort_size'] * 100
).round(1)
cohort_pivot = retention_counts.pivot_table(
index='cohort_month',
columns='cohort_age',
values='retention_rate'
)
cohort_pivot.index = cohort_pivot.index.astype(str)
# ── (d) Heatmap ─────────────────────────────────────────
fig, ax = plt.subplots(figsize=(12, 6))
sns.heatmap(
cohort_pivot,
annot=True,
fmt='.1f',
cmap='YlOrRd_r',
linewidths=0.5,
linecolor='#f1f5f9',
cbar_kws={'label': 'Retention Rate (%)'},
ax=ax
)
ax.set_title('Zomato User Cohort Retention (Monthly)', fontsize=14, fontweight='bold', pad=15)
ax.set_xlabel('Months Since First Order', fontsize=11)
ax.set_ylabel('Cohort Month', fontsize=11)
plt.tight_layout()
plt.savefig('cohort_retention_heatmap.png', dpi=150, bbox_inches='tight')
plt.show()
print(cohort_pivot)
dt.to_period('M') converts dates to Period objects that support month subtraction natively — (period_a - period_b).n gives integer months between them. This avoids error-prone manual date math. transform('min') broadcasts the cohort month back to each row without a separate merge step, keeping the code concise.- How would you add a GMV-weighted retention metric alongside the user count retention?
- How would you identify cohorts that have significantly better M3 retention than average?
- If you wanted to show week-over-week retention for the first 8 weeks, what would change in the code?
Build a restaurant rating prediction model. Given features: [cuisine_type, avg_prep_time, num_items_on_menu, city_tier, avg_order_value, has_veg_only, weekend_rush_ratio], predict avg rating (1–5). Write Python code for: feature encoding, train-test split, model training using Random Forest, evaluation with MAE and MAPE, and identify the most important feature.
feature_importances_ correctly. They may ask why you chose Random Forest over linear regression.import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_absolute_error
# ── Feature engineering ──────────────────────────────────
# df columns: cuisine_type, avg_prep_time, num_items_on_menu,
# city_tier, avg_order_value, has_veg_only,
# weekend_rush_ratio, avg_rating (target)
CATEGORICAL_COLS = ['cuisine_type', 'city_tier']
BOOLEAN_COLS = ['has_veg_only']
NUMERIC_COLS = ['avg_prep_time', 'num_items_on_menu',
'avg_order_value', 'weekend_rush_ratio']
TARGET = 'avg_rating'
df_model = df.copy()
# Encode categoricals with LabelEncoder
le = {}
for col in CATEGORICAL_COLS:
le[col] = LabelEncoder()
df_model[col] = le[col].fit_transform(df_model[col].astype(str))
# Boolean: convert True/False to 1/0
for col in BOOLEAN_COLS:
df_model[col] = df_model[col].astype(int)
FEATURES = CATEGORICAL_COLS + BOOLEAN_COLS + NUMERIC_COLS
X = df_model[FEATURES]
y = df_model[TARGET]
# ── Train-test split ─────────────────────────────────────
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# ── Model training ───────────────────────────────────────
# Random Forest — no feature scaling needed
rf = RandomForestRegressor(
n_estimators=200,
max_depth=8,
min_samples_leaf=5,
random_state=42,
n_jobs=-1
)
rf.fit(X_train, y_train)
# ── Evaluation ───────────────────────────────────────────
y_pred = rf.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
# MAPE: avoid division by zero (ratings are always >= 1)
mape = np.mean(np.abs((y_test - y_pred) / y_test)) * 100
print(f"MAE : {mae:.3f}")
print(f"MAPE : {mape:.2f}%")
# ── Feature importance ───────────────────────────────────
importance_df = pd.DataFrame({
'feature' : FEATURES,
'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)
print("\nFeature Importances:")
print(importance_df.to_string(index=False))
top_feature = importance_df.iloc[0]['feature']
print(f"\nMost important feature: {top_feature}")
- Why did you choose Random Forest over linear regression or XGBoost for this task?
- How would you handle a restaurant that just launched with zero historical orders — its features would all be null or zero?
- How would you serve this model in production so Zomato’s backend can call it in real time?
Round 3 — Business Case Study
| Questions | 2 |
| Duration | 20 minutes |
| Difficulty | Medium |
| Key Topics |
Metric Definition Root Cause Analysis Hypothesis Testing Business Strategy Data Storytelling |
Zomato is considering launching cloud kitchens (delivery-only restaurants) in Pune. What data analysis would you do to: (a) select which 5 cuisines to launch, (b) identify the top 3 neighborhoods, and (c) set pricing? Define your data sources and analysis methodology.
Part (a) — Cuisine Selection:
I would analyze demand-supply gap rather than pure demand volume. The key data points needed are:
- Demand signal: Search queries and failed-search events in Pune for each cuisine — high search volume with low order conversion = unmet demand
- Current supply: Count of restaurants per cuisine type in Pune using Zomato’s restaurant database
- Order velocity: Orders per restaurant per day for each cuisine — cuisines with high velocity but few restaurants indicate underserved demand
- Margin profile: Average GMV and estimated food cost per cuisine (use historical discount patterns and food category benchmarks)
Decision criteria: rank cuisines by a composite score = (demand volume × 0.35) + (supply gap ratio × 0.40) + (margin potential × 0.25). Launch the top 5. Typically, Biryani, North Indian, Chinese, South Indian, and Desserts perform well in Tier-1 Indian cities.
Part (b) — Neighborhood Selection:
- Use the hyperlocal demand model (hexagon grid from Q2) to identify the top demand hexagons in Pune with low restaurant density
- Overlay with demographic data: office clusters, PG/hostel zones, and residential areas with high young-professional population have the highest delivery frequency
- Consider delivery time isochrones: select neighborhoods where a cloud kitchen can serve a 3 km radius within 30 minutes, covering maximum hexagons per kitchen
Top 3 target neighborhoods in Pune: Hinjewadi IT corridor (office demand), Viman Nagar (young professionals, dense), Kothrud (large residential, underserved by current supply).
Part (c) — Pricing:
- Baseline: Competitive pricing — benchmark against top-3 restaurants of the same cuisine in the same neighborhood (Zomato has this data internally)
- Adjustment: Since cloud kitchens have lower overhead (no dine-in), price 5–10% below market average to drive initial trial orders
- Dynamic pricing: Increase prices by 15–20% during peak demand slots (6–9 PM Friday–Sunday) where price elasticity is lower because users have fewer alternatives
- Minimum order value: Set at ₹199–₹249 to ensure delivery economics are positive; communicate clearly to avoid cart abandonment
- How would you measure whether the cloud kitchen launch was successful — what are your North Star and guardrail metrics?
- If the first cloud kitchen in Hinjewadi underperforms after 3 months, what’s your diagnostic framework?
- How would you decide when to add a second cuisine to an existing cloud kitchen location?
A cohort of users who joined 6 months ago is placing 40% fewer orders than the same cohort from 12 months ago did at the same stage of their lifecycle. Walk through your diagnostic framework: 5 hypotheses you would test, the queries you would run, and the interventions you would recommend.
I would structure the diagnosis across 5 mutually exclusive hypotheses:
H1 — Cohort Quality Degradation: The newer cohort was acquired via lower-quality channels (discount-driven, not intent-driven) and was never genuinely engaged. Query: compare the acquisition channel distribution of the two cohorts and their Day-7 repeat rate. Intervention: tighten acquisition targeting and reduce deep-discount new-user offers.
H2 — Increased Competition: Swiggy or a local player ran an aggressive promotional campaign targeting the same users during the same period. Query: segment the newer cohort by city and check if the drop is concentrated in cities where Swiggy increased restaurant coverage. Intervention: defensive promotions in at-risk cities, exclusive restaurant partnerships.
H3 — Product or UX Degradation: A product change (menu redesign, search algorithm update, delivery ETA display) negatively impacted the experience for new users. Query: compare app session-to-order conversion rate for both cohorts at the same cohort age, and check for any feature rollouts in the 6-month period. Intervention: rollback or A/B test the suspect feature.
H4 — Restaurant Supply Contraction: Fewer restaurants are accepting orders in the areas where newer cohort users live, leading to more failed sessions. Query: for each cohort, calculate the average number of available restaurants per user session and the failed-search rate. Intervention: improve restaurant onboarding in high-demand zones, launch cloud kitchens.
H5 — External / Seasonality Effect: The 6-month-ago cohort joined at a different season (e.g., post-festive January vs peak-season July) and the absolute period-over-period comparison is not lifecycle-adjusted. Query: normalize both cohorts to the same calendar period and re-run the comparison — if the gap shrinks, it is seasonality, not churn. Intervention: adjust cohort comparison methodology; use seasonally adjusted benchmarks going forward.
- If you could only run one query to triage this problem in 30 minutes, what would it be?
- How would you design an A/B test to measure the impact of a re-engagement campaign on this cohort?
- How do you separate the effect of acquisition channel quality from genuine lifecycle disengagement?
4-Week Zomato DA Interview Prep Plan
Build Foundation
Revise SQL fundamentals — GROUP BY, HAVING, JOINs, subqueries. Practice Python pandas — groupby, merge, pivot_table. Study case study frameworks: MECE, metric trees, funnel analysis. Read The Data Monk’s food-tech and product analytics interview pattern posts.
2200 Most Asked Analytics Interview Questions
Work through the SQL and Python sections — both are directly tested at Zomato. Focus especially on cohort analysis, window functions, and data cleaning chapters. Complete all 200+ SQL questions in the analytics section.
topmate.io/nitin_kamal
Ace Any SQL Interview
Work through all 220+ questions in the Ace Any SQL ebook. Cover window functions, CTEs, performance tuning, and data model design. Practice re-writing every solution from scratch without looking at the answer — timed at 12 minutes per question.
topmate.io/nitin_kamal
Revise + Timed Practice
Revise the first 5 chapters of the 2200 ebook. Do daily timed practice: 15 minutes per SQL question, 20 minutes per case study. Solve all 10 questions in this guide from memory. Practice explaining your thought process out loud.
topmate.io/nitin_kamal