Home Case Study Zomato Data Analyst Interview Questions and Answers 2026
Case Study Company Interview Questions Data Analyst Interview Questions

Zomato Data Analyst Interview Questions and Answers 2026

Preparing for the Zomato Data Analyst interview in 2026? This guide covers the most commonly asked Zomato Data Analyst interview questions across all three rounds — SQL and data analysis, Python and pandas, and business case study. Every question includes a complete model answer, what the interviewer is actually testing, and targeted follow-up questions that come up in real Zomato interviews. Whether you are a fresher or have 1–3 years of experience, this is the most complete prep resource for the Zomato DA role.

Zomato
Data Analyst
Restaurant Analytics / Growth / Supply
SQL
Python
Pandas
Case Study
Cohort Analysis
Tableau / BI
Salary Range
₹8–14 LPA
1–3 Years Exp

Zomato Data Analyst — Interview Process
3 rounds · ~90 minutes total
1
SQL & Data Analysis
  • Restaurant health scoring
  • Hyperlocal demand estimation
  • Cohort retention analysis
  • Cuisine preference analytics
2
Python & Pandas
  • Data cleaning pipelines
  • Cohort retention in Python
  • ML feature engineering
  • Rating prediction model
3
Business Case Study
  • Cloud kitchen expansion
  • Declining order frequency
  • Hypothesis-driven thinking
  • Metric definition

Recommended Resources

Two books that cover everything tested in the Zomato DA interview
BESTSELLER

2200 Most Asked Analytics Interview Questions
  • 2,200+ questions · 23 topics
  • SQL, Python, product analytics, case studies
  • Company-wise question banks included
For: those who want to master all analytics interview topics
₹1,999 ₹7,999

Buy Now

SQL FOCUSED

Ace Any SQL Interview
  • 220+ questions · SQL advanced, data models, DBMS
  • Window functions, CTEs, performance tuning
  • Zomato-style restaurant data SQL problems
For: anyone who wants to crack any SQL round (fresher to advanced)
₹799

Buy Now

Round 1 — SQL & Data Analysis

SQL Coding Round (Restaurant & Food Delivery Focus)
Zomato SQL questions are always grounded in food delivery operations — restaurants, orders, users, ratings, geolocation, and subscriptions. The interviewer expects you to think about data quality (cancelled orders, NULLs), use CTEs for readability, and reason about business trade-offs. Window functions and conditional aggregation are heavily tested.
Questions 5
Duration 40 minutes
Difficulty Medium–Hard
Key Topics CTEs
Window Functions
Cohort Analysis
Conditional Aggregation
Geospatial Logic
Subscription Analytics

Q1 — Restaurant Health Score
Medium⏱ 12 min

The supply quality team wants a composite health score for every restaurant to identify at-risk partners early. Build a restaurant health score combining: order acceptance rate (40%), average rating (30%), and on-time preparation rate (30%) where prep_time_mins < 20 is considered on time. Rank restaurants by health score within each city.

Table:

  • orders(order_id, restaurant_id, user_id, order_date, status, rating, prep_time_mins) — status ∈ {placed, delivered, cancelled}
  • restaurants(restaurant_id, name, city)
🎯 What the interviewer tests
Do you know to calculate acceptance rate as delivered / placed (not delivered / all)? Many candidates mix up the denominator. They also test whether you use NULLIF for division-by-zero safety and whether you can compose a weighted score using ROUND and window functions for ranking within city partitions.

WITH restaurant_stats AS (
    SELECT
        o.restaurant_id,
        r.city,
        COUNT(CASE WHEN o.status = 'placed' THEN 1 END)     AS total_placed,
        COUNT(CASE WHEN o.status = 'delivered' THEN 1 END)  AS total_delivered,
        AVG(CASE WHEN o.status = 'delivered' THEN o.rating END) AS avg_rating,
        COUNT(CASE WHEN o.status = 'delivered'
                    AND o.prep_time_mins < 20 THEN 1 END)   AS on_time_count
    FROM orders o
    JOIN restaurants r ON o.restaurant_id = r.restaurant_id
    GROUP BY o.restaurant_id, r.city
),
health_scores AS (
    SELECT
        restaurant_id,
        city,
        total_delivered,
        avg_rating,
        ROUND(
            0.40 * (100.0 * total_delivered / NULLIF(total_placed, 0))
          + 0.30 * (COALESCE(avg_rating, 0) * 20)
          + 0.30 * (100.0 * on_time_count / NULLIF(total_delivered, 0)),
        2) AS health_score
    FROM restaurant_stats
)
SELECT
    restaurant_id,
    city,
    total_delivered,
    ROUND(avg_rating, 2)  AS avg_rating,
    health_score,
    RANK() OVER (PARTITION BY city ORDER BY health_score DESC) AS city_rank
FROM health_scores
ORDER BY city, city_rank;
💡 Key insight
The rating is on a 1–5 scale but the other two components are percentages (0–100). Multiply rating by 20 to normalize it to a 0–100 scale before applying the 30% weight — otherwise the score would be biased toward the percentage components. Use COALESCE(avg_rating, 0) for new restaurants with no delivered orders yet.
🔁 Common Follow-Up Questions
  • How would you add a time-decay factor so recent orders matter more than orders from 6 months ago?
  • Which restaurants are in the bottom 10% of health score in each city? Write that query.
  • If a restaurant has only 5 total orders, should its score be treated the same as one with 500 orders?


Q2 — Hyperlocal Demand Estimation
Hard⏱ 15 min

Zomato divides cities into 500m × 500m hexagons for supply planning. Using the orders table, calculate: (a) orders per hexagon per hour-of-week slot (168 slots = 24 hours × 7 days), (b) identify the top 20 hexagons by evening peak demand (6–9 PM), (c) find hexagons with high demand but zero restaurant coverage — defined as no restaurant within 1.5 km using a simplified flat-earth distance formula.

Tables:

  • orders(order_id, user_id, lat, long, restaurant_id, order_datetime, gmv)
  • restaurants(restaurant_id, name, lat, long)
🎯 What the interviewer tests
This question tests geospatial thinking and ability to bin continuous coordinates into grid cells. They want to see if you know the hex-bin formula (FLOOR(lat / 0.0045) for ~500m), can compute 168 hour-of-week slots, and can write a cross-join distance filter. Very few candidates get part (c) without a hint.

-- Part (a): Orders per hexagon per hour-of-week slot
WITH hexagon_demand AS (
    SELECT
        FLOOR(lat  / 0.0045) AS hex_lat,
        FLOOR(long / 0.0045) AS hex_lon,
        EXTRACT(DOW  FROM order_datetime) AS day_of_week,   -- 0=Sun
        EXTRACT(HOUR FROM order_datetime) AS hour_of_day,
        (EXTRACT(DOW FROM order_datetime) * 24
          + EXTRACT(HOUR FROM order_datetime))::INT         AS hour_of_week_slot,
        COUNT(order_id)  AS order_count,
        SUM(gmv)         AS total_gmv
    FROM orders
    GROUP BY hex_lat, hex_lon, day_of_week, hour_of_day, hour_of_week_slot
),
-- Part (b): Top 20 hexagons by evening peak demand (18:00–20:59)
evening_peak AS (
    SELECT
        hex_lat,
        hex_lon,
        SUM(order_count) AS evening_orders
    FROM hexagon_demand
    WHERE hour_of_day BETWEEN 18 AND 20
    GROUP BY hex_lat, hex_lon
    ORDER BY evening_orders DESC
    LIMIT 20
),
-- Part (c): High-demand hexagons with no restaurant within 1.5 km
-- 1 degree lat ~ 111 km, 1 degree lon ~ 111 km * cos(lat)
-- 1.5 km ~ 0.0135 degrees
hex_centers AS (
    SELECT
        hex_lat,
        hex_lon,
        (hex_lat + 0.5) * 0.0045 AS center_lat,
        (hex_lon + 0.5) * 0.0045 AS center_lon,
        SUM(order_count)          AS total_orders
    FROM hexagon_demand
    GROUP BY hex_lat, hex_lon
),
hex_with_restaurant AS (
    SELECT DISTINCT h.hex_lat, h.hex_lon
    FROM hex_centers h
    JOIN restaurants r
      ON ABS(h.center_lat - r.lat) < 0.0135
     AND ABS(h.center_lon - r.long) < 0.0135
)
SELECT
    h.hex_lat,
    h.hex_lon,
    h.total_orders,
    h.center_lat,
    h.center_lon
FROM hex_centers h
LEFT JOIN hex_with_restaurant hwr
  ON h.hex_lat = hwr.hex_lat AND h.hex_lon = hwr.hex_lon
WHERE hwr.hex_lat IS NULL
  AND h.total_orders > (SELECT PERCENTILE_CONT(0.75)
                          WITHIN GROUP (ORDER BY total_orders)
                        FROM hex_centers)
ORDER BY h.total_orders DESC;
💡 Key insight
FLOOR(lat / 0.0045) bins coordinates into ~500m grid cells (0.0045 degrees ≈ 500m). The simplified distance check ABS(lat1 - lat2) < 0.0135 approximates 1.5 km without the expensive SQRT haversine formula — acceptable for SQL interviews. The “high demand” threshold uses P75 to filter noise from low-volume hexagons.
🔁 Common Follow-Up Questions
  • How would you make this query work with actual hexagonal H3 grid IDs if Zomato stores them in the table?
  • How would demand patterns differ between weekday and weekend for the same hexagon?
  • If you had to recommend 5 new cloud kitchen locations, how would you use this analysis?


Q3 — Cohort Retention Analysis
Medium⏱ 12 min

Build a 3-month rolling cohort retention table. For each monthly cohort (users whose first order was in that month), calculate what percentage are still ordering in months 1, 2, and 3 after their cohort month. Use CTEs and conditional aggregation.

Table:

  • orders(order_id, user_id, order_date)
🎯 What the interviewer tests
Cohort retention is the single most-asked metric question at food delivery companies. The interviewer checks if you can correctly define the cohort month as the first order month (not account creation), handle the division accurately, and write a clean pivot without a library. Many candidates confuse months since cohort with calendar months.

WITH first_order AS (
    SELECT
        user_id,
        DATE_TRUNC('month', MIN(order_date)) AS cohort_month
    FROM orders
    GROUP BY user_id
),
user_activity AS (
    SELECT
        o.user_id,
        f.cohort_month,
        DATE_TRUNC('month', o.order_date) AS activity_month,
        DATEDIFF('month', f.cohort_month,
                 DATE_TRUNC('month', o.order_date)) AS months_after
    FROM orders o
    JOIN first_order f ON o.user_id = f.user_id
    WHERE DATEDIFF('month', f.cohort_month,
                  DATE_TRUNC('month', o.order_date)) BETWEEN 0 AND 3
),
cohort_sizes AS (
    SELECT cohort_month, COUNT(DISTINCT user_id) AS cohort_size
    FROM first_order
    GROUP BY cohort_month
),
retention_counts AS (
    SELECT
        cohort_month,
        COUNT(DISTINCT CASE WHEN months_after = 0 THEN user_id END) AS m0,
        COUNT(DISTINCT CASE WHEN months_after = 1 THEN user_id END) AS m1,
        COUNT(DISTINCT CASE WHEN months_after = 2 THEN user_id END) AS m2,
        COUNT(DISTINCT CASE WHEN months_after = 3 THEN user_id END) AS m3
    FROM user_activity
    GROUP BY cohort_month
)
SELECT
    r.cohort_month,
    cs.cohort_size,
    ROUND(100.0 * r.m0 / NULLIF(cs.cohort_size, 0), 1) AS month_0_pct,
    ROUND(100.0 * r.m1 / NULLIF(cs.cohort_size, 0), 1) AS month_1_pct,
    ROUND(100.0 * r.m2 / NULLIF(cs.cohort_size, 0), 1) AS month_2_pct,
    ROUND(100.0 * r.m3 / NULLIF(cs.cohort_size, 0), 1) AS month_3_pct
FROM retention_counts r
JOIN cohort_sizes cs ON r.cohort_month = cs.cohort_month
ORDER BY r.cohort_month;
💡 Key insight
COUNT(DISTINCT user_id) inside CASE WHEN is the correct way to build a retention pivot in SQL — not a subquery for each month. Dividing by cohort_size (not by M0 count) gives absolute retention. M0 should always be 100% if the cohort definition is correct — if it’s not, there is a data quality issue.
🔁 Common Follow-Up Questions
  • How would you adjust this if you only want to count users who placed at least 2 orders in the activity month?
  • Which cohort month shows the steepest drop-off from M0 to M1? How would you find it?
  • How does this retention query change if the company defines “active” as ordering in any of the last 30 days rather than the calendar month?


Q4 — Cuisine Preference Analytics
Medium⏱ 10 min

Find: (a) Top 3 cuisines by order count per city, (b) Users who have ordered from 5 or more different cuisine types in the last 6 months (power users), (c) Cuisine pairs most often ordered together by the same user on the same day.

Tables:

  • orders(order_id, user_id, restaurant_id, order_date)
  • restaurants(restaurant_id, cuisine_type, city)
🎯 What the interviewer tests
Part (a) tests RANK() vs ROW_NUMBER() — you need RANK() or DENSE_RANK() to handle ties in top-3. Part (b) is straightforward COUNT(DISTINCT). Part (c) is a self-join on the same user + same day — a pattern that catches most candidates off guard. Producing cuisine pairs requires joining the table to itself.

-- Part (a): Top 3 cuisines by order count per city
WITH city_cuisine_orders AS (
    SELECT
        r.city,
        r.cuisine_type,
        COUNT(o.order_id) AS order_count,
        RANK() OVER (PARTITION BY r.city ORDER BY COUNT(o.order_id) DESC) AS rnk
    FROM orders o
    JOIN restaurants r ON o.restaurant_id = r.restaurant_id
    GROUP BY r.city, r.cuisine_type
)
SELECT city, cuisine_type, order_count, rnk
FROM city_cuisine_orders
WHERE rnk <= 3
ORDER BY city, rnk;

-- Part (b): Power users — 5+ distinct cuisine types in last 6 months
SELECT
    o.user_id,
    COUNT(DISTINCT r.cuisine_type) AS cuisine_variety
FROM orders o
JOIN restaurants r ON o.restaurant_id = r.restaurant_id
WHERE o.order_date >= CURRENT_DATE - INTERVAL '6 months'
GROUP BY o.user_id
HAVING COUNT(DISTINCT r.cuisine_type) >= 5
ORDER BY cuisine_variety DESC;

-- Part (c): Most common cuisine pairs ordered by same user on same day
WITH user_day_cuisines AS (
    SELECT
        o.user_id,
        o.order_date,
        r.cuisine_type
    FROM orders o
    JOIN restaurants r ON o.restaurant_id = r.restaurant_id
),
cuisine_pairs AS (
    SELECT
        a.user_id,
        a.cuisine_type AS cuisine_a,
        b.cuisine_type AS cuisine_b
    FROM user_day_cuisines a
    JOIN user_day_cuisines b
      ON a.user_id    = b.user_id
     AND a.order_date = b.order_date
     AND a.cuisine_type < b.cuisine_type   -- avoid duplicates, enforce ordering
)
SELECT
    cuisine_a,
    cuisine_b,
    COUNT(*) AS pair_count
FROM cuisine_pairs
GROUP BY cuisine_a, cuisine_b
ORDER BY pair_count DESC
LIMIT 20;
💡 Key insight
The cuisine_a < cuisine_b condition in the self-join is critical — it ensures each pair appears only once (e.g., “Biryani+Pizza” not also “Pizza+Biryani”) and eliminates rows where a cuisine is paired with itself. This is the standard pattern for market basket analysis in SQL without needing a pivot or array operations.
🔁 Common Follow-Up Questions
  • How would you find the cuisine that has the highest growth rate month-over-month in the last quarter?
  • If Zomato wants to send personalized cuisine recommendations, how would you use this data to build a simple rule-based recommendation?
  • What if a restaurant offers multiple cuisines — how would you update the schema and query?


Q5 — Gold/Pro Subscriber Analysis
Hard⏱ 14 min

Zomato wants to evaluate whether the Gold subscription program is driving incremental value. Compare Gold vs non-Gold users on: average orders per month, average GMV per order, discount utilization rate, and churn rate (no order in the last 60 days). Use window functions for monthly order calculations.

Tables:

  • users(user_id, subscription_type, subscription_start, subscription_end)
  • orders(order_id, user_id, order_date, gmv, discount)
🎯 What the interviewer tests
This is the hardest SQL question in the round. The challenge is that “Gold” is not a static flag — users can subscribe and unsubscribe. You need to join orders to users and check whether the user was a Gold subscriber at the time of the order. Many candidates incorrectly use the current subscription_type and over-count Gold orders.

WITH order_segment AS (
    -- Tag each order with whether the user was Gold at time of order
    SELECT
        o.order_id,
        o.user_id,
        o.order_date,
        o.gmv,
        o.discount,
        CASE
            WHEN u.subscription_type = 'Gold'
             AND o.order_date BETWEEN u.subscription_start
                                  AND COALESCE(u.subscription_end, CURRENT_DATE)
            THEN 'Gold'
            ELSE 'Non-Gold'
        END AS segment
    FROM orders o
    JOIN users u ON o.user_id = u.user_id
),
monthly_orders AS (
    SELECT
        user_id,
        segment,
        DATE_TRUNC('month', order_date) AS order_month,
        COUNT(order_id)                 AS monthly_order_count
    FROM order_segment
    GROUP BY user_id, segment, order_month
),
avg_monthly AS (
    SELECT
        segment,
        AVG(monthly_order_count) AS avg_orders_per_month
    FROM monthly_orders
    GROUP BY segment
),
overall_metrics AS (
    SELECT
        segment,
        AVG(gmv)                                          AS avg_gmv_per_order,
        AVG(CASE WHEN discount > 0 THEN 1.0 ELSE 0 END)  AS discount_utilization_rate
    FROM order_segment
    GROUP BY segment
),
churn AS (
    SELECT
        u.user_id,
        CASE
            WHEN u.subscription_type = 'Gold'
            THEN 'Gold' ELSE 'Non-Gold'
        END AS segment,
        CASE
            WHEN MAX(o.order_date) < CURRENT_DATE - INTERVAL '60 days'
              OR MAX(o.order_date) IS NULL
            THEN 1 ELSE 0
        END AS is_churned
    FROM users u
    LEFT JOIN orders o ON u.user_id = o.user_id
    GROUP BY u.user_id, segment
),
churn_rate AS (
    SELECT segment, AVG(is_churned) AS churn_rate
    FROM churn
    GROUP BY segment
)
SELECT
    am.segment,
    ROUND(am.avg_orders_per_month, 2)            AS avg_orders_per_month,
    ROUND(om.avg_gmv_per_order, 2)               AS avg_gmv_per_order,
    ROUND(om.discount_utilization_rate * 100, 1) AS discount_util_pct,
    ROUND(cr.churn_rate * 100, 1)                AS churn_rate_pct
FROM avg_monthly am
JOIN overall_metrics om ON am.segment = om.segment
JOIN churn_rate cr      ON am.segment = cr.segment;
💡 Key insight
The key insight is the time-based Gold check: order_date BETWEEN subscription_start AND subscription_end. This correctly classifies each order at the time it was placed. Using COALESCE(subscription_end, CURRENT_DATE) handles currently active subscribers who have no end date. Discount utilization is a rate (% of orders using discount), not total discount amount — define your metric before writing the query.
🔁 Common Follow-Up Questions
  • How would you calculate incremental GMV — the additional GMV Gold users generate versus a comparable non-Gold cohort?
  • If Gold users also get free delivery, how would you adjust the discount utilization metric?
  • How would you test whether the Gold program causes higher order frequency, or whether high-frequency users self-select into Gold?


Round 2 — Python & Pandas

Python Coding Round (Data Cleaning + Analysis + ML)
Zomato Python questions test real-world data engineering skills — you are expected to write clean, production-ready pandas code with proper handling of missing values, mixed types, and edge cases. The interviewer evaluates code readability alongside correctness. You should also be able to explain every pandas method you use.
Questions 3
Duration 30 minutes
Difficulty Medium–Hard
Key Topics Data Cleaning
Cohort Analysis
Pandas Pivot
Seaborn Heatmap
Random Forest
Feature Importance

Q6 — Pandas Data Cleaning Pipeline
Medium⏱ 12 min

You receive a CSV of restaurant reviews with the following quality issues: duplicate review_id, rating outside [1, 5], review_text containing only emojis or single words, reviewer_id is NaN for 12% of rows, and created_at in mixed date formats (ISO 8601 and DD/MM/YYYY). Write a complete Python cleaning pipeline. Output a clean DataFrame and a quality report dict with counts of issues found and fixed.

🎯 What the interviewer tests
The interviewer wants to see a systematic, non-destructive approach — detect issues first, then fix. They check whether you use pd.to_datetime(errors='coerce') to handle mixed formats, whether you use regex to detect emoji-only or single-word text, and whether you track issue counts in a structured way for auditability.

import pandas as pd
import re

def clean_reviews(filepath: str) -> tuple[pd.DataFrame, dict]:
    """Load, audit, and clean restaurant reviews CSV."""
    df = pd.read_csv(filepath)
    original_len = len(df)
    report = {}
    # ── 1. Duplicate review_id ──────────────────────────
    dupes = df.duplicated(subset='review_id', keep='first').sum()
    df = df.drop_duplicates(subset='review_id', keep='first')
    report['duplicate_review_ids_removed'] = int(dupes)
    # ── 2. Rating out of range ──────────────────────────
    invalid_rating = (~df['rating'].between(1, 5)).sum()
    df = df[df['rating'].between(1, 5)].copy()
    report['invalid_ratings_removed'] = int(invalid_rating)
    # ── 3. Low-quality review text ──────────────────────
    # Emoji-only: no ASCII letters; single-word: no whitespace after strip
    EMOJI_PATTERN = re.compile(
        r'^[\U00010000-\U0010ffff\U00002600-\U000027BF\U0001F300-\U0001F9FF\s]+$'
    )
    def is_low_quality(text) -> bool:
        if pd.isna(text):
            return True
        text = str(text).strip()
        if len(text.split()) <= 1:
            return True
        if EMOJI_PATTERN.match(text):
            return True
        return False
    low_quality_mask = df['review_text'].apply(is_low_quality)
    low_quality_count = low_quality_mask.sum()
    df = df[~low_quality_mask].copy()
    report['low_quality_reviews_removed'] = int(low_quality_count)
    # ── 4. Missing reviewer_id ──────────────────────────
    null_reviewer = df['reviewer_id'].isna().sum()
    df = df.dropna(subset=['reviewer_id']).copy()
    report['null_reviewer_ids_removed'] = int(null_reviewer)
    # ── 5. Mixed date formats ────────────────────────────
    def parse_date(val):
        if pd.isna(val):
            return pd.NaT
        # Try ISO first, then DD/MM/YYYY
        for fmt in ('%Y-%m-%dT%H:%M:%S', '%Y-%m-%d', '%d/%m/%Y'):
            try:
                return pd.to_datetime(val, format=fmt)
            except (ValueError, TypeError):
                pass
        return pd.NaT
    df['created_at'] = df['created_at'].apply(parse_date)
    unparseable_dates = df['created_at'].isna().sum()
    df = df.dropna(subset=['created_at']).copy()
    report['unparseable_dates_removed'] = int(unparseable_dates)
    report['original_rows'] = original_len
    report['clean_rows'] = len(df)
    report['total_rows_removed'] = original_len - len(df)
    df = df.reset_index(drop=True)
    return df, report

# Usage
clean_df, quality_report = clean_reviews('restaurant_reviews.csv')
print(quality_report)
💡 Key insight
Always audit before cleaning — track how many rows each rule removes so you can justify every decision. Use a format-loop for date parsing instead of dayfirst=True which can misparse ISO dates. The emoji regex targets Unicode blocks; do not rely on a simple isascii() check because many emoji fall in supplementary planes not captured by ASCII range alone.
🔁 Common Follow-Up Questions
  • How would you handle reviewer_id nulls by imputing from other reviews by the same device_id if that column exists?
  • How would you make this pipeline idempotent so running it twice gives the same result?
  • How would you unit-test the is_low_quality function with pytest?


Q7 — Cohort Retention in Python
Medium⏱ 12 min

Given a DataFrame df with columns [user_id, order_date, gmv], write Python code to: (a) assign cohort month (user’s first order month), (b) calculate cohort age (months since cohort month), (c) build a pivot table — rows = cohort_month, cols = cohort_age, values = retention_rate, and (d) plot a heatmap where color intensity represents retention rate. Use only pandas and matplotlib/seaborn.

🎯 What the interviewer tests
This is the Python equivalent of Q3. They check if you use groupby + transform('min') to assign cohort months efficiently (not a slow apply loop), can compute period differences using pd.Period arithmetic, and build a correct pivot table. The heatmap is a bonus — most candidates forget to annotate cells with retention percentages.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Ensure order_date is datetime
df['order_date'] = pd.to_datetime(df['order_date'])
# ── (a) Cohort month: user's first order month ──────────
df['order_period'] = df['order_date'].dt.to_period('M')
df['cohort_month'] = (
    df.groupby('user_id')['order_period']
      .transform('min')
)
# ── (b) Cohort age: months since cohort month ───────────
df['cohort_age'] = (df['order_period'] - df['cohort_month']).apply(lambda x: x.n)
# ── (c) Cohort sizes and retention pivot ────────────────
cohort_sizes = (
    df[df['cohort_age'] == 0]
      .groupby('cohort_month')['user_id']
      .nunique()
      .rename('cohort_size')
)
retention_counts = (
    df.groupby(['cohort_month', 'cohort_age'])['user_id']
      .nunique()
      .reset_index()
      .rename(columns={'user_id': 'active_users'})
)
retention_counts = retention_counts.merge(cohort_sizes, on='cohort_month')
retention_counts['retention_rate'] = (
    retention_counts['active_users'] / retention_counts['cohort_size'] * 100
).round(1)
cohort_pivot = retention_counts.pivot_table(
    index='cohort_month',
    columns='cohort_age',
    values='retention_rate'
)
cohort_pivot.index = cohort_pivot.index.astype(str)
# ── (d) Heatmap ─────────────────────────────────────────
fig, ax = plt.subplots(figsize=(12, 6))
sns.heatmap(
    cohort_pivot,
    annot=True,
    fmt='.1f',
    cmap='YlOrRd_r',
    linewidths=0.5,
    linecolor='#f1f5f9',
    cbar_kws={'label': 'Retention Rate (%)'},
    ax=ax
)
ax.set_title('Zomato User Cohort Retention (Monthly)', fontsize=14, fontweight='bold', pad=15)
ax.set_xlabel('Months Since First Order', fontsize=11)
ax.set_ylabel('Cohort Month', fontsize=11)
plt.tight_layout()
plt.savefig('cohort_retention_heatmap.png', dpi=150, bbox_inches='tight')
plt.show()
print(cohort_pivot)
💡 Key insight
dt.to_period('M') converts dates to Period objects that support month subtraction natively — (period_a - period_b).n gives integer months between them. This avoids error-prone manual date math. transform('min') broadcasts the cohort month back to each row without a separate merge step, keeping the code concise.
🔁 Common Follow-Up Questions
  • How would you add a GMV-weighted retention metric alongside the user count retention?
  • How would you identify cohorts that have significantly better M3 retention than average?
  • If you wanted to show week-over-week retention for the first 8 weeks, what would change in the code?


Q8 — Restaurant Rating Prediction
Medium⏱ 14 min

Build a restaurant rating prediction model. Given features: [cuisine_type, avg_prep_time, num_items_on_menu, city_tier, avg_order_value, has_veg_only, weekend_rush_ratio], predict avg rating (1–5). Write Python code for: feature encoding, train-test split, model training using Random Forest, evaluation with MAE and MAPE, and identify the most important feature.

🎯 What the interviewer tests
This is an end-to-end ML pipeline question. The interviewer checks whether you remember to encode categoricals before fitting, whether you scale correctly (Random Forests don’t need scaling but you should say so), whether you know MAE vs MAPE trade-offs, and whether you use feature_importances_ correctly. They may ask why you chose Random Forest over linear regression.

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_absolute_error

# ── Feature engineering ──────────────────────────────────
# df columns: cuisine_type, avg_prep_time, num_items_on_menu,
#             city_tier, avg_order_value, has_veg_only,
#             weekend_rush_ratio, avg_rating (target)

CATEGORICAL_COLS = ['cuisine_type', 'city_tier']
BOOLEAN_COLS     = ['has_veg_only']
NUMERIC_COLS     = ['avg_prep_time', 'num_items_on_menu',
                    'avg_order_value', 'weekend_rush_ratio']
TARGET           = 'avg_rating'

df_model = df.copy()
# Encode categoricals with LabelEncoder
le = {}
for col in CATEGORICAL_COLS:
    le[col] = LabelEncoder()
    df_model[col] = le[col].fit_transform(df_model[col].astype(str))
# Boolean: convert True/False to 1/0
for col in BOOLEAN_COLS:
    df_model[col] = df_model[col].astype(int)
FEATURES = CATEGORICAL_COLS + BOOLEAN_COLS + NUMERIC_COLS
X = df_model[FEATURES]
y = df_model[TARGET]
# ── Train-test split ─────────────────────────────────────
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# ── Model training ───────────────────────────────────────
# Random Forest — no feature scaling needed
rf = RandomForestRegressor(
    n_estimators=200,
    max_depth=8,
    min_samples_leaf=5,
    random_state=42,
    n_jobs=-1
)
rf.fit(X_train, y_train)
# ── Evaluation ───────────────────────────────────────────
y_pred = rf.predict(X_test)
mae  = mean_absolute_error(y_test, y_pred)
# MAPE: avoid division by zero (ratings are always >= 1)
mape = np.mean(np.abs((y_test - y_pred) / y_test)) * 100
print(f"MAE  : {mae:.3f}")
print(f"MAPE : {mape:.2f}%")
# ── Feature importance ───────────────────────────────────
importance_df = pd.DataFrame({
    'feature'   : FEATURES,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)
print("\nFeature Importances:")
print(importance_df.to_string(index=False))
top_feature = importance_df.iloc[0]['feature']
print(f"\nMost important feature: {top_feature}")
💡 Key insight
For restaurant rating prediction, avg_prep_time typically emerges as the most important feature — faster preparation correlates strongly with higher ratings because delivery time directly impacts user satisfaction in food delivery. MAPE is more interpretable than MAE here because it expresses error as a percentage of the actual rating, making it easier to communicate to business stakeholders (“our model is off by X% on average”).
🔁 Common Follow-Up Questions
  • Why did you choose Random Forest over linear regression or XGBoost for this task?
  • How would you handle a restaurant that just launched with zero historical orders — its features would all be null or zero?
  • How would you serve this model in production so Zomato’s backend can call it in real time?


Round 3 — Business Case Study

Business Analytics & Case Study Round
This round tests your ability to structure ambiguous business problems, define the right metrics, design data analysis plans, and communicate findings to non-technical stakeholders. Zomato interviewers look for a structured hypothesis-driven approach — not just intuition. You are expected to think out loud.
Questions 2
Duration 20 minutes
Difficulty Medium
Key Topics Metric Definition
Root Cause Analysis
Hypothesis Testing
Business Strategy
Data Storytelling

Q9 — Cloud Kitchen Expansion in Pune
Medium⏱ 10 min

Zomato is considering launching cloud kitchens (delivery-only restaurants) in Pune. What data analysis would you do to: (a) select which 5 cuisines to launch, (b) identify the top 3 neighborhoods, and (c) set pricing? Define your data sources and analysis methodology.

🎯 What the interviewer tests
They want to see structured thinking — MECE decomposition, not a random list of ideas. The best answers explicitly separate supply-side data (what restaurants exist) from demand-side data (what users order), and use the gap between them to drive decisions. They also check whether you define pricing strategy correctly: cost-plus vs competitive vs value-based pricing.

Part (a) — Cuisine Selection:

I would analyze demand-supply gap rather than pure demand volume. The key data points needed are:

  • Demand signal: Search queries and failed-search events in Pune for each cuisine — high search volume with low order conversion = unmet demand
  • Current supply: Count of restaurants per cuisine type in Pune using Zomato’s restaurant database
  • Order velocity: Orders per restaurant per day for each cuisine — cuisines with high velocity but few restaurants indicate underserved demand
  • Margin profile: Average GMV and estimated food cost per cuisine (use historical discount patterns and food category benchmarks)

Decision criteria: rank cuisines by a composite score = (demand volume × 0.35) + (supply gap ratio × 0.40) + (margin potential × 0.25). Launch the top 5. Typically, Biryani, North Indian, Chinese, South Indian, and Desserts perform well in Tier-1 Indian cities.

Part (b) — Neighborhood Selection:

  • Use the hyperlocal demand model (hexagon grid from Q2) to identify the top demand hexagons in Pune with low restaurant density
  • Overlay with demographic data: office clusters, PG/hostel zones, and residential areas with high young-professional population have the highest delivery frequency
  • Consider delivery time isochrones: select neighborhoods where a cloud kitchen can serve a 3 km radius within 30 minutes, covering maximum hexagons per kitchen

Top 3 target neighborhoods in Pune: Hinjewadi IT corridor (office demand), Viman Nagar (young professionals, dense), Kothrud (large residential, underserved by current supply).

Part (c) — Pricing:

  • Baseline: Competitive pricing — benchmark against top-3 restaurants of the same cuisine in the same neighborhood (Zomato has this data internally)
  • Adjustment: Since cloud kitchens have lower overhead (no dine-in), price 5–10% below market average to drive initial trial orders
  • Dynamic pricing: Increase prices by 15–20% during peak demand slots (6–9 PM Friday–Sunday) where price elasticity is lower because users have fewer alternatives
  • Minimum order value: Set at ₹199–₹249 to ensure delivery economics are positive; communicate clearly to avoid cart abandonment
💡 Key insight
The differentiator in this answer is the demand-supply gap framework — not just “which cuisines are ordered most” but “which cuisines are ordered most relative to available supply.” Zomato already has this data internally through failed searches and high-wait-time signals. Always tie your analysis to a specific metric and decision threshold.
🔁 Common Follow-Up Questions
  • How would you measure whether the cloud kitchen launch was successful — what are your North Star and guardrail metrics?
  • If the first cloud kitchen in Hinjewadi underperforms after 3 months, what’s your diagnostic framework?
  • How would you decide when to add a second cuisine to an existing cloud kitchen location?


Q10 — Declining Order Frequency Diagnosis
Medium⏱ 10 min

A cohort of users who joined 6 months ago is placing 40% fewer orders than the same cohort from 12 months ago did at the same stage of their lifecycle. Walk through your diagnostic framework: 5 hypotheses you would test, the queries you would run, and the interventions you would recommend.

🎯 What the interviewer tests
This is a root-cause analysis question. They want MECE hypotheses that cover all possible causes: supply-side (restaurant quality, availability), demand-side (price sensitivity, competition), product (app experience), external (seasonality, macro). Great candidates distinguish between correlation and causation and propose interventions that are testable via A/B experiments.

I would structure the diagnosis across 5 mutually exclusive hypotheses:

H1 — Cohort Quality Degradation: The newer cohort was acquired via lower-quality channels (discount-driven, not intent-driven) and was never genuinely engaged. Query: compare the acquisition channel distribution of the two cohorts and their Day-7 repeat rate. Intervention: tighten acquisition targeting and reduce deep-discount new-user offers.

H2 — Increased Competition: Swiggy or a local player ran an aggressive promotional campaign targeting the same users during the same period. Query: segment the newer cohort by city and check if the drop is concentrated in cities where Swiggy increased restaurant coverage. Intervention: defensive promotions in at-risk cities, exclusive restaurant partnerships.

H3 — Product or UX Degradation: A product change (menu redesign, search algorithm update, delivery ETA display) negatively impacted the experience for new users. Query: compare app session-to-order conversion rate for both cohorts at the same cohort age, and check for any feature rollouts in the 6-month period. Intervention: rollback or A/B test the suspect feature.

H4 — Restaurant Supply Contraction: Fewer restaurants are accepting orders in the areas where newer cohort users live, leading to more failed sessions. Query: for each cohort, calculate the average number of available restaurants per user session and the failed-search rate. Intervention: improve restaurant onboarding in high-demand zones, launch cloud kitchens.

H5 — External / Seasonality Effect: The 6-month-ago cohort joined at a different season (e.g., post-festive January vs peak-season July) and the absolute period-over-period comparison is not lifecycle-adjusted. Query: normalize both cohorts to the same calendar period and re-run the comparison — if the gap shrinks, it is seasonality, not churn. Intervention: adjust cohort comparison methodology; use seasonally adjusted benchmarks going forward.

💡 Key insight
Before jumping to interventions, always check whether the metric definition is correct. A 40% drop sounds alarming but could be explained entirely by seasonality (H5). Validate the measurement first. If the drop is real, prioritize H1 and H3 because they are fully within Zomato’s control and can be fixed fastest. H2 and H4 require external coordination.
🔁 Common Follow-Up Questions
  • If you could only run one query to triage this problem in 30 minutes, what would it be?
  • How would you design an A/B test to measure the impact of a re-engagement campaign on this cohort?
  • How do you separate the effect of acquisition channel quality from genuine lifecycle disengagement?


4-Week Zomato DA Interview Prep Plan

WEEK 1

Build Foundation

Revise SQL fundamentals — GROUP BY, HAVING, JOINs, subqueries. Practice Python pandas — groupby, merge, pivot_table. Study case study frameworks: MECE, metric trees, funnel analysis. Read The Data Monk’s food-tech and product analytics interview pattern posts.

WEEK 2

2200 Most Asked Analytics Interview Questions

Work through the SQL and Python sections — both are directly tested at Zomato. Focus especially on cohort analysis, window functions, and data cleaning chapters. Complete all 200+ SQL questions in the analytics section.

End of Week 2: Book your first mock interview →
topmate.io/nitin_kamal
WEEK 3

Ace Any SQL Interview

Work through all 220+ questions in the Ace Any SQL ebook. Cover window functions, CTEs, performance tuning, and data model design. Practice re-writing every solution from scratch without looking at the answer — timed at 12 minutes per question.

End of Week 3: Book your second mock interview →
topmate.io/nitin_kamal
WEEK 4

Revise + Timed Practice

Revise the first 5 chapters of the 2200 ebook. Do daily timed practice: 15 minutes per SQL question, 20 minutes per case study. Solve all 10 questions in this guide from memory. Practice explaining your thought process out loud.

End of Week 4: Final mock interview →
topmate.io/nitin_kamal
Summary: ~950 questions + 3 mock interviews = enough to crack the Zomato Data Analyst interview. Or shortcut the process with our mentorship program (3-tier packages below).


Scroll to Top