If you are targeting a Paytm BI Engineer role, you need to be ready for a pipeline-heavy interview unlike standard DA roles. The Paytm BI Engineer Interview Questions test advanced SQL for payments data (failed transactions, wallet balances, UPI flows), data pipeline architecture (Spark, Hive, Airflow scheduling), dashboard design for regulatory and business reporting, and your ability to handle near-real-time fintech data at scale. This guide covers all three rounds with exact questions, complete answers, and the frameworks that Paytm interviewers look for.
Python
Data Pipelines
Dashboard Design
Spark/Hive
Round 1 β Paytm BI Engineer Interview Questions: Advanced SQL (60 mins)
| Questions | 5 |
| Duration | 60 minutes |
| Difficulty | MediumβHard |
| Key Topics | Running BalancesFailure AttributionUPI FunnelWindow FunctionsMerchant Analytics |
The payments ops team wants to understand which banks are causing the most UPI failures. For each bank (identified by IFSC prefix) and failure reason, compute: total failures, failure rate vs total attempts, and consecutive failure streak β the maximum number of consecutive failed transactions for any single user on that bank in the last 7 days.
Tables:
- upi_transactions(txn_id, user_id, bank_ifsc, amount, status, failure_reason, created_at)
ROW_NUMBER() from a running count of failures β consecutive failures will share the same group. The interviewer wants to see if you know this pattern without a hint. Mention IFSC prefix extraction via LEFT(bank_ifsc, 4) to group banks.WITH recent_txns AS (
SELECT
txn_id,
user_id,
LEFT(bank_ifsc, 4) AS bank_prefix,
status,
failure_reason,
created_at,
ROW_NUMBER() OVER (PARTITION BY user_id, LEFT(bank_ifsc,4)
ORDER BY created_at) AS rn_all,
ROW_NUMBER() OVER (PARTITION BY user_id, LEFT(bank_ifsc,4),
CASE WHEN status='failed' THEN 1 ELSE 0 END
ORDER BY created_at) AS rn_status
FROM upi_transactions
WHERE created_at >= CURRENT_DATE - INTERVAL '7 days'
),
failure_groups AS (
SELECT
txn_id, user_id, bank_prefix, status, failure_reason, created_at,
CASE WHEN status = 'failed'
THEN rn_all - rn_status END AS grp_id
FROM recent_txns
WHERE status = 'failed'
),
streaks AS (
SELECT
user_id, bank_prefix,
grp_id,
COUNT(*) AS streak_len
FROM failure_groups
GROUP BY user_id, bank_prefix, grp_id
),
max_streaks AS (
SELECT bank_prefix, MAX(streak_len) AS max_streak
FROM streaks
GROUP BY bank_prefix
),
failure_summary AS (
SELECT
LEFT(bank_ifsc, 4) AS bank_prefix,
failure_reason,
COUNT(*) AS total_failures,
COUNT(*) * 1.0 / NULLIF(SUM(COUNT(*)) OVER
(PARTITION BY LEFT(bank_ifsc,4)), 0) * 100 AS failure_rate_pct
FROM upi_transactions
WHERE created_at >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY LEFT(bank_ifsc, 4), failure_reason
)
SELECT
fs.bank_prefix,
fs.failure_reason,
fs.total_failures,
ROUND(fs.failure_rate_pct, 1) AS failure_rate_pct,
ms.max_streak AS max_consecutive_failures
FROM failure_summary fs
LEFT JOIN max_streaks ms USING (bank_prefix)
ORDER BY fs.total_failures DESC;
rn_all - rn_status gives a stable group ID for consecutive runs of the same status. Within a failure group, all rows share the same grp_id even if their absolute row numbers differ. NULLIF(SUM(COUNT(*)) OVER PARTITION BY bank_prefix, 0) computes total attempts per bank (not just failures) as the denominator for failure rate.- If a bank has a 40% failure rate on UPI but it’s only 2% of Paytm’s transaction volume, how do you decide whether to escalate to the bank’s API team?
- How would you build a real-time alert when any bank’s 15-minute rolling failure rate exceeds 25%? What infrastructure would you use?
The risk team reports that some users had a negative wallet balance at certain points (a system bug). Compute each user’s running wallet balance after every transaction (credits positive, debits negative) and flag the first transaction that caused a negative balance for each user.
Table:
- wallet_transactions(txn_id, user_id, txn_type, amount, created_at)
txn_type:'credit'or'debit'
SUM() OVER (PARTITION BY user_id ORDER BY created_at ROWS UNBOUNDED PRECEDING) is the core technique. The interviewer checks whether you handle txn_type correctly β debits must be sign-flipped. The “first negative” flag requires a secondary filter: WHERE running_balance < 0 followed by ROW_NUMBER() = 1 per user to get only the first offending event.WITH signed_txns AS (
SELECT
txn_id,
user_id,
created_at,
txn_type,
amount,
CASE WHEN txn_type = 'credit' THEN amount
WHEN txn_type = 'debit' THEN -amount
END AS signed_amount
FROM wallet_transactions
),
running AS (
SELECT
txn_id,
user_id,
txn_type,
amount,
created_at,
SUM(signed_amount) OVER (
PARTITION BY user_id
ORDER BY created_at
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS running_balance
FROM signed_txns
),
first_negative AS (
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at) AS rn
FROM running
WHERE running_balance < 0
)
SELECT
user_id,
txn_id AS first_negative_txn,
txn_type,
amount,
created_at,
running_balance AS balance_after_txn
FROM first_negative
WHERE rn = 1
ORDER BY created_at;
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW is explicit and avoids subtle bugs with duplicate timestamps. If two transactions share the same created_at, add a tie-breaker: ORDER BY created_at, txn_id. At Paytm, this type of audit query is used post-incident to identify the exact debit transaction that triggered the race condition in the wallet service.- If the same user had 3 separate negative balance events across different dates, how do you report all 3 with the balance before and after each?
- The wallet table has 500 million rows. How would you optimize this query β partitioning strategy, indexing, or materialized views?
The merchant analytics team wants a report showing each merchant’s GMV for last month, their rank within their category, and their GMV percentile within that category. Also flag merchants who dropped from the top quartile last month to below the median this month.
Tables:
- merchant_transactions(txn_id, merchant_id, category, gmv, txn_date, status)
WITH monthly_gmv AS (
SELECT
merchant_id,
category,
DATE_TRUNC('month', txn_date) AS month,
SUM(gmv) AS total_gmv
FROM merchant_transactions
WHERE status = 'success'
AND txn_date >= DATE_TRUNC('month', CURRENT_DATE - INTERVAL '2 months')
GROUP BY merchant_id, category, DATE_TRUNC('month', txn_date)
),
ranked AS (
SELECT
merchant_id,
category,
month,
total_gmv,
RANK() OVER (PARTITION BY category, month ORDER BY total_gmv DESC) AS gmv_rank,
PERCENT_RANK() OVER (PARTITION BY category, month ORDER BY total_gmv) AS gmv_percentile
FROM monthly_gmv
),
curr AS (SELECT * FROM ranked WHERE month = DATE_TRUNC('month', CURRENT_DATE - INTERVAL '1 month')),
prev AS (SELECT * FROM ranked WHERE month = DATE_TRUNC('month', CURRENT_DATE - INTERVAL '2 months'))
SELECT
c.merchant_id,
c.category,
c.total_gmv,
c.gmv_rank,
ROUND(c.gmv_percentile * 100, 1) AS gmv_percentile_pct,
CASE
WHEN p.gmv_percentile >= 0.75
AND c.gmv_percentile < 0.50 THEN 'DROPPED_FROM_TOP_QUARTILE'
ELSE 'STABLE'
END AS status_flag
FROM curr c
LEFT JOIN prev p USING (merchant_id, category)
ORDER BY c.category, c.gmv_rank;
PERCENT_RANK() returns 0 for the lowest value and 1 for the highest. So gmv_percentile >= 0.75 means top quartile last month. The LEFT JOIN preserves merchants who are new this month (no prev record) β they get NULL for prev percentile and default to STABLE in the CASE. Make sure to filter status = 'success' β Paytm’s payment tables include pending and failed rows that inflate GMV if unfiltered.- How would you build a Metabase or Superset dashboard that shows this merchant ranking table and auto-refreshes daily from the data pipeline?
- A merchant operates across two categories β how do you report their cross-category total GMV while preserving category-level ranking?
Round 2 β Paytm BI Engineer Interview Questions: Pipeline Design & Architecture (45 mins)
| Questions | 3 |
| Duration | 45 minutes |
| Difficulty | Medium |
| Key Topics | Airflow DAGsIncremental LoadPartitioningData QualityReal-time vs Batch |
The Finance team needs a daily dashboard showing: total transaction volume (success/failure by category), net settlement amount, and top 10 UPI failure reasons β all refreshed by 7 AM IST from yesterday’s data. Design the complete pipeline from raw data to dashboard.
Layer 1 β Raw Ingestion (12 AM β 2 AM):
- Kafka consumers write payment events to HDFS/S3 raw zone, partitioned by
dt=YYYY-MM-DD/hr=HH - No transformation β raw events preserved for audit and replay
- Airflow DAG:
ingest_payments_rawβ hourly, checks last Kafka offset committed per partition
Layer 2 β Staging / Transformation (2 AM β 4 AM):
- PySpark job reads yesterday’s raw partition, applies cleaning: dedup on
txn_id, correct IFSC format, cast amount to decimal - Write to staging Hive table:
stg_paymentspartitioned bydt - Data quality checks (Great Expectations or custom):
- Row count vs Kafka offset count: <0.1% variance allowed
- NULL rate on
bank_ifscβ€ 2% (more means upstream issue) - Settlement amount sign consistency: all debits negative
Layer 3 β Aggregation / Presentation (4 AM β 5:30 AM):
- Hive/SparkSQL job computes KPI aggregates β writes to
rpt_daily_payments_kpi - Table materialized in the BI layer (Redshift or BigQuery) for fast dashboard queries
Layer 4 β Dashboard Refresh (5:30 AM β 6:30 AM):
- Superset/Metabase scheduled refresh triggered by Airflow sensor on KPI table write timestamp
- Finance team alert: Slack notification at 7 AM confirming dashboard ready
SLA breach handling: If staging fails, Airflow retries 3x with 10-minute backoff. If still failing at 6 AM, PagerDuty alert to on-call BI engineer. Dashboard shows “Data as of D-2” with a banner until D-1 data is backfilled.
- A batch from 2:15 AM arrived late at 4:30 AM. How does your pipeline detect and re-process the late data without re-running the full day?
- The Finance team wants the dashboard refreshed every 15 minutes during market hours (9 AM β 5 PM). How does your architecture change?
π Recommended Resources to Crack Paytm BI Engineer Interview
Used by 40,000+ analysts to crack fintech and product company interviews
- 2,200+ questions Β· 23 topics
- SQL, Python, Data Engineering, BI Design
- Fintech-specific scenario questions
- 220+ questions Β· SQL advanced, data models, DBMS
- Window functions, CTEs, query optimization
- Covers fintech payment table patterns
4-Week Prep Plan: Crack Paytm BI Engineer Interview
SQL Foundation β Window Functions + Pipeline Basics
Master running totals, gaps-and-islands, percentile functions, and multi-level CTEs. Learn Airflow DAG design basics. Study Paytm’s product model: UPI, wallet, merchant payments, KYC. Read TDM posts on BI engineer interview patterns for fintech.
2200 Most Asked Analytics Interview Questions β SQL & DE Sections
Cover SQL, Python, data engineering, and BI design chapters. Focus on payment-domain questions (transaction success rates, rolling averages, funnel analysis). Do 100+ questions per day. End of Week 2: First mock interview β
Ace Any SQL Interview + Pipeline Architecture Practice
Complete all 220+ questions in Ace Any SQL Interview. Practice 2 pipeline design questions per day β walk through: ingestion β staging β aggregation β dashboard β SLA. Study Spark partitioning, Hive optimization, and Airflow retry patterns. End of Week 3: Second mock interview β
Timed Practice + Revision
Revise first 5 chapters of 2200 ebook. Timed SQL practice: 15 min per query. Simulate Paytm’s pipeline design round under 45-minute time pressure. End of Week 4: Final mock interview β