If you are targeting a Walmart Data Engineer role (Walmart Global Tech / Walmart Labs), you are interviewing for one of the world’s largest retail data engineering environments — 10,500+ stores, 240 million weekly customers, and petabytes of POS, e-commerce, and supply chain data. The Walmart Data Engineer Interview Questions test advanced SQL for retail scenarios, PySpark for large-scale ETL, Hive/HDFS architecture, Airflow pipeline orchestration, and your ability to handle massive data skew from high-GMV product categories. This guide covers all three rounds with exact questions, complete answers, and the frameworks Walmart interviewers score on.
Python
Apache Spark
Hive
Airflow
HDFS
Round 1 — Walmart Data Engineer Interview Questions: SQL & Retail Analytics (60 mins)
| Questions | 5 |
| Duration | 60 minutes |
| Difficulty | Medium–Hard |
| Key Topics | Inventory TurnoverStore PerformanceSupplier AnalyticsWindow FunctionsSlowly Changing Dims |
The store operations team wants to flag stores whose weekly sales deviated by more than 2 standard deviations from their own 12-week rolling average. These are either exceptional performers (promotions success) or underperformers (supply issues). Return store, week, actual sales, rolling average, rolling std, z-score, and flag direction (ABOVE/BELOW).
Table:
- store_weekly_sales(store_id, week_start, department, total_sales)
AVG() OVER (... ROWS BETWEEN 11 PRECEDING AND CURRENT ROW) includes the current week in the mean, which makes the z-score calculation slightly circular. The correct approach is to compute the rolling stats over the PRIOR 12 weeks (ROWS BETWEEN 12 PRECEDING AND 1 PRECEDING) and then compare current week against them. Also: STDDEV_POP vs STDDEV_SAMP — the interviewer will ask which to use (STDDEV_SAMP for small n).WITH rolling AS (
SELECT
store_id,
week_start,
total_sales,
AVG(total_sales) OVER (
PARTITION BY store_id
ORDER BY week_start
ROWS BETWEEN 12 PRECEDING AND 1 PRECEDING
) AS rolling_avg_12w,
STDDEV_SAMP(total_sales) OVER (
PARTITION BY store_id
ORDER BY week_start
ROWS BETWEEN 12 PRECEDING AND 1 PRECEDING
) AS rolling_std_12w,
COUNT(*) OVER (
PARTITION BY store_id
ORDER BY week_start
ROWS BETWEEN 12 PRECEDING AND 1 PRECEDING
) AS window_size
FROM store_weekly_sales
),
zscore AS (
SELECT
store_id,
week_start,
total_sales,
ROUND(rolling_avg_12w, 2) AS rolling_avg,
ROUND(rolling_std_12w, 2) AS rolling_std,
ROUND((total_sales - rolling_avg_12w)
/ NULLIF(rolling_std_12w, 0), 2) AS z_score,
window_size
FROM rolling
WHERE window_size >= 8 -- need at least 8 weeks of history for reliable stats
)
SELECT
store_id,
week_start,
total_sales,
rolling_avg,
rolling_std,
z_score,
CASE WHEN z_score > 2 THEN 'ABOVE_NORMAL'
WHEN z_score < -2 THEN 'BELOW_NORMAL'
END AS flag
FROM zscore
WHERE ABS(z_score) > 2
ORDER BY week_start DESC, ABS(z_score) DESC;
ROWS BETWEEN 12 PRECEDING AND 1 PRECEDING excludes the current row from the rolling window, making the z-score non-circular (comparing current week to a baseline computed entirely from prior data). The WHERE window_size >= 8 filter prevents false alarms for new stores with <8 weeks of history — a 2σ threshold with 2 data points is meaningless. NULLIF(rolling_std_12w, 0) handles the rare case where a store had identical sales every week (std = 0 → division by zero).- You find that 30% of stores are flagged as outliers in December every year. How would you adjust the model to account for seasonal spikes?
- This query runs on a Hive table with 5 years of data across 4,000 stores. What partition strategy would you use to avoid a full scan?
Compute the inventory turnover ratio (COGS / average inventory value) for each SKU in each store for last quarter. Flag SKUs with turnover ratio below the 10th percentile within their category — these are slow-moving items that need markdown or redistribution decisions.
Tables:
- inventory_snapshots(snapshot_id, store_id, sku_id, category, inventory_value, snapshot_date)
- sales_transactions(txn_id, store_id, sku_id, units_sold, cogs, txn_date)
WITH quarter_cogs AS (
SELECT
store_id,
sku_id,
SUM(cogs) AS total_cogs
FROM sales_transactions
WHERE txn_date >= DATE_TRUNC('quarter', CURRENT_DATE - INTERVAL '3 months')
AND txn_date < DATE_TRUNC('quarter', CURRENT_DATE)
GROUP BY store_id, sku_id
),
avg_inventory AS (
SELECT
store_id,
sku_id,
category,
AVG(inventory_value) AS avg_inv_value
FROM inventory_snapshots
WHERE snapshot_date >= DATE_TRUNC('quarter', CURRENT_DATE - INTERVAL '3 months')
AND snapshot_date < DATE_TRUNC('quarter', CURRENT_DATE)
GROUP BY store_id, sku_id, category
),
turnover AS (
SELECT
a.store_id,
a.sku_id,
a.category,
a.avg_inv_value,
c.total_cogs,
c.total_cogs / NULLIF(a.avg_inv_value, 0) AS turnover_ratio
FROM avg_inventory a
LEFT JOIN quarter_cogs c USING (store_id, sku_id)
),
with_pctile AS (
SELECT
*,
PERCENTILE_CONT(0.10) WITHIN GROUP (ORDER BY turnover_ratio)
OVER (PARTITION BY category) AS p10_turnover
FROM turnover
)
SELECT
store_id,
sku_id,
category,
ROUND(avg_inv_value, 2) AS avg_inv_value,
ROUND(total_cogs, 2) AS total_cogs,
ROUND(turnover_ratio, 3) AS turnover_ratio,
ROUND(p10_turnover, 3) AS category_p10,
'SLOW_MOVING' AS flag
FROM with_pctile
WHERE turnover_ratio < p10_turnover
OR turnover_ratio IS NULL -- zero sales = definitely slow-moving
ORDER BY category, turnover_ratio NULLS FIRST;
PERCENTILE_CONT(0.10) WITHIN GROUP (ORDER BY ...) OVER (PARTITION BY category) computes the 10th percentile per category using the ordered-set aggregate syntax — this is cleaner than a self-join. SKUs with zero COGS (no sales) get turnover_ratio = NULL from the LEFT JOIN — they are explicitly flagged in the WHERE turnover_ratio IS NULL clause since they are the worst slow-movers (zero velocity). At Walmart, turnover below the 10th percentile triggers an automated markdown recommendation workflow.- A SKU is slow-moving in Store A but the top-selling item in Store B. How would you factor cross-store performance before recommending a markdown?
- The inventory_snapshots table is 800 GB and is not partitioned. How would you restructure it to make this query efficient on Hive?
Round 2 — Walmart Data Engineer Interview Questions: Spark & Pipeline Design (60 mins)
| Questions | 4 |
| Duration | 60 minutes |
| Difficulty | Medium–Hard |
| Key Topics | Data SkewSpark OptimizationIncremental ETLPartitioningSCD Type 2 |
You are aggregating daily sales by store_id and sku_id from a 2 billion row transaction table. Spark keeps getting OOM errors on 3 reducers that handle “Electronics” SKUs — a single SKU (iPhone 16) contributes 40% of all rows. Diagnose the problem and fix it using PySpark.
df.groupBy('sku_id').count(); (2) Fix — salting the skewed key or using Spark’s adaptive query execution (AQE) with spark.sql.adaptive.skewJoin.enabled=true; (3) Trade-offs — salting requires a map-side de-salt step and can’t be used everywhere. Many candidates know AQE but don’t know salting.from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder \
.config("spark.sql.adaptive.enabled", "true") \
.config("spark.sql.adaptive.skewJoin.enabled", "true") \
.config("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5") \
.config("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB") \
.getOrCreate()
# Approach 1: Enable AQE (zero code change, handles skew at runtime)
# AQE splits skewed partitions automatically — best for simple aggregations
df = spark.read.parquet("s3://walmart-datalake/transactions/dt=2026-05-19/")
# Approach 2: Manual salting (when AQE is not enough or Spark < 3.0)
SALT = 20 # number of salt buckets
df_salted = df.withColumn(
"salted_sku_id",
F.concat(
F.col("sku_id"),
F.lit("_"),
(F.rand() * SALT).cast("int").cast("string")
)
)
agg_salted = df_salted.groupBy("store_id", "salted_sku_id", "txn_date") \
.agg(
F.sum("revenue").alias("revenue"),
F.sum("units").alias("units"),
F.sum("cogs").alias("cogs")
)
# De-salt: strip the suffix and re-aggregate
agg_final = agg_salted \
.withColumn("sku_id",
F.split(F.col("salted_sku_id"), "_").getItem(0)
) \
.groupBy("store_id", "sku_id", "txn_date") \
.agg(
F.sum("revenue").alias("total_revenue"),
F.sum("units").alias("total_units"),
F.sum("cogs").alias("total_cogs")
)
agg_final.write.mode("overwrite") \
.partitionBy("txn_date") \
.parquet("s3://walmart-datalake/daily_sales_agg/")
- You need to JOIN this sales table with a 500 MB product master table. Which join strategy (broadcast vs sort-merge) should you use and why?
- The daily pipeline is running for 3 hours but has an SLA of 2 hours. What Spark metrics would you check first to identify the bottleneck?
Walmart’s store inventory changes thousands of times per day (restock, markdown, shrink). Design an incremental ETL pipeline that loads only new/changed inventory records since the last run, and maintains SCD Type 2 history (so analysts can query “what was the inventory of SKU X in Store Y on a specific date”).
valid_from, valid_to, is_current columns? (2) Can you design an efficient change detection strategy — updated_at watermark or CDC (Change Data Capture) via Debezium? (3) Do you think about the MERGE/UPSERT operation — which DE tools support it (Delta Lake, Hudi, Iceberg) vs which don’t (plain Hive)?Target table schema (SCD Type 2):
- inventory_history(surrogate_key, store_id, sku_id, inventory_qty, unit_cost, valid_from, valid_to, is_current)
valid_to = NULLandis_current = TRUEfor the active record
Incremental extraction (Airflow DAG, runs hourly):
- Read last successful watermark from a
pipeline_statetable:SELECT max_updated_at FROM pipeline_state WHERE pipeline='inventory_etl' - Pull changed records:
SELECT * FROM source_inventory WHERE updated_at > :watermark ORDER BY updated_at - Write to staging:
stg_inventory_changes(no SCD logic yet, just raw deltas)
SCD Type 2 MERGE (using Delta Lake on Azure Data Lake):
from delta.tables import DeltaTable
from pyspark.sql import functions as F
from datetime import datetime
# Load current target and staged changes
target = DeltaTable.forPath(spark, "abfss://warehouse/inventory_history/")
staged = spark.table("stg_inventory_changes")
# Step 1: Close out changed records in target (set valid_to = now, is_current = False)
target.alias("tgt").merge(
staged.alias("src"),
"tgt.store_id = src.store_id AND tgt.sku_id = src.sku_id AND tgt.is_current = true"
).whenMatchedUpdate(
condition = "tgt.inventory_qty != src.inventory_qty OR tgt.unit_cost != src.unit_cost",
set = {
"is_current": F.lit(False),
"valid_to": F.lit(datetime.utcnow())
}
).execute()
# Step 2: Insert new records for changed rows (is_current = True)
new_records = staged.join(
spark.table("inventory_history").filter("is_current = false AND valid_to >= current_timestamp() - interval 1 minute"),
on=["store_id", "sku_id"],
how="inner"
).select(
F.monotonically_increasing_id().alias("surrogate_key"),
"store_id", "sku_id", "inventory_qty", "unit_cost",
F.current_timestamp().alias("valid_from"),
F.lit(None).alias("valid_to"),
F.lit(True).alias("is_current")
)
new_records.write.format("delta").mode("append").save("abfss://warehouse/inventory_history/")
# Step 3: Update watermark
spark.sql(f"UPDATE pipeline_state SET max_updated_at = '{datetime.utcnow()}' WHERE pipeline = 'inventory_etl'")
updated_at > last_watermark will re-process recent changes safely (records that already exist in target will just trigger the “no update needed” branch of MERGE).- The source system doesn’t have an
updated_atcolumn. How do you detect changes for the incremental load? (Answer: CDC via Debezium reading the database transaction log) - How would a user query this SCD Type 2 table to find inventory as of a specific past date across all stores?
📚 Recommended Resources to Crack Walmart DE Interview
Used by 40,000+ engineers to crack data engineering and analytics interviews
- 2,200+ questions · 23 topics
- SQL, Python, Data Engineering, Spark
- Retail and e-commerce analytics patterns
- 220+ questions · SQL advanced, data models, DBMS
- Window functions, CTEs, query optimization
- Covers retail and supply chain query patterns
4-Week Prep Plan: Crack Walmart Data Engineer Interview
SQL + Spark Foundations + Walmart Product Context
Master window functions, rolling statistics, inventory analytics SQL. Study Spark architecture: RDDs vs DataFrames, partitioning, shuffles, AQE. Learn Walmart’s data landscape: POS data, supply chain, store hierarchy, e-commerce. Read TDM posts on data engineering interview patterns.
2200 Most Asked Analytics Interview Questions — DE & SQL Sections
Cover SQL, Python, data engineering, and pipeline design chapters. Focus on Spark optimization, ETL patterns, and incremental load scenarios. Do 100+ questions per day. End of Week 2: First mock interview →
Ace Any SQL Interview + Pipeline Architecture Deep Dive
Complete all 220+ questions in Ace Any SQL Interview. Practice 2 pipeline design problems per day — SCD Type 2, data skew handling, incremental ETL, Delta Lake MERGE. Study Airflow DAG design and error handling. End of Week 3: Second mock interview →
Timed Practice + Final Revision
Revise first 5 chapters of 2200 ebook. Timed SQL: 15 min per query. Timed Spark design: 25 min per system design. Simulate Walmart’s 3-round pace. End of Week 4: Final mock interview →