Home Business Intelligence Engineer Walmart Data Engineer Interview Questions and Answers 2026
Business Intelligence Engineer Case Study Company Interview Questions Interview Questions

Walmart Data Engineer Interview Questions and Answers 2026

If you are targeting a Walmart Data Engineer role (Walmart Global Tech / Walmart Labs), you are interviewing for one of the world’s largest retail data engineering environments — 10,500+ stores, 240 million weekly customers, and petabytes of POS, e-commerce, and supply chain data. The Walmart Data Engineer Interview Questions test advanced SQL for retail scenarios, PySpark for large-scale ETL, Hive/HDFS architecture, Airflow pipeline orchestration, and your ability to handle massive data skew from high-GMV product categories. This guide covers all three rounds with exact questions, complete answers, and the frameworks Walmart interviewers score on.

Walmart
Data Engineer
Retail Data Engineering / Supply Chain Analytics
SQL
Python
Apache Spark
Hive
Airflow
HDFS
Salary Range
₹15–28 LPA
2–6 Years Exp

Round 1 — Walmart Data Engineer Interview Questions: SQL & Retail Analytics (60 mins)

SQL Round (Retail Data Focus)
Walmart’s SQL round tests your ability to work with retail data at scale — transactions, inventory, store hierarchies, supplier shipments. The interviewer looks for efficient query patterns: partitioning awareness, avoiding full scans, handling data skew from Tier 1 products (top 1% of SKUs drive 40% of GMV). You are expected to explain execution plans, not just write queries.
Questions 5
Duration 60 minutes
Difficulty Medium–Hard
Key Topics Inventory TurnoverStore PerformanceSupplier AnalyticsWindow FunctionsSlowly Changing Dims
Q1 — Store Sales Seasonality: Identify Week-Over-Week Outliers
Hard⏱ 20 min

The store operations team wants to flag stores whose weekly sales deviated by more than 2 standard deviations from their own 12-week rolling average. These are either exceptional performers (promotions success) or underperformers (supply issues). Return store, week, actual sales, rolling average, rolling std, z-score, and flag direction (ABOVE/BELOW).

Table:

  • store_weekly_sales(store_id, week_start, department, total_sales)
🎯 What the interviewer tests
Rolling statistics with window functions. The trap: AVG() OVER (... ROWS BETWEEN 11 PRECEDING AND CURRENT ROW) includes the current week in the mean, which makes the z-score calculation slightly circular. The correct approach is to compute the rolling stats over the PRIOR 12 weeks (ROWS BETWEEN 12 PRECEDING AND 1 PRECEDING) and then compare current week against them. Also: STDDEV_POP vs STDDEV_SAMP — the interviewer will ask which to use (STDDEV_SAMP for small n).

WITH rolling AS (
    SELECT
        store_id,
        week_start,
        total_sales,
        AVG(total_sales) OVER (
            PARTITION BY store_id
            ORDER BY week_start
            ROWS BETWEEN 12 PRECEDING AND 1 PRECEDING
        )  AS rolling_avg_12w,
        STDDEV_SAMP(total_sales) OVER (
            PARTITION BY store_id
            ORDER BY week_start
            ROWS BETWEEN 12 PRECEDING AND 1 PRECEDING
        )  AS rolling_std_12w,
        COUNT(*) OVER (
            PARTITION BY store_id
            ORDER BY week_start
            ROWS BETWEEN 12 PRECEDING AND 1 PRECEDING
        )  AS window_size
    FROM store_weekly_sales
),
zscore AS (
    SELECT
        store_id,
        week_start,
        total_sales,
        ROUND(rolling_avg_12w, 2)                                       AS rolling_avg,
        ROUND(rolling_std_12w, 2)                                       AS rolling_std,
        ROUND((total_sales - rolling_avg_12w)
              / NULLIF(rolling_std_12w, 0), 2)                          AS z_score,
        window_size
    FROM rolling
    WHERE window_size >= 8    -- need at least 8 weeks of history for reliable stats
)
SELECT
    store_id,
    week_start,
    total_sales,
    rolling_avg,
    rolling_std,
    z_score,
    CASE WHEN z_score > 2  THEN 'ABOVE_NORMAL'
         WHEN z_score < -2 THEN 'BELOW_NORMAL'
    END  AS flag
FROM zscore
WHERE ABS(z_score) > 2
ORDER BY week_start DESC, ABS(z_score) DESC;
💡 Key insight
ROWS BETWEEN 12 PRECEDING AND 1 PRECEDING excludes the current row from the rolling window, making the z-score non-circular (comparing current week to a baseline computed entirely from prior data). The WHERE window_size >= 8 filter prevents false alarms for new stores with <8 weeks of history — a 2σ threshold with 2 data points is meaningless. NULLIF(rolling_std_12w, 0) handles the rare case where a store had identical sales every week (std = 0 → division by zero).
🔁 Common Follow-Up Questions
  • You find that 30% of stores are flagged as outliers in December every year. How would you adjust the model to account for seasonal spikes?
  • This query runs on a Hive table with 5 years of data across 4,000 stores. What partition strategy would you use to avoid a full scan?
Q2 — Inventory Turnover Ratio: Identify Slow-Moving SKUs
Medium⏱ 15 min

Compute the inventory turnover ratio (COGS / average inventory value) for each SKU in each store for last quarter. Flag SKUs with turnover ratio below the 10th percentile within their category — these are slow-moving items that need markdown or redistribution decisions.

Tables:

  • inventory_snapshots(snapshot_id, store_id, sku_id, category, inventory_value, snapshot_date)
  • sales_transactions(txn_id, store_id, sku_id, units_sold, cogs, txn_date)

WITH quarter_cogs AS (
    SELECT
        store_id,
        sku_id,
        SUM(cogs)  AS total_cogs
    FROM sales_transactions
    WHERE txn_date >= DATE_TRUNC('quarter', CURRENT_DATE - INTERVAL '3 months')
      AND txn_date <  DATE_TRUNC('quarter', CURRENT_DATE)
    GROUP BY store_id, sku_id
),
avg_inventory AS (
    SELECT
        store_id,
        sku_id,
        category,
        AVG(inventory_value)  AS avg_inv_value
    FROM inventory_snapshots
    WHERE snapshot_date >= DATE_TRUNC('quarter', CURRENT_DATE - INTERVAL '3 months')
      AND snapshot_date <  DATE_TRUNC('quarter', CURRENT_DATE)
    GROUP BY store_id, sku_id, category
),
turnover AS (
    SELECT
        a.store_id,
        a.sku_id,
        a.category,
        a.avg_inv_value,
        c.total_cogs,
        c.total_cogs / NULLIF(a.avg_inv_value, 0)  AS turnover_ratio
    FROM avg_inventory a
    LEFT JOIN quarter_cogs c USING (store_id, sku_id)
),
with_pctile AS (
    SELECT
        *,
        PERCENTILE_CONT(0.10) WITHIN GROUP (ORDER BY turnover_ratio)
            OVER (PARTITION BY category)  AS p10_turnover
    FROM turnover
)
SELECT
    store_id,
    sku_id,
    category,
    ROUND(avg_inv_value, 2)  AS avg_inv_value,
    ROUND(total_cogs, 2)     AS total_cogs,
    ROUND(turnover_ratio, 3) AS turnover_ratio,
    ROUND(p10_turnover, 3)   AS category_p10,
    'SLOW_MOVING'            AS flag
FROM with_pctile
WHERE turnover_ratio < p10_turnover
   OR turnover_ratio IS NULL   -- zero sales = definitely slow-moving
ORDER BY category, turnover_ratio NULLS FIRST;
💡 Key insight
PERCENTILE_CONT(0.10) WITHIN GROUP (ORDER BY ...) OVER (PARTITION BY category) computes the 10th percentile per category using the ordered-set aggregate syntax — this is cleaner than a self-join. SKUs with zero COGS (no sales) get turnover_ratio = NULL from the LEFT JOIN — they are explicitly flagged in the WHERE turnover_ratio IS NULL clause since they are the worst slow-movers (zero velocity). At Walmart, turnover below the 10th percentile triggers an automated markdown recommendation workflow.
🔁 Common Follow-Up Questions
  • A SKU is slow-moving in Store A but the top-selling item in Store B. How would you factor cross-store performance before recommending a markdown?
  • The inventory_snapshots table is 800 GB and is not partitioned. How would you restructure it to make this query efficient on Hive?


Round 2 — Walmart Data Engineer Interview Questions: Spark & Pipeline Design (60 mins)

PySpark & Data Pipeline Architecture Round
This round tests your hands-on Spark and pipeline engineering skills. Walmart’s data engineering team works with billion-row retail datasets, so you must know how to handle data skew, optimize Spark jobs for Hive tables, implement efficient incremental loads, and design pipeline SLAs for time-critical reporting (store managers need overnight reports by 6 AM every day).
Questions 4
Duration 60 minutes
Difficulty Medium–Hard
Key Topics Data SkewSpark OptimizationIncremental ETLPartitioningSCD Type 2
Q3 — PySpark: Handle Data Skew in Daily Sales Aggregation
Hard⏱ 25 min

You are aggregating daily sales by store_id and sku_id from a 2 billion row transaction table. Spark keeps getting OOM errors on 3 reducers that handle “Electronics” SKUs — a single SKU (iPhone 16) contributes 40% of all rows. Diagnose the problem and fix it using PySpark.

🎯 What the interviewer tests
Data skew is the #1 Spark performance question in Walmart interviews. The interviewer wants: (1) Diagnosis — identify skew by checking partition sizes or using df.groupBy('sku_id').count(); (2) Fix — salting the skewed key or using Spark’s adaptive query execution (AQE) with spark.sql.adaptive.skewJoin.enabled=true; (3) Trade-offs — salting requires a map-side de-salt step and can’t be used everywhere. Many candidates know AQE but don’t know salting.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.sql.adaptive.skewJoin.enabled", "true") \
    .config("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5") \
    .config("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB") \
    .getOrCreate()

# Approach 1: Enable AQE (zero code change, handles skew at runtime)
# AQE splits skewed partitions automatically — best for simple aggregations

df = spark.read.parquet("s3://walmart-datalake/transactions/dt=2026-05-19/")

# Approach 2: Manual salting (when AQE is not enough or Spark < 3.0)
SALT = 20   # number of salt buckets

df_salted = df.withColumn(
    "salted_sku_id",
    F.concat(
        F.col("sku_id"),
        F.lit("_"),
        (F.rand() * SALT).cast("int").cast("string")
    )
)

agg_salted = df_salted.groupBy("store_id", "salted_sku_id", "txn_date") \
    .agg(
        F.sum("revenue").alias("revenue"),
        F.sum("units").alias("units"),
        F.sum("cogs").alias("cogs")
    )

# De-salt: strip the suffix and re-aggregate
agg_final = agg_salted \
    .withColumn("sku_id",
        F.split(F.col("salted_sku_id"), "_").getItem(0)
    ) \
    .groupBy("store_id", "sku_id", "txn_date") \
    .agg(
        F.sum("revenue").alias("total_revenue"),
        F.sum("units").alias("total_units"),
        F.sum("cogs").alias("total_cogs")
    )

agg_final.write.mode("overwrite") \
    .partitionBy("txn_date") \
    .parquet("s3://walmart-datalake/daily_sales_agg/")
💡 Key insight
Salting spreads one skewed key across SALT buckets — iPhone 16 rows get distributed across 20 tasks instead of 1. The cost: a second aggregation pass to recombine the 20 partial sums. AQE handles this automatically in Spark 3.0+ by detecting at runtime which partitions exceed the skew threshold and splitting them. Use AQE first (no code change); fall back to manual salting only if the skewed key is known in advance and AQE overhead is still causing timeouts in your SLA window.
🔁 Common Follow-Up Questions
  • You need to JOIN this sales table with a 500 MB product master table. Which join strategy (broadcast vs sort-merge) should you use and why?
  • The daily pipeline is running for 3 hours but has an SLA of 2 hours. What Spark metrics would you check first to identify the bottleneck?
Q4 — Design: Incremental ETL for Store Inventory with SCD Type 2
Medium⏱ 20 min

Walmart’s store inventory changes thousands of times per day (restock, markdown, shrink). Design an incremental ETL pipeline that loads only new/changed inventory records since the last run, and maintains SCD Type 2 history (so analysts can query “what was the inventory of SKU X in Store Y on a specific date”).

🎯 What the interviewer tests
This is the most common DE architecture question in Walmart interviews. The interviewer evaluates: (1) Do you understand SCD Type 2 — adding valid_from, valid_to, is_current columns? (2) Can you design an efficient change detection strategy — updated_at watermark or CDC (Change Data Capture) via Debezium? (3) Do you think about the MERGE/UPSERT operation — which DE tools support it (Delta Lake, Hudi, Iceberg) vs which don’t (plain Hive)?

Target table schema (SCD Type 2):

  • inventory_history(surrogate_key, store_id, sku_id, inventory_qty, unit_cost, valid_from, valid_to, is_current)
  • valid_to = NULL and is_current = TRUE for the active record

Incremental extraction (Airflow DAG, runs hourly):

  • Read last successful watermark from a pipeline_state table: SELECT max_updated_at FROM pipeline_state WHERE pipeline='inventory_etl'
  • Pull changed records: SELECT * FROM source_inventory WHERE updated_at > :watermark ORDER BY updated_at
  • Write to staging: stg_inventory_changes (no SCD logic yet, just raw deltas)

SCD Type 2 MERGE (using Delta Lake on Azure Data Lake):

from delta.tables import DeltaTable
from pyspark.sql import functions as F
from datetime import datetime

# Load current target and staged changes
target = DeltaTable.forPath(spark, "abfss://warehouse/inventory_history/")
staged = spark.table("stg_inventory_changes")

# Step 1: Close out changed records in target (set valid_to = now, is_current = False)
target.alias("tgt").merge(
    staged.alias("src"),
    "tgt.store_id = src.store_id AND tgt.sku_id = src.sku_id AND tgt.is_current = true"
).whenMatchedUpdate(
    condition = "tgt.inventory_qty != src.inventory_qty OR tgt.unit_cost != src.unit_cost",
    set = {
        "is_current": F.lit(False),
        "valid_to":   F.lit(datetime.utcnow())
    }
).execute()

# Step 2: Insert new records for changed rows (is_current = True)
new_records = staged.join(
    spark.table("inventory_history").filter("is_current = false AND valid_to >= current_timestamp() - interval 1 minute"),
    on=["store_id", "sku_id"],
    how="inner"
).select(
    F.monotonically_increasing_id().alias("surrogate_key"),
    "store_id", "sku_id", "inventory_qty", "unit_cost",
    F.current_timestamp().alias("valid_from"),
    F.lit(None).alias("valid_to"),
    F.lit(True).alias("is_current")
)
new_records.write.format("delta").mode("append").save("abfss://warehouse/inventory_history/")

# Step 3: Update watermark
spark.sql(f"UPDATE pipeline_state SET max_updated_at = '{datetime.utcnow()}' WHERE pipeline = 'inventory_etl'")
💡 Key insight
Delta Lake’s MERGE supports SCD Type 2 natively — it atomically closes old records and inserts new ones without a full table rewrite. Plain Hive requires a two-step OVERWRITE which has race conditions and no ACID guarantees. Walmart’s Walmart Labs team uses Delta Lake on Azure Data Lake for exactly this reason. The watermark update in Step 3 is idempotent — if the pipeline fails and retries, pulling records where updated_at > last_watermark will re-process recent changes safely (records that already exist in target will just trigger the “no update needed” branch of MERGE).
🔁 Common Follow-Up Questions
  • The source system doesn’t have an updated_at column. How do you detect changes for the incremental load? (Answer: CDC via Debezium reading the database transaction log)
  • How would a user query this SCD Type 2 table to find inventory as of a specific past date across all stores?


📚 Recommended Resources to Crack Walmart DE Interview

Used by 40,000+ engineers to crack data engineering and analytics interviews

⭐ Bestseller

2200 Most Asked Analytics Interview Questions
  • 2,200+ questions · 23 topics
  • SQL, Python, Data Engineering, Spark
  • Retail and e-commerce analytics patterns
For: DEs targeting retail, e-commerce, and product companies
₹1,999
₹7,999

Buy Now

🔥 SQL Focused

Ace Any SQL Interview
  • 220+ questions · SQL advanced, data models, DBMS
  • Window functions, CTEs, query optimization
  • Covers retail and supply chain query patterns
For: anyone who wants to crack any SQL round
₹799

Buy Now

4-Week Prep Plan: Crack Walmart Data Engineer Interview

WEEK 1

SQL + Spark Foundations + Walmart Product Context

Master window functions, rolling statistics, inventory analytics SQL. Study Spark architecture: RDDs vs DataFrames, partitioning, shuffles, AQE. Learn Walmart’s data landscape: POS data, supply chain, store hierarchy, e-commerce. Read TDM posts on data engineering interview patterns.

WEEK 2

2200 Most Asked Analytics Interview Questions — DE & SQL Sections

Cover SQL, Python, data engineering, and pipeline design chapters. Focus on Spark optimization, ETL patterns, and incremental load scenarios. Do 100+ questions per day. End of Week 2: First mock interview →

WEEK 3

Ace Any SQL Interview + Pipeline Architecture Deep Dive

Complete all 220+ questions in Ace Any SQL Interview. Practice 2 pipeline design problems per day — SCD Type 2, data skew handling, incremental ETL, Delta Lake MERGE. Study Airflow DAG design and error handling. End of Week 3: Second mock interview →

WEEK 4

Timed Practice + Final Revision

Revise first 5 chapters of 2200 ebook. Timed SQL: 15 min per query. Timed Spark design: 25 min per system design. Simulate Walmart’s 3-round pace. End of Week 4: Final mock interview →

📅 Book 1:1 mock — expert feedback from practitioners who’ve cracked DE interviews at tier-1 companies: Book Mock Interview
~950 questions + 3 mock interviews = enough to crack any Data Engineer role at a tier-1 retail or tech company. Or take the mentorship program below.

Scroll to Top