Home β€Ί Business Intelligence Engineer β€Ί Walmart Data Engineer Interview Questions and Answers 2026
Business Intelligence Engineer Case Study Company Interview Questions Interview Questions

Walmart Data Engineer Interview Questions and Answers 2026

If you are targeting a Walmart Data Engineer role (Walmart Global Tech / Walmart Labs), you are interviewing for one of the world’s largest retail data engineering environments β€” 10,500+ stores, 240 million weekly customers, and petabytes of POS, e-commerce, and supply chain data. The Walmart Data Engineer Interview Questions test advanced SQL for retail scenarios, PySpark for large-scale ETL, Hive/HDFS architecture, Airflow pipeline orchestration, and your ability to handle massive data skew from high-GMV product categories. This guide covers all three rounds with exact questions, complete answers, and the frameworks Walmart interviewers score on.

Walmart
Data Engineer
Retail Data Engineering / Supply Chain Analytics
SQL
Python
Apache Spark
Hive
Airflow
HDFS
Salary Range
β‚Ή15–28 LPA
2–6 Years Exp

Round 1 β€” Walmart Data Engineer Interview Questions: SQL & Retail Analytics (60 mins)

SQL Round (Retail Data Focus)
Walmart’s SQL round tests your ability to work with retail data at scale β€” transactions, inventory, store hierarchies, supplier shipments. The interviewer looks for efficient query patterns: partitioning awareness, avoiding full scans, handling data skew from Tier 1 products (top 1% of SKUs drive 40% of GMV). You are expected to explain execution plans, not just write queries.
Questions 5
Duration 60 minutes
Difficulty Medium–Hard
Key Topics Inventory TurnoverStore PerformanceSupplier AnalyticsWindow FunctionsSlowly Changing Dims
Q1 β€” Store Sales Seasonality: Identify Week-Over-Week Outliers
Hard⏱ 20 min

The store operations team wants to flag stores whose weekly sales deviated by more than 2 standard deviations from their own 12-week rolling average. These are either exceptional performers (promotions success) or underperformers (supply issues). Return store, week, actual sales, rolling average, rolling std, z-score, and flag direction (ABOVE/BELOW).

Table:

  • store_weekly_sales(store_id, week_start, department, total_sales)
🎯 What the interviewer tests
Rolling statistics with window functions. The trap: AVG() OVER (... ROWS BETWEEN 11 PRECEDING AND CURRENT ROW) includes the current week in the mean, which makes the z-score calculation slightly circular. The correct approach is to compute the rolling stats over the PRIOR 12 weeks (ROWS BETWEEN 12 PRECEDING AND 1 PRECEDING) and then compare current week against them. Also: STDDEV_POP vs STDDEV_SAMP β€” the interviewer will ask which to use (STDDEV_SAMP for small n).

WITH rolling AS (
    SELECT
        store_id,
        week_start,
        total_sales,
        AVG(total_sales) OVER (
            PARTITION BY store_id
            ORDER BY week_start
            ROWS BETWEEN 12 PRECEDING AND 1 PRECEDING
        )  AS rolling_avg_12w,
        STDDEV_SAMP(total_sales) OVER (
            PARTITION BY store_id
            ORDER BY week_start
            ROWS BETWEEN 12 PRECEDING AND 1 PRECEDING
        )  AS rolling_std_12w,
        COUNT(*) OVER (
            PARTITION BY store_id
            ORDER BY week_start
            ROWS BETWEEN 12 PRECEDING AND 1 PRECEDING
        )  AS window_size
    FROM store_weekly_sales
),
zscore AS (
    SELECT
        store_id,
        week_start,
        total_sales,
        ROUND(rolling_avg_12w, 2)                                       AS rolling_avg,
        ROUND(rolling_std_12w, 2)                                       AS rolling_std,
        ROUND((total_sales - rolling_avg_12w)
              / NULLIF(rolling_std_12w, 0), 2)                          AS z_score,
        window_size
    FROM rolling
    WHERE window_size >= 8    -- need at least 8 weeks of history for reliable stats
)
SELECT
    store_id,
    week_start,
    total_sales,
    rolling_avg,
    rolling_std,
    z_score,
    CASE WHEN z_score > 2  THEN 'ABOVE_NORMAL'
         WHEN z_score < -2 THEN 'BELOW_NORMAL'
    END  AS flag
FROM zscore
WHERE ABS(z_score) > 2
ORDER BY week_start DESC, ABS(z_score) DESC;
πŸ’‘ Key insight
ROWS BETWEEN 12 PRECEDING AND 1 PRECEDING excludes the current row from the rolling window, making the z-score non-circular (comparing current week to a baseline computed entirely from prior data). The WHERE window_size >= 8 filter prevents false alarms for new stores with <8 weeks of history β€” a 2Οƒ threshold with 2 data points is meaningless. NULLIF(rolling_std_12w, 0) handles the rare case where a store had identical sales every week (std = 0 β†’ division by zero).
πŸ” Common Follow-Up Questions
  • You find that 30% of stores are flagged as outliers in December every year. How would you adjust the model to account for seasonal spikes?
  • This query runs on a Hive table with 5 years of data across 4,000 stores. What partition strategy would you use to avoid a full scan?
Q2 β€” Inventory Turnover Ratio: Identify Slow-Moving SKUs
Medium⏱ 15 min

Compute the inventory turnover ratio (COGS / average inventory value) for each SKU in each store for last quarter. Flag SKUs with turnover ratio below the 10th percentile within their category β€” these are slow-moving items that need markdown or redistribution decisions.

Tables:

  • inventory_snapshots(snapshot_id, store_id, sku_id, category, inventory_value, snapshot_date)
  • sales_transactions(txn_id, store_id, sku_id, units_sold, cogs, txn_date)

WITH quarter_cogs AS (
    SELECT
        store_id,
        sku_id,
        SUM(cogs)  AS total_cogs
    FROM sales_transactions
    WHERE txn_date >= DATE_TRUNC('quarter', CURRENT_DATE - INTERVAL '3 months')
      AND txn_date <  DATE_TRUNC('quarter', CURRENT_DATE)
    GROUP BY store_id, sku_id
),
avg_inventory AS (
    SELECT
        store_id,
        sku_id,
        category,
        AVG(inventory_value)  AS avg_inv_value
    FROM inventory_snapshots
    WHERE snapshot_date >= DATE_TRUNC('quarter', CURRENT_DATE - INTERVAL '3 months')
      AND snapshot_date <  DATE_TRUNC('quarter', CURRENT_DATE)
    GROUP BY store_id, sku_id, category
),
turnover AS (
    SELECT
        a.store_id,
        a.sku_id,
        a.category,
        a.avg_inv_value,
        c.total_cogs,
        c.total_cogs / NULLIF(a.avg_inv_value, 0)  AS turnover_ratio
    FROM avg_inventory a
    LEFT JOIN quarter_cogs c USING (store_id, sku_id)
),
with_pctile AS (
    SELECT
        *,
        PERCENTILE_CONT(0.10) WITHIN GROUP (ORDER BY turnover_ratio)
            OVER (PARTITION BY category)  AS p10_turnover
    FROM turnover
)
SELECT
    store_id,
    sku_id,
    category,
    ROUND(avg_inv_value, 2)  AS avg_inv_value,
    ROUND(total_cogs, 2)     AS total_cogs,
    ROUND(turnover_ratio, 3) AS turnover_ratio,
    ROUND(p10_turnover, 3)   AS category_p10,
    'SLOW_MOVING'            AS flag
FROM with_pctile
WHERE turnover_ratio < p10_turnover
   OR turnover_ratio IS NULL   -- zero sales = definitely slow-moving
ORDER BY category, turnover_ratio NULLS FIRST;
πŸ’‘ Key insight
PERCENTILE_CONT(0.10) WITHIN GROUP (ORDER BY ...) OVER (PARTITION BY category) computes the 10th percentile per category using the ordered-set aggregate syntax β€” this is cleaner than a self-join. SKUs with zero COGS (no sales) get turnover_ratio = NULL from the LEFT JOIN β€” they are explicitly flagged in the WHERE turnover_ratio IS NULL clause since they are the worst slow-movers (zero velocity). At Walmart, turnover below the 10th percentile triggers an automated markdown recommendation workflow.
πŸ” Common Follow-Up Questions
  • A SKU is slow-moving in Store A but the top-selling item in Store B. How would you factor cross-store performance before recommending a markdown?
  • The inventory_snapshots table is 800 GB and is not partitioned. How would you restructure it to make this query efficient on Hive?


Round 2 β€” Walmart Data Engineer Interview Questions: Spark & Pipeline Design (60 mins)

PySpark & Data Pipeline Architecture Round
This round tests your hands-on Spark and pipeline engineering skills. Walmart’s data engineering team works with billion-row retail datasets, so you must know how to handle data skew, optimize Spark jobs for Hive tables, implement efficient incremental loads, and design pipeline SLAs for time-critical reporting (store managers need overnight reports by 6 AM every day).
Questions 4
Duration 60 minutes
Difficulty Medium–Hard
Key Topics Data SkewSpark OptimizationIncremental ETLPartitioningSCD Type 2
Q3 β€” PySpark: Handle Data Skew in Daily Sales Aggregation
Hard⏱ 25 min

You are aggregating daily sales by store_id and sku_id from a 2 billion row transaction table. Spark keeps getting OOM errors on 3 reducers that handle “Electronics” SKUs β€” a single SKU (iPhone 16) contributes 40% of all rows. Diagnose the problem and fix it using PySpark.

🎯 What the interviewer tests
Data skew is the #1 Spark performance question in Walmart interviews. The interviewer wants: (1) Diagnosis β€” identify skew by checking partition sizes or using df.groupBy('sku_id').count(); (2) Fix β€” salting the skewed key or using Spark’s adaptive query execution (AQE) with spark.sql.adaptive.skewJoin.enabled=true; (3) Trade-offs β€” salting requires a map-side de-salt step and can’t be used everywhere. Many candidates know AQE but don’t know salting.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.sql.adaptive.skewJoin.enabled", "true") \
    .config("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5") \
    .config("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB") \
    .getOrCreate()

# Approach 1: Enable AQE (zero code change, handles skew at runtime)
# AQE splits skewed partitions automatically β€” best for simple aggregations

df = spark.read.parquet("s3://walmart-datalake/transactions/dt=2026-05-19/")

# Approach 2: Manual salting (when AQE is not enough or Spark < 3.0)
SALT = 20   # number of salt buckets

df_salted = df.withColumn(
    "salted_sku_id",
    F.concat(
        F.col("sku_id"),
        F.lit("_"),
        (F.rand() * SALT).cast("int").cast("string")
    )
)

agg_salted = df_salted.groupBy("store_id", "salted_sku_id", "txn_date") \
    .agg(
        F.sum("revenue").alias("revenue"),
        F.sum("units").alias("units"),
        F.sum("cogs").alias("cogs")
    )

# De-salt: strip the suffix and re-aggregate
agg_final = agg_salted \
    .withColumn("sku_id",
        F.split(F.col("salted_sku_id"), "_").getItem(0)
    ) \
    .groupBy("store_id", "sku_id", "txn_date") \
    .agg(
        F.sum("revenue").alias("total_revenue"),
        F.sum("units").alias("total_units"),
        F.sum("cogs").alias("total_cogs")
    )

agg_final.write.mode("overwrite") \
    .partitionBy("txn_date") \
    .parquet("s3://walmart-datalake/daily_sales_agg/")
πŸ’‘ Key insight
Salting spreads one skewed key across SALT buckets β€” iPhone 16 rows get distributed across 20 tasks instead of 1. The cost: a second aggregation pass to recombine the 20 partial sums. AQE handles this automatically in Spark 3.0+ by detecting at runtime which partitions exceed the skew threshold and splitting them. Use AQE first (no code change); fall back to manual salting only if the skewed key is known in advance and AQE overhead is still causing timeouts in your SLA window.
πŸ” Common Follow-Up Questions
  • You need to JOIN this sales table with a 500 MB product master table. Which join strategy (broadcast vs sort-merge) should you use and why?
  • The daily pipeline is running for 3 hours but has an SLA of 2 hours. What Spark metrics would you check first to identify the bottleneck?
Q4 β€” Design: Incremental ETL for Store Inventory with SCD Type 2
Medium⏱ 20 min

Walmart’s store inventory changes thousands of times per day (restock, markdown, shrink). Design an incremental ETL pipeline that loads only new/changed inventory records since the last run, and maintains SCD Type 2 history (so analysts can query “what was the inventory of SKU X in Store Y on a specific date”).

🎯 What the interviewer tests
This is the most common DE architecture question in Walmart interviews. The interviewer evaluates: (1) Do you understand SCD Type 2 β€” adding valid_from, valid_to, is_current columns? (2) Can you design an efficient change detection strategy β€” updated_at watermark or CDC (Change Data Capture) via Debezium? (3) Do you think about the MERGE/UPSERT operation β€” which DE tools support it (Delta Lake, Hudi, Iceberg) vs which don’t (plain Hive)?

Target table schema (SCD Type 2):

  • inventory_history(surrogate_key, store_id, sku_id, inventory_qty, unit_cost, valid_from, valid_to, is_current)
  • valid_to = NULL and is_current = TRUE for the active record

Incremental extraction (Airflow DAG, runs hourly):

  • Read last successful watermark from a pipeline_state table: SELECT max_updated_at FROM pipeline_state WHERE pipeline='inventory_etl'
  • Pull changed records: SELECT * FROM source_inventory WHERE updated_at > :watermark ORDER BY updated_at
  • Write to staging: stg_inventory_changes (no SCD logic yet, just raw deltas)

SCD Type 2 MERGE (using Delta Lake on Azure Data Lake):

from delta.tables import DeltaTable
from pyspark.sql import functions as F
from datetime import datetime

# Load current target and staged changes
target = DeltaTable.forPath(spark, "abfss://warehouse/inventory_history/")
staged = spark.table("stg_inventory_changes")

# Step 1: Close out changed records in target (set valid_to = now, is_current = False)
target.alias("tgt").merge(
    staged.alias("src"),
    "tgt.store_id = src.store_id AND tgt.sku_id = src.sku_id AND tgt.is_current = true"
).whenMatchedUpdate(
    condition = "tgt.inventory_qty != src.inventory_qty OR tgt.unit_cost != src.unit_cost",
    set = {
        "is_current": F.lit(False),
        "valid_to":   F.lit(datetime.utcnow())
    }
).execute()

# Step 2: Insert new records for changed rows (is_current = True)
new_records = staged.join(
    spark.table("inventory_history").filter("is_current = false AND valid_to >= current_timestamp() - interval 1 minute"),
    on=["store_id", "sku_id"],
    how="inner"
).select(
    F.monotonically_increasing_id().alias("surrogate_key"),
    "store_id", "sku_id", "inventory_qty", "unit_cost",
    F.current_timestamp().alias("valid_from"),
    F.lit(None).alias("valid_to"),
    F.lit(True).alias("is_current")
)
new_records.write.format("delta").mode("append").save("abfss://warehouse/inventory_history/")

# Step 3: Update watermark
spark.sql(f"UPDATE pipeline_state SET max_updated_at = '{datetime.utcnow()}' WHERE pipeline = 'inventory_etl'")
πŸ’‘ Key insight
Delta Lake’s MERGE supports SCD Type 2 natively β€” it atomically closes old records and inserts new ones without a full table rewrite. Plain Hive requires a two-step OVERWRITE which has race conditions and no ACID guarantees. Walmart’s Walmart Labs team uses Delta Lake on Azure Data Lake for exactly this reason. The watermark update in Step 3 is idempotent β€” if the pipeline fails and retries, pulling records where updated_at > last_watermark will re-process recent changes safely (records that already exist in target will just trigger the “no update needed” branch of MERGE).
πŸ” Common Follow-Up Questions
  • The source system doesn’t have an updated_at column. How do you detect changes for the incremental load? (Answer: CDC via Debezium reading the database transaction log)
  • How would a user query this SCD Type 2 table to find inventory as of a specific past date across all stores?


πŸ“š Recommended Resources to Crack Walmart DE Interview

Used by 40,000+ engineers to crack data engineering and analytics interviews

⭐ Bestseller

2200 Most Asked Analytics Interview Questions
  • 2,200+ questions Β· 23 topics
  • SQL, Python, Data Engineering, Spark
  • Retail and e-commerce analytics patterns
For: DEs targeting retail, e-commerce, and product companies
β‚Ή1,999
β‚Ή7,999

Buy Now

πŸ”₯ SQL Focused

Ace Any SQL Interview
  • 220+ questions Β· SQL advanced, data models, DBMS
  • Window functions, CTEs, query optimization
  • Covers retail and supply chain query patterns
For: anyone who wants to crack any SQL round
β‚Ή799

Buy Now

4-Week Prep Plan: Crack Walmart Data Engineer Interview

WEEK 1

SQL + Spark Foundations + Walmart Product Context

Master window functions, rolling statistics, inventory analytics SQL. Study Spark architecture: RDDs vs DataFrames, partitioning, shuffles, AQE. Learn Walmart’s data landscape: POS data, supply chain, store hierarchy, e-commerce. Read TDM posts on data engineering interview patterns.

WEEK 2

2200 Most Asked Analytics Interview Questions β€” DE & SQL Sections

Cover SQL, Python, data engineering, and pipeline design chapters. Focus on Spark optimization, ETL patterns, and incremental load scenarios. Do 100+ questions per day. End of Week 2: First mock interview β†’

WEEK 3

Ace Any SQL Interview + Pipeline Architecture Deep Dive

Complete all 220+ questions in Ace Any SQL Interview. Practice 2 pipeline design problems per day β€” SCD Type 2, data skew handling, incremental ETL, Delta Lake MERGE. Study Airflow DAG design and error handling. End of Week 3: Second mock interview β†’

WEEK 4

Timed Practice + Final Revision

Revise first 5 chapters of 2200 ebook. Timed SQL: 15 min per query. Timed Spark design: 25 min per system design. Simulate Walmart’s 3-round pace. End of Week 4: Final mock interview β†’

πŸ“… Book 1:1 mock β€” expert feedback from practitioners who’ve cracked DE interviews at tier-1 companies: Book Mock Interview
~950 questions + 3 mock interviews = enough to crack any Data Engineer role at a tier-1 retail or tech company. Or take the mentorship program below.

Scroll to Top