If you are targeting the Microsoft Senior Data Analyst role, this guide covers all four interview rounds β Advanced SQL, Python analytics, product case studies, and Power BI + Azure Synapse β with real questions, model answers, and exactly what each interviewer is evaluating. Questions are sourced from 2024β2026 interview experiences across Microsoft Bangalore.
Power BI
Python
Azure Synapse
DAX
Excel
Round 1 β Advanced SQL
| Questions | 5 |
| Duration | 60 minutes |
| Difficulty | MediumβHard |
| Key Topics | Window FunctionsSelf-JoinRolling AveragesGap DetectionT-SQL Syntax |
The product team wants a smoothed view of daily engagement for Microsoft Teams. Calculate the 7-day rolling average of daily active users to remove weekend dips.
Table: user_activity(activity_date, user_id)
ROWS BETWEEN vs RANGE BETWEEN? Most candidates write a subquery loop β the window function approach is 10Γ faster. They will also ask: “What does the result look like on the first 6 days?”WITH daily_dau AS (
SELECT
activity_date,
COUNT(DISTINCT user_id) AS dau
FROM user_activity
GROUP BY activity_date
)
SELECT
activity_date,
dau,
AVG(dau) OVER (
ORDER BY activity_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS rolling_7d_avg
FROM daily_dau
ORDER BY activity_date;
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW includes exactly 7 rows (today + 6 prior days) regardless of gaps. Use ROWS not RANGE here β RANGE compares values, not row positions, which gives wrong results if dates are missing.- The first 6 days will have averages based on fewer than 7 data points β is that acceptable? How would you handle it?
- How would you calculate a 28-day rolling average alongside the 7-day one in the same query?
- How would you adapt this for a partitioned approach (per product, per region)?
A classic Microsoft interview question β find all employees whose salary exceeds their direct manager’s salary.
Table: employees(employee_id, name, salary, manager_id)
SELECT
e.name AS employee_name,
e.salary AS employee_salary,
m.name AS manager_name,
m.salary AS manager_salary
FROM employees e
JOIN employees m
ON e.manager_id = m.employee_id
WHERE e.salary > m.salary;
e (employee) and m (manager), then join on manager_id = employee_id. Use INNER JOIN here intentionally β employees with no manager (CEO) don’t have a comparison to make.
- 500+ SQL questions with solutions
- 300+ Python & stats questions
- Case study frameworks with 120+ examples
- Covers all top product companies
- 550 SQL questions across all difficulty levels
- T-SQL specific patterns & syntax
- Window functions, recursive CTEs, gaps
- Microsoft & enterprise SQL patterns
The finance team wants a month-by-month revenue report showing absolute growth and percentage change vs the prior month.
Table: orders(order_date DATE, revenue DECIMAL)
WITH monthly_rev AS (
SELECT
DATEADD(MONTH, DATEDIFF(MONTH, 0, order_date), 0) AS month_start,
SUM(revenue) AS total_revenue
FROM orders
GROUP BY DATEADD(MONTH, DATEDIFF(MONTH, 0, order_date), 0)
)
SELECT
month_start,
total_revenue,
LAG(total_revenue) OVER (ORDER BY month_start) AS prev_month_revenue,
ROUND(
100.0 * (total_revenue - LAG(total_revenue) OVER (ORDER BY month_start))
/ NULLIF(LAG(total_revenue) OVER (ORDER BY month_start), 0),
1
) AS mom_growth_pct
FROM monthly_rev
ORDER BY month_start;
NULLIF(..., 0) prevents division-by-zero when prior month revenue is zero. DATEDIFF(MONTH, 0, order_date) is the T-SQL idiom for month-start truncation β equivalent to DATE_TRUNC('month', ...) in PostgreSQL.- How would you show a 3-month moving average of growth % alongside the monthly figure?
- If a month has zero revenue (e.g. holiday shutdown), how should the % growth be displayed?
The telemetry pipeline occasionally misses days. Identify all dates between the first and last record date that have no data in the daily_metrics table.
Table: daily_metrics(metric_date DATE, value INT)
generate_series(). The recursive CTE approach is the expected answer.WITH date_range AS (
SELECT MIN(metric_date) AS start_date,
MAX(metric_date) AS end_date
FROM daily_metrics
),
all_dates AS (
-- Recursive CTE to generate every date in range
SELECT start_date AS dt FROM date_range
UNION ALL
SELECT DATEADD(DAY, 1, dt)
FROM all_dates
WHERE dt < (SELECT end_date FROM date_range)
)
SELECT a.dt AS missing_date
FROM all_dates a
LEFT JOIN daily_metrics d ON a.dt = d.metric_date
WHERE d.metric_date IS NULL
OPTION (MAXRECURSION 3650); -- allow up to 10 years
OPTION (MAXRECURSION 3650) hint is required β SQL Server defaults to 100 recursion levels, which only covers ~3 months. Always mention this to the interviewer β it shows production awareness.- How would you modify this to find gaps longer than 3 consecutive missing days?
- Could you use a Numbers table instead of a recursive CTE? Which is faster at scale?
The growth team wants to identify “power users” β defined as the top 10% of users by total session time in the past 30 days. Used for targeted feature rollout and referral campaigns.
Table: sessions(user_id, session_date DATE, duration_seconds INT)
WITH user_totals AS (
SELECT
user_id,
SUM(duration_seconds) AS total_seconds,
NTILE(10) OVER (ORDER BY SUM(duration_seconds) DESC) AS decile
FROM sessions
WHERE session_date >= DATEADD(DAY, -30, GETDATE())
GROUP BY user_id
)
SELECT
user_id,
total_seconds,
ROUND(total_seconds / 3600.0, 1) AS total_hours
FROM user_totals
WHERE decile = 1
ORDER BY total_seconds DESC;
NTILE(10) divides users into 10 equal buckets β decile 1 is the top 10%. This is cleaner than computing a manual percentile cutoff. GETDATE() is the T-SQL equivalent of CURRENT_DATE in standard SQL.
Round 2 β Python Analytics
| Questions | 4 |
| Duration | 45 minutes |
| Difficulty | MediumβHard |
| Key Topics | PandasA/B TestingRFM AnalysisJSON ParsingStatistical Validation |
Segment customers into Champions, At-Risk, and Churned using Recency, Frequency, and Monetary scores. Microsoft uses this for Azure subscription renewals and Office 365 upsell targeting.
import pandas as pd
from datetime import datetime
snapshot_date = datetime(2025, 5, 1)
rfm = (
df.groupby("customer_id").agg(
recency = ("order_date", lambda x: (snapshot_date - x.max()).days),
frequency = ("order_id", "nunique"),
monetary = ("revenue", "sum")
).reset_index()
)
# Score each dimension 1-5 (5 = best)
rfm["r_score"] = pd.qcut(rfm["recency"], 5, labels=[5,4,3,2,1])
rfm["f_score"] = pd.qcut(rfm["frequency"].rank(method="first"), 5, labels=[1,2,3,4,5])
rfm["m_score"] = pd.qcut(rfm["monetary"], 5, labels=[1,2,3,4,5])
rfm["rfm_score"] = (
rfm["r_score"].astype(int)
+ rfm["f_score"].astype(int)
+ rfm["m_score"].astype(int)
)
def segment(score):
if score >= 13: return "Champion"
if score >= 9: return "Loyal"
if score >= 6: return "At Risk"
return "Churned"
rfm["segment"] = rfm["rfm_score"].apply(segment)
print(rfm["segment"].value_counts())
pd.qcut creates equal-frequency bins β better than equal-width bins for skewed revenue data. Recency is inverted (lower days = better = score 5). Use rank(method="first") on frequency to handle ties in qcut.A new Teams onboarding flow (variant B) shows 5.2% conversion vs control’s 4.8%. With 10,000 users per group, is this lift statistically significant at 95% confidence?
from statsmodels.stats.proportion import proportions_ztest
import numpy as np
n_control = 10_000
n_variant = 10_000
conv_ctrl = 0.048 # 4.8%
conv_var = 0.052 # 5.2%
count = np.array([conv_var * n_variant, conv_ctrl * n_control])
nobs = np.array([n_variant, n_control])
z_stat, p_value = proportions_ztest(count, nobs, alternative='larger')
print(f"Z-statistic : {z_stat:.3f}")
print(f"P-value : {p_value:.4f}")
print(f"Significant : {'Yes β' if p_value < 0.05 else 'No β'}")
# Minimum detectable effect sanity check
lift = (conv_var - conv_ctrl) / conv_ctrl * 100
print(f"Relative lift: {lift:.1f}%")
- How would you determine the minimum sample size needed before launching the test?
- What if the test ran for only 3 days during a holiday week β would you trust the results?
- How would you explain this result to a VP who doesn’t know what a p-value is?
Microsoft Graph API returns deeply nested JSON. Flatten the response into a clean DataFrame with one row per user event.
import pandas as pd
import json
# Simulated Graph API response
raw = '''
{
"value": [
{"userId": "u1", "displayName": "Alice",
"activities": [{"type": "login", "ts": "2025-05-01"},
{"type": "file_edit", "ts": "2025-05-02"}]},
{"userId": "u2", "displayName": "Bob",
"activities": [{"type": "login", "ts": "2025-05-01"}]}
]
}'''
data = json.loads(raw)
rows = []
for user in data["value"]:
for activity in user.get("activities", []):
rows.append({
"user_id" : user["userId"],
"name" : user["displayName"],
"event_type" : activity["type"],
"event_ts" : activity["ts"],
})
df = pd.DataFrame(rows)
df["event_ts"] = pd.to_datetime(df["event_ts"])
print(df)
pd.json_normalize for deeply nested structures β a manual loop gives more control and handles missing keys gracefully. Always parse timestamps with pd.to_datetime() immediately β raw strings cause bugs in every downstream operation.Calculate step-by-step funnel conversion rates for a Teams feature rollout. Steps: Invited β Clicked β Activated β Retained (used again in 7 days).
funnel = df.groupby("funnel_step")["user_id"].nunique().reset_index()
funnel.columns = ["step", "users"]
# Order steps manually
step_order = ["Invited", "Clicked", "Activated", "Retained"]
funnel["step"] = pd.Categorical(funnel["step"], categories=step_order, ordered=True)
funnel = funnel.sort_values("step").reset_index(drop=True)
top = funnel.loc[0, "users"]
funnel["pct_of_top"] = (funnel["users"] / top * 100).round(1)
funnel["step_dropoff"] = funnel["users"].pct_change().mul(100).round(1)
print(funnel[["step","users","pct_of_top","step_dropoff"]])
pd.Categorical with ordered=True ensures steps sort correctly regardless of alphabetical order.
Round 3 β Product Analytics Case Study
| Questions | 1β2 |
| Duration | 45 minutes |
| Difficulty | Hard |
| Key Topics | Metric DesignRoot Cause AnalysisMECE FrameworkExperiment Design |
Use the MECE framework β structure the problem before touching data:
- The drop is concentrated in mobile in India β what does that suggest and what would you investigate next?
- How would you set up an ongoing alerting system so this is caught within 2 hours instead of a week?
- How would you present your findings in a 5-minute slot to the Bing GM?
Round 4 β Power BI, DAX and Azure Synapse
| Questions | 3 |
| Duration | 60 minutes |
| Difficulty | MediumβHard |
| Key Topics | Power BI / DAXDirectQuery vs ImportAzure SynapseStar Schema |
When would you use a DAX calculated measure vs a calculated column in Power BI? Write both versions for “Revenue Per User” and explain the performance difference.
-- CALCULATED COLUMN (row context, stored in model)
-- Adds a physical column to the table β computed at refresh time
Revenue Per User Column =
DIVIDE(Sales[revenue], Sales[user_count], 0)
-- CALCULATED MEASURE (filter context, computed at query time)
-- Evaluated dynamically based on slicer/visual filters
Revenue Per User =
DIVIDE(
SUM(Sales[revenue]),
DISTINCTCOUNT(Sales[user_id]),
0
)
| Calculated Column | Calculated Measure | |
|---|---|---|
| Computed | At data refresh | At query time |
| Stored in | Model (increases file size) | Not stored |
| Context | Row context (per row) | Filter context (per visual) |
| Use when | Filtering/slicing by value | Aggregating across visuals |
| Performance | Slower refresh, fast query | Fast refresh, computed live |
Your Power BI dashboard serves 300 daily users on top of Azure Synapse. The dashboard data updates every 4 hours. Which connection mode would you choose and why?
| Factor | Import Mode | DirectQuery |
|---|---|---|
| Data freshness | Scheduled refresh (min 30 min) | Real-time (live query) |
| Performance | Sub-second (in-memory VertiPaq) | Depends on Synapse performance |
| Data size limit | 1 GB compressed (Premium: larger) | No limit |
| Synapse cost | One refresh query per cycle | 300 users Γ queries per session |
| DAX features | Full DAX support | Some DAX functions restricted |
| Best for | Historical, high-traffic dashboards | Real-time ops, very large data |
Design a data warehouse schema to track daily Office 365 feature usage (Word, Excel, Teams, Outlook) across tenants, enabling product adoption analytics.
-- FACT TABLE (grain = one row per user per product per day)
fact_usage (
usage_id BIGINT PRIMARY KEY,
date_key INT FK β dim_date,
user_key INT FK β dim_user,
product_key INT FK β dim_product,
tenant_key INT FK β dim_tenant,
sessions INT,
active_minutes INT,
files_created INT,
files_shared INT
)
-- DIMENSION: Date
dim_date (date_key, full_date, year, quarter, month, week, weekday, is_holiday)
-- DIMENSION: User (SCD Type 2 for role changes)
dim_user (user_key, user_id, email, department, role, country, is_current)
-- DIMENSION: Product
dim_product (product_key, product_name, product_family, tier)
-- DIMENSION: Tenant (customer organisation)
dim_tenant (tenant_key, tenant_id, industry, size_band, region, subscription_plan)
dim_user to track role changes over time (employees change departments). Always add a dim_tenant for multi-tenant SaaS products β it enables cohort analysis by customer segment.- How would you handle a user who belongs to two tenants simultaneously?
- How would you model a “feature flag” dimension β some users have Teams Copilot, others don’t?
- At what point would you consider a snowflake schema instead of a star schema?
4-Week Study Plan for Microsoft Senior DA Interview
Microsoft evaluates breadth + depth β you need both SQL/Python skills AND product thinking. This plan builds both in parallel.