The Google Senior Business Intelligence Analyst role in Bangalore is one of the most data-intensive analyst positions in the industry. Google expects you to go beyond dashboards β they test your ability to drive product decisions using BigQuery SQL, Looker/LookML, Python, and dbt, while applying deep product intuition to YouTube, Search, Ads, and Maps data. This guide covers all four rounds with real questions from 2025β2026 interview experiences.
Looker
Python
dbt
Data Studio
SQL
Round 1 β BigQuery SQL
| Questions | 5 |
| Duration | 60 minutes |
| Difficulty | MediumβHard |
| Key Topics | Window FunctionsRetention CohortsFunnel AnalysisBot DetectionBigQuery Syntax |
The growth team wants to understand how well each weekly signup cohort retains over 30 days. For each signup week, compute the % of users who return and perform at least one action on Day 7, Day 14, and Day 30.
Tables:
- users(user_id, signup_date)
- events(user_id, event_date, event_type)
DATE_DIFF correctly for cohort offset and whether you pivot retention across multiple day windows without running three separate subqueries. Google interviewers also ask: “How would you partition this table for cost efficiency?” Mentioning _PARTITIONTIME or event_date partitioning immediately signals production BigQuery experience.WITH cohorts AS (
SELECT
u.user_id,
DATE_TRUNC(u.signup_date, WEEK) AS cohort_week,
e.event_date,
DATE_DIFF(e.event_date, u.signup_date, DAY) AS days_since_signup
FROM users u
JOIN events e USING (user_id)
WHERE e.event_date BETWEEN u.signup_date
AND DATE_ADD(u.signup_date, INTERVAL 30 DAY)
),
cohort_sizes AS (
SELECT cohort_week, COUNT(DISTINCT user_id) AS cohort_size
FROM cohorts
GROUP BY cohort_week
)
SELECT
c.cohort_week,
cs.cohort_size,
ROUND(100.0 * COUNT(DISTINCT CASE WHEN days_since_signup BETWEEN 6 AND 8 THEN c.user_id END) / cs.cohort_size, 1) AS day7_pct,
ROUND(100.0 * COUNT(DISTINCT CASE WHEN days_since_signup BETWEEN 13 AND 15 THEN c.user_id END) / cs.cohort_size, 1) AS day14_pct,
ROUND(100.0 * COUNT(DISTINCT CASE WHEN days_since_signup BETWEEN 29 AND 31 THEN c.user_id END) / cs.cohort_size, 1) AS day30_pct
FROM cohorts c
JOIN cohort_sizes cs USING (cohort_week)
GROUP BY c.cohort_week, cs.cohort_size
ORDER BY c.cohort_week;
DATE_TRUNC(..., WEEK) groups cohorts by Monday-start weeks. Always join back to cohort_sizes β otherwise the denominator silently excludes users who never returned.- How would you compute this if the events table has 10 billion rows β what partition strategy would you use?
- Day-30 retention is 8% but Day-7 is 40%. What does that shape tell you about the product?
- How would you visualise this cohort table in Looker as a heatmap?
The ads team needs to identify their top-decile advertisers by total spend in the last 90 days and show each advertiser’s rank, spend, and the % of total platform spend they account for.
Tables:
- ad_spend(advertiser_id, campaign_date, spend_usd)
NTILE(10) or PERCENT_RANK() and can explain the difference. They also want you to compute share-of-wallet (each advertiser’s % of total spend) β a common metric in Google Ads analytics. Expect: “What if an advertiser spent across 3 separate accounts β how do you handle that?”WITH spend_90d AS (
SELECT
advertiser_id,
SUM(spend_usd) AS total_spend
FROM ad_spend
WHERE campaign_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
GROUP BY advertiser_id
),
ranked AS (
SELECT
advertiser_id,
total_spend,
RANK() OVER (ORDER BY total_spend DESC) AS spend_rank,
NTILE(10) OVER (ORDER BY total_spend DESC) AS decile,
ROUND(100.0 * total_spend / SUM(total_spend) OVER (), 3) AS pct_of_total
FROM spend_90d
)
SELECT
advertiser_id,
total_spend,
spend_rank,
pct_of_total
FROM ranked
WHERE decile = 1
ORDER BY total_spend DESC;
SUM(total_spend) OVER () without a PARTITION BY computes the grand total across all rows β use this for share-of-wallet. NTILE(10) is the right choice here: it guarantees exactly 10% of advertisers in each bucket, whereas PERCENT_RANK() gives a continuous 0β1 score and doesn’t naturally create equally-sized buckets.
- 500+ SQL questions with solutions
- 300+ Python & stats questions
- Case study frameworks with 120+ examples
- Covers all top product companies
- 550 SQL questions across all difficulty levels
- BigQuery-compatible SQL patterns
- Window functions, cohort analysis, funnels
- Google-scale data problems
The YouTube team wants to identify which creator categories are growing watch-time fastest month-over-month. Compute MoM growth % for each category and flag categories with >20% MoM decline.
Tables:
- video_views(video_id, view_date, watch_seconds)
- videos(video_id, creator_id, category)
WITH monthly_watch AS (
SELECT
v.category,
DATE_TRUNC(vv.view_date, MONTH) AS month_start,
SUM(vv.watch_seconds) / 3600.0 AS total_watch_hours
FROM video_views vv
JOIN videos v ON vv.video_id = v.video_id
GROUP BY v.category, month_start
),
with_lag AS (
SELECT
category,
month_start,
total_watch_hours,
LAG(total_watch_hours) OVER (
PARTITION BY category
ORDER BY month_start
) AS prev_month_hours
FROM monthly_watch
)
SELECT
category,
month_start,
ROUND(total_watch_hours, 0) AS watch_hours,
ROUND(
100.0 * (total_watch_hours - prev_month_hours)
/ NULLIF(prev_month_hours, 0), 1
) AS mom_growth_pct,
CASE
WHEN (total_watch_hours - prev_month_hours) / NULLIF(prev_month_hours, 0) < -0.20
THEN 'β οΈ Declining'
ELSE 'OK'
END AS status
FROM with_lag
WHERE prev_month_hours IS NOT NULL
ORDER BY month_start DESC, mom_growth_pct ASC;
LAG() OVER (PARTITION BY category ORDER BY month_start) computes the prior-month value within each category β no self-join needed. Convert watch_seconds to hours before aggregating to keep numbers business-readable. The NULLIF guard is critical: a new category in month 2 with zero prior data would cause division by zero without it.- Gaming category dropped 25% in March β what external factors would you check first?
- How would you weight watch-time growth by creator count to avoid small categories skewing the ranking?
- How would you expose this data in Looker so category managers can self-serve the analysis?
The Search team wants to understand click-through rate (CTR = clicks / impressions) segmented by result position (1β10) and device type. Identify positionβdevice combinations where CTR is statistically below the platform average.
Table:
- search_results(query_date, query_id, position, device_type, impressions, clicks)
SUM(clicks)/SUM(impressions) rather than AVG(clicks/impressions) β a common error that gives incorrect results when impressions vary widely. They also test whether you can identify “below average” segments using a window function global average, rather than a hardcoded threshold.WITH ctr_by_segment AS (
SELECT
position,
device_type,
SUM(clicks) AS total_clicks,
SUM(impressions) AS total_impressions,
SAFE_DIVIDE(SUM(clicks), SUM(impressions)) AS segment_ctr
FROM search_results
WHERE query_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY position, device_type
),
with_platform_avg AS (
SELECT
*,
AVG(segment_ctr) OVER () AS platform_avg_ctr
FROM ctr_by_segment
)
SELECT
position,
device_type,
ROUND(segment_ctr * 100, 2) AS ctr_pct,
ROUND(platform_avg_ctr * 100, 2) AS platform_avg_pct,
ROUND((segment_ctr - platform_avg_ctr) * 100, 2) AS ctr_gap_pp,
CASE
WHEN segment_ctr < platform_avg_ctr * 0.8 THEN 'Under-performing'
ELSE 'Within range'
END AS performance_flag
FROM with_platform_avg
ORDER BY ctr_gap_pp ASC;
SAFE_DIVIDE is the BigQuery-idiomatic way to handle division by zero β it returns NULL instead of throwing an error, which is safer than NULLIF in a large production pipeline. Reporting CTR gap in percentage points (pp) is more actionable than relative lift β a PM can directly see “mobile position 4 is 3.2pp below average” and prioritise the fix.Flag user sessions that show bot-like behaviour β defined as: more than 3 standard deviations above the mean session event count, or more than 500 events in a session, or less than 1 second average time between events.
Table:
- session_events(session_id, user_id, event_ts TIMESTAMP, event_type)
WITH session_stats AS (
SELECT
session_id,
user_id,
COUNT(*) AS event_count,
TIMESTAMP_DIFF(MAX(event_ts), MIN(event_ts), SECOND) AS session_duration_secs,
SAFE_DIVIDE(
TIMESTAMP_DIFF(MAX(event_ts), MIN(event_ts), SECOND),
NULLIF(COUNT(*) - 1, 0)
) AS avg_secs_between_events
FROM session_events
GROUP BY session_id, user_id
),
platform_stats AS (
SELECT
AVG(event_count) AS mean_events,
STDDEV(event_count) AS std_events
FROM session_stats
)
SELECT
s.session_id,
s.user_id,
s.event_count,
s.avg_secs_between_events,
ROUND((s.event_count - p.mean_events) / NULLIF(p.std_events, 0), 2) AS z_score,
CASE
WHEN s.event_count > 500 THEN 'rule:high_volume'
WHEN s.avg_secs_between_events < 1 THEN 'rule:too_fast'
WHEN (s.event_count - p.mean_events) / NULLIF(p.std_events, 0) > 3 THEN 'stat:3sigma'
ELSE 'clean'
END AS bot_flag
FROM session_stats s
CROSS JOIN platform_stats p
WHERE
s.event_count > 500
OR s.avg_secs_between_events < 1
OR (s.event_count - p.mean_events) / NULLIF(p.std_events, 0) > 3
ORDER BY z_score DESC;
CROSS JOIN platform_stats β a single-row CTE β efficiently broadcasts the global mean and stddev to every session row without a subquery per row. Labelling the bot reason (rule vs stat) is critical for downstream analysis: rule-based flags catch obvious bots; stat-based flags catch sophisticated bots that stay under fixed thresholds but are still statistical outliers.- What precision/recall tradeoff would you accept for ad fraud detection vs content recommendation?
- How would you handle a legitimate power user (e.g., a developer testing an app) who triggers the 3Ο flag?
- How would you run this as an incremental dbt model on daily partitions instead of a full scan?
Round 2 β Python Analytics
| Questions | 4 |
| Duration | 45 minutes |
| Difficulty | MediumβHard |
| Key Topics | A/B TestingForecastingSegmentationPandasStatsmodels |
YouTube tested a new Shorts recommendation algorithm (variant B) on 200,000 users per group. Variant B shows 4.8% 7-day retention vs control’s 4.3%. Evaluate statistical significance and practical impact at Google scale.
from statsmodels.stats.proportion import proportions_ztest
import numpy as np
n = 200_000
conv_ctrl = 0.043 # 4.3%
conv_var = 0.048 # 4.8%
counts = np.array([conv_var * n, conv_ctrl * n])
nobs = np.array([n, n])
z, p = proportions_ztest(counts, nobs, alternative='larger')
print(f"Z-stat : {z:.3f}")
print(f"P-value : {p:.6f}")
print(f"Result : {'Significant β' if p < 0.05 else 'Not significant β'}")
# Relative and absolute lift
abs_lift = conv_var - conv_ctrl
rel_lift = abs_lift / conv_ctrl * 100
print(f"\nAbsolute lift : {abs_lift*100:.1f}pp")
print(f"Relative lift : {rel_lift:.1f}%")
# Google-scale business impact
monthly_users = 1_000_000_000
retained_delta = abs_lift * monthly_users
print(f"\nAt Google scale (1B users):")
print(f" Additional retained users/month : {retained_delta:,.0f}")
print(f" Estimated ARR impact (βΉ12/user) : βΉ{retained_delta * 12 / 1e9:.2f}B")
- How would you check if the lift is driven by novelty effect (week 1 spike that fades)?
- The algorithm change affects content recommendations β could this create a filter bubble concern?
- How would you compute the minimum detectable effect before designing the experiment?
The ads finance team needs a 30-day forward forecast of daily revenue. Use the last 180 days of data with proper seasonality handling (weekly + monthly cycles).
Prophet (the right tool) or try to use a simple moving average (wrong for seasonal data). They also test whether you can quantify forecast uncertainty with confidence intervals β a PM needs to know the range, not just the point estimate. Google additionally asks: “How would you detect when the forecast has gone stale?”import pandas as pd
from prophet import Prophet
# Prepare data in Prophet format
df_prophet = df[["date", "revenue"]].rename(
columns={"date": "ds", "revenue": "y"}
)
# Build model with weekly + yearly seasonality
model = Prophet(
weekly_seasonality=True,
yearly_seasonality=True,
changepoint_prior_scale=0.05, # conservative β avoids overfitting
interval_width=0.90 # 90% confidence interval
)
model.fit(df_prophet)
# Generate 30-day forecast
future = model.make_future_dataframe(periods=30)
forecast = model.predict(future)
# Extract key columns
result = forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail(30)
result.columns = ["date", "forecast", "lower_90ci", "upper_90ci"]
# Sanity check: flag days where range is > 40% of forecast (high uncertainty)
result["uncertainty_flag"] = (
(result["upper_90ci"] - result["lower_90ci"]) / result["forecast"] > 0.40
)
print(result.head(10))
changepoint_prior_scale=0.05 makes Prophet conservative about detecting trend changes β useful for ad revenue which has gradual, not sudden, trend shifts. The 90% CI rather than 95% is intentional for planning: you want a tighter range that’s still credible. Always flag high-uncertainty days β they indicate upcoming holidays or events where the model has little historical signal.Segment Google Maps users into Heavy, Moderate, and Light users based on weekly session frequency and navigation distance, to prioritise which segment to target for a new offline maps feature.
import pandas as pd
import numpy as np
# Compute weekly summary per user
user_weekly = (
df.groupby("user_id")
.agg(
avg_weekly_sessions = ("session_id", "nunique"),
avg_nav_km = ("navigation_km", "mean"),
total_searches = ("search_query", "count")
)
.reset_index()
)
# Normalise each feature to 0-1 for scoring
for col in ["avg_weekly_sessions", "avg_nav_km", "total_searches"]:
user_weekly[f"{col}_norm"] = (
(user_weekly[col] - user_weekly[col].min())
/ (user_weekly[col].max() - user_weekly[col].min())
)
# Composite engagement score (equal weights)
user_weekly["engagement_score"] = (
user_weekly["avg_weekly_sessions_norm"] * 0.40
+ user_weekly["avg_nav_km_norm"] * 0.35
+ user_weekly["total_searches_norm"] * 0.25
)
# Segment using percentile cutoffs
user_weekly["segment"] = pd.cut(
user_weekly["engagement_score"],
bins=[0, 0.33, 0.66, 1.0],
labels=["Light", "Moderate", "Heavy"],
include_lowest=True
)
print(user_weekly["segment"].value_counts(normalize=True).mul(100).round(1))
- How would you validate that “Heavy” users are genuinely high-value, not just high-frequency bots?
- The offline maps feature requires storage space β how would you factor device storage into the targeting model?
Compute Google Pay market penetration (active users / total smartphone users) per Indian state, identify states where penetration is below 20%, and calculate the potential user uplift if they reached the national average.
import pandas as pd
# df: state | active_users | smartphone_users
df["penetration_pct"] = (df["active_users"] / df["smartphone_users"] * 100).round(2)
national_avg = (df["active_users"].sum() / df["smartphone_users"].sum() * 100)
print(f"National penetration: {national_avg:.1f}%")
low_pen = df[df["penetration_pct"] < 20].copy()
low_pen["target_users"] = (national_avg / 100 * low_pen["smartphone_users"]).astype(int)
low_pen["uplift_users"] = low_pen["target_users"] - low_pen["active_users"]
print(
low_pen[["state","penetration_pct","uplift_users"]]
.sort_values("uplift_users", ascending=False)
.to_string(index=False)
)
print(f"\nTotal uplift potential: {low_pen['uplift_users'].sum():,} users")
total_active / total_smartphones (weighted), not mean(state_penetrations) (unweighted) β small states would distort the unweighted average. The uplift calculation shows the business value of market expansion in absolute user numbers, which is what growth teams budget against.
Round 3 β Product Analytics Case Study
| Questions | 1β2 |
| Duration | 45 minutes |
| Difficulty | Hard |
| Key Topics | Root Cause AnalysisMetric DesignMECE FrameworkCross-Product Effects |
- You find the drop is entirely in Android users on app version 11.2. What do you do next?
- How would you build a real-time alert that catches an 18% DAU drop within 4 hours?
- How would you present this investigation to the Maps India GM in a 3-slide deck?
Round 4 β BI Tools & Data Modeling
| Questions | 3 |
| Duration | 60 minutes |
| Difficulty | MediumβHard |
| Key Topics | LookML / Lookerdbt IncrementalStar SchemaBigQuery Optimization |
A YouTube creator partnership manager needs a self-serve Looker dashboard to track top creators by watch-time, subscriber growth, and revenue per creator category. Design the LookML model and explain the key design decisions.
# LookML View: creators
view: creators {
sql_table_name: `analytics.dim_creators` ;;
dimension: creator_id {
type: string
primary_key: yes
sql: ${TABLE}.creator_id ;;
}
dimension: creator_name { type: string; sql: ${TABLE}.creator_name ;; }
dimension: primary_category { type: string; sql: ${TABLE}.primary_category ;; }
dimension: country { type: string; sql: ${TABLE}.country ;; }
dimension: is_partner { type: yesno; sql: ${TABLE}.is_partner ;; }
}
# LookML View: video_performance (fact)
view: video_performance {
sql_table_name: `analytics.fact_video_daily` ;;
dimension_group: report {
type: time
timeframes: [date, week, month, quarter]
sql: ${TABLE}.report_date ;;
}
measure: total_watch_hours {
type: sum
sql: ${TABLE}.watch_seconds / 3600.0 ;;
value_format: "#,##0.0"
}
measure: total_revenue {
type: sum
sql: ${TABLE}.revenue_usd ;;
value_format: "$#,##0.00"
}
measure: subscriber_net_gain {
type: sum
sql: ${TABLE}.subscribers_gained - ${TABLE}.subscribers_lost ;;
}
measure: revenue_per_watch_hour {
type: number
sql: ${total_revenue} / NULLIF(${total_watch_hours}, 0) ;;
value_format: "$0.00"
}
}
# Explore: joins creators to performance
explore: video_performance {
label: "YouTube Creator Performance"
join: creators {
type: left_outer
sql_on: ${video_performance.creator_id} = ${creators.creator_id} ;;
relationship: many_to_one
}
}
many_to_one relationship when joining a fact table to a dimension β this tells Looker not to fan out the aggregation. The revenue_per_watch_hour derived measure uses Looker’s ${measure} references rather than raw SQL β this ensures it automatically respects all active filters and drill-down dimensions. Using a separate dim_creators table rather than the raw creators table gives you control over which fields to expose and allows adding business-logic columns (like is_partner) without modifying the source system.Design a BigQuery data warehouse schema to support daily Google Ads performance reporting β clicks, impressions, conversions, cost β across campaigns, ad groups, keywords, and geographic targets.
-- FACT TABLE (grain: keyword Γ ad_group Γ date)
-- BigQuery: PARTITION BY report_date, CLUSTER BY advertiser_id
CREATE TABLE fact_ads_performance (
report_date DATE, -- partition key
advertiser_id STRING, -- cluster key
campaign_key INT64, -- FK β dim_campaign
ad_group_key INT64, -- FK β dim_ad_group
keyword_key INT64, -- FK β dim_keyword (SCD Type 2)
geo_key INT64, -- FK β dim_geography
device_type STRING, -- mobile | desktop | tablet
impressions INT64,
clicks INT64,
conversions INT64,
cost_usd FLOAT64,
-- Derived metrics (pre-computed for Looker performance)
ctr FLOAT64, -- clicks / impressions
cpc FLOAT64, -- cost / clicks
conversion_rate FLOAT64 -- conversions / clicks
);
-- DIM: Campaign
CREATE TABLE dim_campaign (
campaign_key INT64,
campaign_id STRING,
campaign_name STRING,
campaign_type STRING, -- search | display | shopping | video
bidding_strategy STRING,
daily_budget_usd FLOAT64
);
-- DIM: Keyword (SCD Type 2 β tracks bid and match type changes)
CREATE TABLE dim_keyword (
keyword_key INT64, -- surrogate key
keyword_id STRING, -- natural key
keyword_text STRING,
match_type STRING, -- broad | phrase | exact
bid_usd FLOAT64,
valid_from DATE,
valid_to DATE,
is_current BOOL
);
-- DIM: Geography
CREATE TABLE dim_geography (
geo_key INT64,
country STRING,
state STRING,
city STRING,
geo_tier STRING -- tier1 | tier2 | tier3
);
PARTITION BY report_date CLUSTER BY advertiser_id β on a billion-row ads table, this reduces query cost by 80β90%.- How would you handle a keyword that belongs to multiple ad groups β does your schema support that?
- A new advertiser launches mid-month β how do you handle partial-month cohort analysis in this schema?
- At what point would you use a wide denormalized table instead of a star schema for BigQuery?
You have a raw ad click events table in BigQuery that receives 500M+ rows per day. Write a dbt incremental model that processes only new partitions each run, handles late-arriving data up to 3 days old, and builds the daily aggregated fact table.
merge incremental strategy (required for late-arriving data β append would create duplicates) and whether you correctly use is_incremental() to conditionally filter the source. Google SBIA candidates must understand why unique_key is needed and what happens without it.-- models/mart/fct_ads_daily.sql
{{
config(
materialized = 'incremental',
incremental_strategy = 'merge',
unique_key = ['report_date', 'advertiser_id', 'campaign_id', 'ad_id'],
partition_by = {'field': 'report_date', 'data_type': 'date'},
cluster_by = ['advertiser_id'],
on_schema_change = 'sync_all_columns'
)
}}
WITH raw_events AS (
SELECT
DATE(event_timestamp) AS report_date,
advertiser_id,
campaign_id,
ad_id,
event_type, -- 'click' | 'impression' | 'conversion'
revenue_usd
FROM {{ source('raw', 'ad_events') }}
{% if is_incremental() %}
-- Only process last 3 days to handle late-arriving events
WHERE DATE(event_timestamp) >= DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)
{% endif %}
)
SELECT
report_date,
advertiser_id,
campaign_id,
ad_id,
COUNTIF(event_type = 'impression') AS impressions,
COUNTIF(event_type = 'click') AS clicks,
COUNTIF(event_type = 'conversion') AS conversions,
SUM(CASE WHEN event_type = 'click'
THEN revenue_usd ELSE 0 END) AS revenue_usd,
SAFE_DIVIDE(
COUNTIF(event_type = 'click'),
NULLIF(COUNTIF(event_type = 'impression'), 0)
) AS ctr,
CURRENT_TIMESTAMP() AS dbt_updated_at
FROM raw_events
GROUP BY report_date, advertiser_id, campaign_id, ad_id
incremental_strategy = 'merge' with unique_key ensures that when late-arriving events come in (e.g., a click logged 2 days late), the row for that date is replaced not duplicated. The INTERVAL 3 DAY lookback covers Google’s SLA for late-arriving ad event data. Always include dbt_updated_at β it lets you audit when each row was last reprocessed and detect if a partition stopped being refreshed.- What happens if you use
appendstrategy instead ofmergefor this model? - How would you test this model with dbt tests to ensure no duplicate unique keys appear after a merge?
- At what point would you move from incremental to a full-refresh model β and what triggers that decision?
4-Week Study Plan for Google Senior BI Analyst Interview
Google tests the deepest combination of SQL, BI tooling, and product thinking in the industry. The 4-week plan below builds all three systematically.