Moonfrog Interview Questions | Queries
Consider the following database schema for an e-commerce website: Given the following schema of tables:
Event table (no index/key) – contains event data recorded when an event occurs:
event_name [VARCHAR],
user_id [VARCHAR],
time [TIMESTAMP]Transaction table (no index / key) – contains transaction data for item_id [VARCHAR],
quantity [INTEGER],
price [NUMERIC],
user_id [VARCHAR],
time [TIMESTAMP]
Provide detailed queries to retrieve the answers to the following questions:-
-The average time between first visit and first purchase across users. [consider event name ‘visit’]
-Median and 80 percentile time between first visit and first purchase across users. [consider event name ‘visit’]
– The average time between first visit and purchase per item_id.
– Median and 80 percentile time between first visit and purchase item per item_id.
– Given an item_id [consider item_id ‘shoe_01’], retrieve an ordered list of items that are most likely to be purchased by a user who purchased the item.
Answers ( 5 )
1.
SELECT t.user_id, AVG(timestampdiff(second, e.mintime, t.mintime))
FROM (SELECT user_id, min(timestamp) AS mintime
FROM event e
where event_name= ‘visit’
GROUP BY user_id
) e JOIN
(SELECT user_id, min(time) AS mintime
FROM trans t
GROUP BY user_id
) t
ON e.user_id = t.user_id
GROUP BY t.user_id;
2.
select t.user_id,
percentile_cont(0.80) within group(order by timestampdiff(second, e.mintime, t.mintime) over (partition by t.user_id) as percentile_cont_80,
percentile_cont(0.50) within group(order by timestampdiff(second, e.mintime, t.mintime) over (partition by t.user_id) as median
FROM (SELECT user_id, min(timestamp) AS mintime
FROM event e
where event_name= ‘visit’
GROUP BY user_id
) e JOIN
(SELECT user_id, min(time) AS mintime
FROM trans t
GROUP BY user_id
) t
ON e.user_id = t.user_id
GROUP BY t.user_id;
3. Same codes will be for the rest of the problems (instead of select t.user_id, it will be t.item_id)
Hi Ognish
Don’t you think the question is asking for Avg, Median and 80 Percentile as a whole and not for each user.
I was confused regarding the same but seeing the granularity of data I feel it should be at an overall level.
One other thing to support this claim is that in your case, all three (Avg, Median and 80th Percentile) are anyways going to be same for each user.
Here are my answers to first 4 questions. Please note, I have combined answers 1,2 and 3,4 together as they had almost identical queries and same logic.
Answers 1 & 2.
WITH first_visit
AS (SELECT user_id
,MIN(time) AS first_visit
FROM event
WHERE event = ‘visit’
GROUP BY user_id
)
,first_purchase
AS (SELECT user_id
,MIN(time) AS first_purchase
FROM transaction
GROUP BY user_id
)
SELECT
AVG( DATEDIFF(‘hour’, v.first_visit, p.first_purchase) / 24) AS avg_time
,PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY ( DATEDIFF(‘hour’, v.first_visit, p.first_purchase) / 24)) AS median
,PERCENTILE_DISC(0.8) WITHIN GROUP (ORDER BY ( DATEDIFF(‘hour’, v.first_visit, p.first_purchase) / 24)) AS 80_percentile
FROM first_purchase p JOIN first_visit v ON p.user_id = v.first_visit
3 & 4.
WITH first_visit
AS (SELECT user_id
,MIN(time) AS first_visit
FROM event
WHERE event = ‘vist’
GROUP BY user_id
)
,first_purchase
AS (SELECT user_id
,item_id
,MIN(time) AS first_purchase_per_item
FROM transaction
GROUP BY user_id ,item_id
)
SELECT
AVG( DATEDIFF(‘hour’, v.first_visit, p.first_purchase_per_item) / 24) AS avg_time
,PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY ( DATEDIFF(‘hour’, v.first_visit, p.first_purchase_per_item) / 24)) AS median
,PERCENTILE_DISC(0.8) WITHIN GROUP (ORDER BY ( DATEDIFF(‘hour’, v.first_visit, p.first_purchase_per_item) / 24)) AS 80_percentile
FROM first_purchase p JOIN first_visit v ON p.user_id = v.first_visit
Hi Himanshu, even I thought the same, but in the question, it is “The average time between first visit and first purchase across users” – across users? Can you please shed some light on it? I am a bit confused I am thinking across users as individual users, that’s why wrote the query that way
In the 3rd and 4th part we need to calculate percentile at user id and item id level. We need to partition the result on user id and item id
1 and 2
select avg(diff), percentile_disc(0.5) within group (order by diff asc),
percentile_disc(0.8) within group (order by diff asc)
from
(
select id, (min(ttime)-min(etime)) diff
from
(select t.user_id id , t.time ttime, e.time etime
from trans t join event e on t.user_id = e.user_id
order by id
)a
group by id
) b
3
select id, itemid, avg(diff)
from
(
select id, itemid, min(ttime), min(etime), (min(ttime)-min(etime)) diff
from
(select t.user_id id , t.item_id itemid, t.time ttime, e.time etime
from trans t join event e on t.user_id = e.user_id
order by id, itemid
)a
group by id, itemid
order by id, itemid
) b
4 and 5
select id, itemid, percentile_disc(0.5) within group (order by diff asc) ,
percentile_disc(0.8) within group (order by diff asc)
from
(
select id, itemid, min(ttime), min(etime), (min(ttime)-min(etime)) diff
from
(select t.user_id id , t.item_id itemid, t.time ttime, e.time etime
from trans t join event e on t.user_id = e.user_id
order by id, itemid
)a
group by id, itemid
order by id, itemid
) b
For 3rd and 4th I think we need metrics at item level
3)select itemid, avg(diff)
from
(
select id, itemid, min(ttime), min(etime), (min(ttime)-min(etime)) diff
from
(select t.user_id id , t.item_id itemid, t.time ttime, e.time etime
from trans t join event e on t.user_id = e.user_id
order by id, itemid
)a
group by itemid
order by itemid
) b
4)select distinct itemid, percentile_disc(0.5) within group (order by diff asc) over (partition by itemid),
percentile_disc(0.8) within group (order by diff asc) over (partition by itemid)
from
(
select id, itemid, min(ttime), min(etime), (min(ttime)-min(etime)) diff
from
(select t.user_id id , t.item_id itemid, t.time ttime, e.time etime
from trans t join event e on t.user_id = e.user_id
order by id, itemid
)a
group by id, itemid
order by id, itemid
) b
5)
with a as
(select t.user_id id , t.item_id itemid, t.time ttime, e.time etime
from trans t join event e on t.user_id = e.user_id
order by id, itemid
)
select a1.itemid
from a a1
where a1.id in (select distinct id from a where a.itemid =’shoe_01′)
group by a.itemid
order by count(*) desc
1 and 2
select avg(diff), percentile_disc(0.5) within group (order by diff asc),
percentile_disc(0.8) within group (order by diff asc)
from
(
select id, (min(ttime)-min(etime)) diff
from
(select t.user_id id , t.time ttime, e.time etime
from trans t join event e on t.user_id = e.user_id
order by id
)a
group by id
) b
3
select id, itemid, avg(diff)
from
(
select id, itemid, min(ttime), min(etime), (min(ttime)-min(etime)) diff
from
(select t.user_id id , t.item_id itemid, t.time ttime, e.time etime
from trans t join event e on t.user_id = e.user_id
order by id, itemid
)a
group by id, itemid
order by id, itemid
) b
4 and 5
select id, itemid, percentile_disc(0.5) within group (order by diff asc) ,
percentile_disc(0.8) within group (order by diff asc)
from
(
select id, itemid, min(ttime), min(etime), (min(ttime)-min(etime)) diff
from
(select t.user_id id , t.item_id itemid, t.time ttime, e.time etime
from trans t join event e on t.user_id = e.user_id
order by id, itemid
)a
group by id, itemid
order by id, itemid
) b