Quite recently we’ve built event analytics for our team and thought to share this experience with you in this post .
Many of “out-of-the-box” analytics solutions come with automatically defined user sessions. It’s good to start with, but as your company grows, you’ll want to have your own session definitions based on your event data. Analyzing user sessions with SQL gives you flexibility and full control over how metrics are defined for your unique business.
The session is usually defined as a group of interactions that one user executes within a given time frame on your app. Usually, that time frame defaults to 30 minutes. One session includes any events which a user completes on your app before leaving, for example, visiting pages, downloading materials, performing actions.
Having sessions in place, we’ll be able to answer questions like:
To define a user session we need to have an event table in our database with at least user_id and timestamp.
Usually, you will have a lot of additional information in the event tables, such as event type, name, device info, referrer, and much more. All these properties are going to be very useful to give more context to our sessions and to build things, such as an attribution model.
Note: In this post, we’re going to use window functions, so the following example will not work with MySQL. Also, all these queries are dedicated to Redshift database.
The first step for defining user sessions with SQL is to find out the difference between the timestamp of the current row and the previous one, by user_id. We will use LAG function to accomplish it. This will give us an inactivity time between events.
, DATEDIFF(minutes, lag(timestamp) over (PARTITION BY user_id order by timestamp), timestamp) as inactivity_time
The inactivity_time for the first event is NULL, since it’s the first event and we don’t have anything before it.
We can use inactivity_time to group events into sessions based on 30 minute intervals of inactivity. First, we’ll select all events where inactivity_time is either NULL, or more than 30 minutes, which means it is the first event in the session.
Based on this first event, we define session_start_at, which is the timestamp of the first event. We use ROW_NUMBER function to calculate session sequence, which is used in session_id.
event.user_id || '-' || row_number() over(partition by event.user_id order by event.timestamp) as session_id
, event.timestamp as session_start_at
, lead(timestamp) over(partition by event.user_id order by event.timestamp) as next_session_start_at
, DATEDIFF(minutes, LAG(e.timestamp) OVER(PARTITION BY e.user_id ORDER BY e.timestamp), e.timestamp) AS inactivity_time
FROM events AS e
) as event
WHERE (event.inactivity_time > 30 OR event.inactivity_time is null)
We can save this table as sessions Data Mart to use it in our future queries.
Once we have this table, it’s easy to answer user analytics questions we outlined in the beginning. For example, to calculate average session duration we can use the following SQL.
COUNT(*) AS sessions_count,
AVG(duration) AS average_session_duration
, DATEDIFF(minutes, MIN(events.timestamp), MAX(events.timestamp)) AS duration
LEFT JOIN events on events.user_id = sessions.user_id
AND events.timestamp >= events.session_start_at
AND (events.timestamp < sessions.next_session_start_at OR sessions.next_session_start_at is null)
GROUP BY 1
As you see, we join events table to sessions to map every event to its session. It allows us to get the end of each session, which is the max timestamp of the event within the given session. More complex calculations of session duration could optionally include a window of inactivity as well.
To read original blog, click here.