Hugo Chinchilla

A DBT Journey

If you have a subscription based business you will realize that the best predictor of subscription renewal is the amount of use a user makes of your service. Here is an example of an e-learning platform based on lessons completed by month.

Lessons completed by month Renewal
0 12%
0-4 53%
4-12 66%
12+ 71%

Being able to monitor the user engagement with your platform will allow you to take actions to improve user retention, some examples of this actions can be gamification or sending push notifications. Being able to track the effect of changes over allow businesses to make data-driven decisions.

Here is an example of a dashboard showing how user interactions decrease over time after signup, being able to monitor this data is key to take actions to increase user satisfaction.

We were recently tasked with creating such dashboard using Google’s Looker Studio, here is the story of how we did it.

The data model

The task looks easy at first glance, until you realize that the data model is less than ideal for writing the queries that return the data we need.

In an ideal scenario, the subscriptions table would look something like this:

id user_id start end status
101489 21007 2022-10-12 2023-10-20 cancelled
108312 21014 2022-10-21 2024-03-08 expired
110266 21027 2022-10-25 null active

But instead, WooCommerce stores user subscriptions as custom post types, using a post meta table to add custom fields on them, as the start date of a subscription, the end date, the renewal date, etc.

Here is a subset of the posts table:

ID post_date post_status post_parent post_type post_title
2 2024-07-01 22:08:24 wc-active 1 shop_subscription Suscripción – 1 de July de 2024, 8:08 PM
1 2024-07-01 22:08:24 wc-completed 0 shop_order Order – julio 1, 2024 @ 10:08 PM

And here an example of some related post metas:

meta_id post_id meta_key meta_value
1 2 _schedule_start 2024-07-01 00:45:18
2 2 _schedule_end 2025-07-01 00:45:18
3 2 _customer_user 123

We can use the WITH clause to reformat this data in a friendlier manner, and then query over the created alias, but preparing all the information this way would make our queries very complicated and would result in lots of duplicated code for various reports. So instead we decided on using DBT to transform the data into materialized views.

We worked with DBT some months ago while working for another client (one of the best things of working on consultancy and having many clients is that you get exposed to lots of technologies).

This was a perfect use case for DBT, the original data model needed some transformations to make it usable, we can write some queries to transform the data and DBT will use this queries (now called models in DBT jargon) to build materialized views. Not only will this make our reports faster, we can define incremental strategies to import new data as time passes, not having to rebuild the views from scratch makes refreshing the data superfast.

DBT also allows us to write tests over our models, allowing us to validate assumptions, this allows us to make sure we have a good understanding of the data model and that our assumptions remain true in the future when the wordpress instance and it’s plugins get updates that may make changes to the data model. Writing tests, is one of our core-values, so of course we made good use of this feature.

Step 1: defining the subscription model

We created a simple model to unify all subscription data in a single row, making it easier to index it and query against it.

The model query looks something like this, in reality is a little bit more complex because we need to get some more data to create some filters in the final report, like categorizing the users by country, language, subscription modality, etc.

-- filename=subscriptions.sql

{{ config(materialized='table') }}


select
   subscription.ID as id,
   customer.meta_value as user_id,
   postmeta_start.meta_value as start,
   postmeta_end.meta_value as end,
   subscription.post_status as status
from wp_posts as subscription

-- join with order to filter out checkout carts that where left unfinished
inner join wp_posts as payment_order
   on subscription.post_parent = payment_order.ID
  and payment_order.post_status = 'wc-completed'

-- join with metas to get subscription start date
inner join wp_postmeta as postmeta_start
   on postmeta_start.meta_key = '_schedule_start'
  and postmeta_start.post_id = subscription.ID

-- join with metas to get subscription end date       
inner join wp_postmeta as postmeta_end
   on postmeta_end.meta_key = '_schedule_end'
  and postmeta_end.post_id = subscription.ID

-- join with metas to get the user ID that purchased the subscription       
inner join wp_postmeta as customer
   on customer.meta_key = '_customer_user'
  and customer.post_id = subscription.ID

-- only subscriptions with valid status stored as wordpress posts   
where subscription.post_type = 'shop_subscription'
 and subscription.post_status IN ('wc-active', 'wc-cancelled', 'wc-expired', 'wc-pending-cancel')

Step 2: create the activity model

To get the engagement report we also need to know the user activity, we need to find traces of users using the e-learning platform. Fortunately, learndash uses custom tables instead of trying to cram all the information into custom post types as wooCommerce does, so it’s much easier to get what we need.

-- filename=activities.sql

{{ config(materialized='incremental') }}


select
   user_id,
   -- learndash stores dates as timestamps that need to be converted to the same date format as subscriptions
   date(from_unixtime(activity_completed)) as date
from wp_learndash_user_activity
where activity_completed > 0

{% if is_incremental() %}
  and activity_completed >= (
    -- get greatest event date already imported into the materialized table
    select unix_timestamp(coalesce(max(date), '2000-01-01')) from {{ this }}
  )
{% endif %}

wohoooo!, we now have the user signups and the user activities, we are ready to create the report, aren’t we?

Step 3: create a date spine

Well, not quite. Remember the table shown above? We need to group the users in weekly cohorts and the activities into weekly groups over the next weeks after the signup.

It has to show the number of users that purchased a subscription each week and then monitor how many of them are still active after 12 weeks, seeing the numbers week after week.

To aggregate the data by week we will create a date spine table. A date spine is just a table filled with dates that we can use to group our signups and activities by date periods in order to count them.

I will create this date spine with the following query:

# filename=calendar_weeks.sql

{{ config(materialized='table') }}


select
   min(gen_date) as first_day,
   date_add(min(gen_date), interval 6 day) as last_day,
   floor(datediff(gen_date, '2018-01-01')/7) as week_id
from
(select adddate('1970-01-01',t4*10000 + t3*1000 + t2*100 + t1*10 + t0) gen_date from
(select 0 t0 union select 1 union select 2 union select 3 union select 4 union select 5 union select 6 union select 7 union select 8 union select 9) t0,
(select 0 t1 union select 1 union select 2 union select 3 union select 4 union select 5 union select 6 union select 7 union select 8 union select 9) t1,
(select 0 t2 union select 1 union select 2 union select 3 union select 4 union select 5 union select 6 union select 7 union select 8 union select 9) t2,
(select 0 t3 union select 1 union select 2 union select 3 union select 4 union select 5 union select 6 union select 7 union select 8 union select 9) t3,
(select 0 t4 union select 1 union select 2 union select 3 union select 4 union select 5 union select 6 union select 7 union select 8 union select 9) t4) v
where gen_date between '2018-01-01' and '2025-12-31'
group by week_id
order by first_day asc

This will produce a table with an ever-increasing week identifier and the date ranges for every week. We need the week number to be something that does not reset between years, and we also want to be able to subtract the ids to get the amount of weeks between ranges that’s why we did not use the week().

first_day last_day week_id
2018-01-01 2018-01-07 0
2018-01-08 2018-01-14 1
2018-01-15 2018-01-21 2
2018-01-22 2018-01-28 3
2018-01-29 2018-02-04 4

Step 4: aggregate data by weekly periods

Now we need to put all the data together.

-- filename=engagement_by_week.sql

with new_users as (
    select subscriptions.id,
           cal.first_day,
           null as weeks_after_signup
    from
         as subscriptions,
         as cal
    where subscriptions.start between cal.first_day and cal.last_day
      and cal.last_day < curdate()
    group by 1,2,3
),
active_users as (
    select
        subscription.id,
        subscription_periods.first_day,
        (activity_periods.week_id - subscription_periods.week_id) as weeks_after_signup
    from
         as activity,
         as activity_periods,
         as subscription,
         as subscription_periods
    where activity.date between activity_periods.first_day and activity_periods.last_day
      and subscription.start between subscription_periods.first_day and subscription_periods.last_day
      and activity.user_id = subscription.user_id
      and activity.date >= subscription.start
      -- limit data to last whole day
      and subscription_periods.last_day < curdate()
      -- get only activities after the subscription started
      and activity_periods.week_id >= subscription_periods.week_id
      -- only during the next 12 weeks after signup
      and (activity_periods.week_id - subscription_periods.week_id) <= 12
    group by 1,2,3
)

This model will generate the data ready for Looker to import it and create a pivot table.

Step 5: render the table with SQL

Explaining how to create the pivot table in Looker Studio is a bit complex, instead I will show you how to use DBT to create a table with the same information (it will lack the filters but you can modify it as you need).

-- filename=engagement_by_week_cohort_view.sql

select
    first_day,
    -- create a column with the total signups in a period
    sum(case when isnull(weeks_after_signup) then 1 else 0 end) as new_users,
    -- create the columns for next following 12 weeks
    {% for i in range(1,13) %}
        -- show zeros as null
        nullif(
            -- count users that had at least one activity during the period
            sum(case when weeks_after_signup = {{ i }} then 1 else 0 end), 0
        ) as week_{{ i }}
        -- add a comma only if not last loop
        {{ ", " if not loop.last else "" }}
    {% endfor %}
from
    {{ ref('engagement_by_week') }}
group by first_day
order by first_day asc

Well, this sums it all.

This is an oversimplification of the real project, there are lots of nuances and extra requirements in the project to allow filtering by many dimensions to gather the required business intelligence that was requested. This post only want to be a showcase of how DBT can be leveraged to extract information from any model and make it into a more suitable reporting format.

If you want to make something similar, contact us. We can help you build your BI dashboards to help you make your business thrive.