discursus Core — The Final Semantic Layer
Version 0.1 of the open source discursus Core project
Version 0.1 refines and implements the discursus semantic layer, which represents the skeleton of the protest movements we want to provide an abstraction for. This release completes the foundational implementation of our core entities.
Implementing a comprehensive semantic layer means that we:
- Document the abstraction of the protest phenomenon we’re using
- Document what our data warehouse’s ERD will look like
- Refactor our data warehouse’s entity layer to reflect that new semantic layer
- Generate LookML and Cube schemas to support the tools that represent our semantic layer
Here’s an updated design of our data platform.
Monitoring Protest Events with Data
First off, before we get into all the details, why are we building this? Well, the mission of discursus is to provide data-driven insights into protest movements, giving us a more comprehensive and objective view of which events are associated with a movement, their actors and their narratives.
I’m really excited to introduce a new protest monitoring dashboard that provides a high-level view of all protest movements happening around the world, in near real-time.
I often say that I’m the #1 user of discursus, and this new interactive environment makes exploring the platform’s data even more fun. I hope you’ll enjoy it as well.
About the semantic layer
The semantic layer is the skeleton of the phenomenon / domain we want to provide an abstraction for. It has entities and relationships, and those have attributes that change over time. What’s the full abstraction we’re trying to build for discursus, and what will the ERD look like after version 0.1?
The first image is an abstraction that would roughly represent the domain we’re trying to map with discursus. The top layer is the protest movement phenomenon itself, whereas the bottom layer is how that phenomenon is being reported.
We don’t have access to that top layer directly; we can only use observers as a proxy for what really happened. That’s where the whole challenge lies: how can we reconstruct the actual phenomenon using observer artifacts as our raw material? It also means we’ll eventually need to take observer biases into account.
The second image is what the data warehouse entities look like and how they relate to each other.
A few notes:
- Of course this is super basic for now, but I want to get the foundations right. Needless to say, we’re still missing some important dimensions and attributes.
- An important goal is to have a model that properly represents how dynamics change throughout a protest movement. For example, an actor might support a protest at some point, but gradually change stance over time (see the sketch after this list).
- I’m adding that “Observer” entity as I want to associate articles and social media posts with an individual, and eventually add an attribute capturing that reporter’s political inclinations, essentially how biased they are.
- I’m also adding the “Protest” entity, which associates multiple events together. This is driven by manual configurations for now, but I eventually want to use ML for that as well.
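To make that time dimension concrete, here’s a minimal sketch of time-bounded stance records. All names are hypothetical; this illustrates the modeling idea, not the actual warehouse code.

from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical record: an actor's stance toward a protest, valid for a
# bounded period of time. A change of stance closes one record (valid_to)
# and opens a new one.
@dataclass
class ActorStance:
    actor_id: str
    protest_id: str
    stance: str               # e.g. "support", "neutral", "oppose"
    valid_from: date
    valid_to: Optional[date]  # None means the stance is still current

stance_history = [
    ActorStance("actor_42", "protest_7", "support", date(2022, 5, 1), date(2022, 6, 15)),
    ActorStance("actor_42", "protest_7", "neutral", date(2022, 6, 15), None),
]

def stance_on(history, actor_id, protest_id, day):
    """Return an actor's stance toward a protest on a given day."""
    for record in history:
        if (record.actor_id == actor_id
                and record.protest_id == protest_id
                and record.valid_from <= day
                and (record.valid_to is None or day < record.valid_to)):
            return record.stance
    return None

With records like these, stance_on(stance_history, "actor_42", "protest_7", date(2022, 5, 10)) returns "support", while the same query for a day in July returns "neutral".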
Implementing the Semantic Layer with Droughty
For me, the semantic layer boils down to documenting a domain’s entities, attributes, metrics and relationships. Droughty (from Lewis Baker at Rittman Analytics) helps me do that easily and efficiently.
My development workflow now looks like the following:
- First, design the changes I want to add to my domain.
- Build and test the new entities, attributes, metrics and/or relationships.
- Materialize the changes.
- Create the semantic definitions with Droughty.
I get an up-to-date dbml definition by running droughty dbml. I get my Cube definitions by running droughty cube.
You can also use Droughty to automatically generate LookML and dbt tests. It’s just a super useful tool to add to your stack.
Other Improvements
Protest Grouping
A new entity that is now part of the semantic layer is protests_dim. This represents the protest movements in our abstraction graphic.
We currently query protest events manually, using criteria such as the country where an event occurred, the timeframe, keywords in article descriptions, etc. Until we introduce machine learning to group events together, we want a configuration engine that associates events with protest configs.
And this is where we introduced new components to our stack.
From a Google Sheet, we manually configure how we want to group protest events together. Those configurations are then sourced with Airbyte directly into Snowflake, and used to build our protests_dim entity, as sketched below.
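To illustrate the grouping logic, here’s a minimal sketch of how a config row could be matched against an event. The config fields and event shape are hypothetical, not the actual discursus implementation.

from dataclasses import dataclass
from datetime import date

# Hypothetical shape of a protest config row sourced from the Google Sheet.
@dataclass
class ProtestConfig:
    protest_name: str
    countries: list
    start_date: date
    end_date: date
    keywords: list

def event_matches(config: ProtestConfig, event: dict) -> bool:
    """True when an event satisfies all of a protest config's criteria."""
    description = event["article_description"].lower()
    return (
        event["country"] in config.countries
        and config.start_date <= event["event_date"] <= config.end_date
        and any(keyword in description for keyword in config.keywords)
    )

Every event that matches a config would be tagged with that config’s protest_name, which is what protests_dim then exposes to the dashboard.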
Now that we have those groupings, we can use them to select a specific protest movement to analyse in our monitoring dashboard.
Data Assets
In our data products diagram above, we have “data asset” boxes, which are the endpoints of each data product. Those are not abstract concepts, but real objects in Dagster that are useful for tracking attributes as well as triggering dependent transformations in other data products.
An improvement we’ve made in this release is to have all those endpoints materialized as data assets. We can now track the performance of their computation, as well as see how their attributes change over time.
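As a minimal sketch, assuming pandas DataFrames and hypothetical asset names, declaring an endpoint as a Dagster software-defined asset looks something like this:

from dagster import asset

# Declaring the endpoint as an asset means Dagster records every
# materialization, along with metadata we can then track over time.
@asset
def protests_dim(protest_groupings, events_fct):
    # Associate events with their protest groupings to build the dimension.
    return protest_groupings.merge(events_fct, on="protest_name")

Downstream assets that take protests_dim as a parameter are wired into the dependency graph, which is how one data product’s output can drive transformations in another.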
Performance Improvements
One thing we had meant to do for a while was to start using dbt incremental materializations for our largest models. But we needed a way to control whether jobs would run as an incremental run or a full-refresh run.
Using Dagster configs, we can now control how we want to run dbt ops, either from the Launchpad or from schedules:
from dagster import RunRequest, ScheduleEvaluationContext, schedule

@schedule(job=build_data_warehouse, cron_schedule="15 3,9,15,21 * * *")
def build_data_warehouse_schedule(context: ScheduleEvaluationContext):
    return RunRequest(
        run_key=None,
        run_config={
            # A single "ops" mapping configures all three dbt layers
            # for an incremental (non full-refresh) run.
            "ops": {
                "build_dw_staging_layer": {"config": {"full_refresh_flag": False}},
                "build_dw_integration_layer": {"config": {"full_refresh_flag": False}},
                "build_dw_warehouse_layer": {"config": {"full_refresh_flag": False}},
            }
        },
    )
As we can see in our Core Entities data warehouse job, our dbt ops now set a --full-refresh argument when full_refresh_flag is set to True.
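For illustration, here’s a sketch of how an op can turn that config flag into the dbt argument. The op name matches the run config above, but the body is a hypothetical sketch rather than the actual discursus code.

from dagster import op

@op(config_schema={"full_refresh_flag": bool})
def build_dw_warehouse_layer(context):
    # Translate the Dagster config flag into a dbt CLI argument.
    dbt_args = ["dbt", "run"]
    if context.op_config["full_refresh_flag"]:
        dbt_args.append("--full-refresh")
    context.log.info("Invoking: " + " ".join(dbt_args))
    # ...invoke dbt with dbt_args, e.g. through a dbt resource or subprocess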
Originally published at https://www.lantrns.co.