How I use Dagster to orchestrate the production of social science data assets

Olivier Dupuis · RepublicOfData.io · Nov 9, 2022

I’ve said in the past how fortunate we are as analytics engineering practitioners to be shaping and expanding a space in the data ecosystem that holds many possibilities. Amongst them is the thrill of using those superpowers to create data products that can disrupt established practices in other fields.

Most of us do exceptional work daily in the business world, stitching together sophisticated data platforms to source, transform, and serve analytics to the different areas of a business that thrive on the dissemination of that knowledge.

But I also believe there are opportunities elsewhere, beyond the business realm. This has been core to the discursus project, which has two missions:

  • Continuously deliver data on protest movements throughout North America.
  • Serve as a public, open-source playground for applying data platform engineering to the social sciences.

The project went through an important refactoring exercise over the past weeks, months even. Our architectural goals remain the same: deliver fresh, high-quality data, allow flexibility in how the platform's output is consumed, and scale.

With that in mind, Dagster's reconceptualization around software-defined assets has been an important paradigm shift. We are not pipeline engineers, but data asset engineers. That influences how we design platforms and orchestrate data transformations.

What are software-defined assets?

Many of us have been introduced to the modern data stack through dbt. Once you understand the simplicity and power of DAGs, and how they drive the transformation of data through multiple layers of modeling, your mind starts applying that pipeline thinking to all data tasks.

But is that really the value that we bring to the table? Are end-users interested in pipelines? Most probably not.

End users care about the assets you are producing. Those could be warehouse tables, notebook reports, app interfaces, or a good old CSV file. As data platform engineers, those are the goods we are bringing to the table, regardless of how clever our engineering is behind the scenes.

So it makes sense that a data platform orchestrator is structured around the value it is producing. And that’s where Dagster’s software-defined assets come in.
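To make that concrete, here is a minimal sketch of a software-defined asset in Dagster's Python API; the asset name and payload are hypothetical, not taken from the discursus codebase:

```python
from dagster import asset

# A software-defined asset: the function name becomes the asset's name,
# and the returned value is what the configured I/O manager persists.
@asset
def protest_events():
    # Hypothetical logic: a real platform would pull this from a source.
    return [{"event_id": 1, "location": "Montreal"}]
```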

Instead of monitoring and reporting on jobs that were essentially pipelines…

[Image: Monitoring jobs]

…we now orchestrate around assets. That means we are now running and monitoring the outputs of our pipelines.

[Image: Graph of assets production]

The DAG above is structured around assets, not transformation steps. That means that if I want to materialize an asset in my “prepared_sources” group, the orchestrator knows it should first refresh the upstream assets in the “sources” group.
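In code, that upstream relationship is declared simply by referencing one asset from another. A minimal sketch, again with hypothetical asset names standing in for the real discursus ones:

```python
from dagster import asset, materialize

@asset(group_name="sources")
def gdelt_mentions():
    # Hypothetical source asset: raw records landed from the source.
    return [{"url": "https://example.com/article"}, None]

@asset(group_name="prepared_sources")
def gdelt_mentions_cleaned(gdelt_mentions):
    # Declaring `gdelt_mentions` as a parameter makes it an upstream
    # dependency: Dagster knows to materialize it before this asset.
    return [m for m in gdelt_mentions if m is not None]

# Materialize both assets in dependency order (handy for local runs/tests).
if __name__ == "__main__":
    materialize([gdelt_mentions, gdelt_mentions_cleaned])
```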

Orchestrating social science data assets

What does this all have to do with delivering data assets for the social sciences?

To me, the social science domain differs from our usual online business domain in that its data notoriously becomes complex, fast. Because of its volume, primarily unstructured format, and subjective nature, social science data needs a high level of attention and curation in its transformation process.

Social artifacts are plentiful (think of the Twitter firehose). But actually sourcing that data, then cleaning, enhancing, and transforming it, is usually a more complex process than dealing with Segment-produced product events, for example.

The social science domain requires us to build around the assets we produce through a complex sequence of transformations, and to have the means to monitor each asset to validate that its output meets our requirements.

For the discursus project, with only a single source of data so far (the GDELT project), we have four phases of asset transformations, and our platform now lets us closely monitor their output. A code sketch of those groups follows the list below.

[Image: discursus asset groups]
  • The first group of assets (sources) mines source data and transfers it to a data lake.
  • The second group of assets (prepared_sources) cleans up and enhances our source data. We covered that step in a previous post.
  • The third group of assets (data_warehouse) stages data in our cloud data warehouse and builds the entities that our data apps will consume.
  • And finally, the fourth group of assets (data_apps) automates the production of the data apps consumed by our end users.
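Here is what a skeleton of those four groups could look like as Dagster assets; the asset names below are hypothetical placeholders for the project's real ones:

```python
from dagster import asset

@asset(group_name="sources")
def gdelt_articles():
    """Mine GDELT source data and land it in the data lake."""

@asset(group_name="prepared_sources")
def gdelt_articles_enhanced(gdelt_articles):
    """Clean up and enhance the raw source data."""

@asset(group_name="data_warehouse")
def protest_entities(gdelt_articles_enhanced):
    """Stage data in the warehouse and build consumable entities."""

@asset(group_name="data_apps")
def protest_data_app(protest_entities):
    """Package warehouse entities into the data apps end users consume."""
```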

All those assets are materialized through a DAG that is defined by interdependencies between assets and triggered by job schedules or sensors.
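A sketch of how such a trigger can be wired up, reusing the hypothetical group names above; define_asset_job, AssetSelection, and ScheduleDefinition are the standard Dagster building blocks for this:

```python
from dagster import AssetSelection, ScheduleDefinition, define_asset_job

# A job that materializes every asset in the two upstream groups.
refresh_sources_job = define_asset_job(
    name="refresh_sources",
    selection=AssetSelection.groups("sources", "prepared_sources"),
)

# Trigger it hourly; a sensor could instead react to external events,
# such as new files landing in the data lake.
refresh_sources_schedule = ScheduleDefinition(
    job=refresh_sources_job,
    cron_schedule="0 * * * *",
)
```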

We end up with a catalogue of assets…

[Image: Catalogue of assets]

… from which we can explore each asset’s metadata…

[Image: Asset metadata]

…consult their definition…

[Image: Asset definition]

…and their lineage with other assets.

[Image: Asset lineage]

Defining your data platform's assets as code makes the value you are producing more tangible. At the end of the day, it doesn't change your output, just the paradigm through which you produce those assets.

Expanding our practice to tame complexity

If dbt helped us tame the building of data warehouses, we can now rely on an orchestrator to go beyond the data warehouse + BI layer. Dagster allows data platforms to evolve, scale, and ensure the delivery of high-quality data assets.

It becomes a framework from which you can define, execute, and monitor how you source, clean, enhance, stage, transform, warehouse, package, and serve data. Those steps produce and manipulate data assets, which are the tangible value we are delivering as data platform engineers.
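As a final sketch, and assuming a recent Dagster release (older ones used the @repository decorator), the hypothetical assets, job, and schedule from the earlier examples can be wired into a single deployable object:

```python
from dagster import Definitions

# Wire the hypothetical assets, job, and schedule from the sketches above
# into one object that a Dagster deployment can load.
defs = Definitions(
    assets=[gdelt_articles, gdelt_articles_enhanced,
            protest_entities, protest_data_app],
    jobs=[refresh_sources_job],
    schedules=[refresh_sources_schedule],
)
```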

Social sciences are not only challenging playgrounds but also domains where we can have a substantial impact as data platform engineers. By shifting our orchestration paradigm to the value we bring to the table, data assets, we are laying a foundation from which we can tame the domain's complexity and evolve our practice.
