Data Preparation in the Modern Data Stack
Can we talk about preparing your data? I know… not the most interesting aspect of your data platform, right? But I think we can all agree on how important it is. Data is messy! You’ve got missing values, typos, and irrelevant data points, or maybe you want to add classifications, extract entities from unstructured data, etc.
It’s important… yet mostly under-served for analytics engineers.
“These are the types of features that we used to have back in the days of Oracle and enterprise business intelligence suites. The modern data stack is slowly rediscovering all of this but it’s not new, it’s a set of needs that are just not served anymore”
Mark Rittman, CEO of Rittman Analytics
What happened to data source preparation? Has it been relegated to the data wrangling world of data scientists? As analytics engineers, shouldn’t this be in our realm, as we are essentially responsible for the whole flow of data within our data platforms?
This post is mostly exploratory, the output of my research after discussions with practitioners and reading whatever I could get my hands on. If anything can be taken away from it, it’s that there just might be another layer, outside our current grasp, that would be worth rediscovering in order to unlock more, richer and more valuable data.
What is Data Preparation?
In the words of Gartner:
“Data preparation is an iterative and agile process for finding, combining, cleaning, transforming and sharing curated datasets for various data and analytics use cases including analytics/business intelligence (BI), data science/machine learning (ML) and self-service data integration.
[…]
Furthermore, they allow users to identify anomalies and patterns and improve and review the data quality of their findings in a repeatable fashion. Some tools embed ML algorithms that augment and, in some cases, completely automate certain repeatable and mundane data preparation tasks.”
Data Preparation Tools Reviews and Ratings, by Gartner
We are lucky to have fantastic data ingestion services, but not all sources are serviceable through them. Take, for example, transactional databases. They are specific to a business domain and may have schematic malformations accumulated from years of app iterations. I’ve also worked with companies that had bought competitors and inherited their transactional databases, whose schemas didn’t fit nicely with the company’s main transactional database.
It is therefore part of our duty to take care of sourcing the data that is not supported out of the box by ingestion services. It becomes our responsibility to ingest high quality data sources that will not pollute our downstream data stack.
When dealing with imperfect data sources, we encounter many challenges. We first need to profile and repair that data, which includes tasks such as the following (a short sketch follows this list):
- Statistical profiling (e.g. do summary statistics correspond to expectations, such as a transaction’s average amount)
- Cleansing (e.g. invalid emails)
- Deduplicating (e.g. entity resolution for persons)
- Masking sensitive data (e.g. SSN, credit card info)
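To make those repair tasks concrete, here is a minimal sketch of what they can look like in pandas. The column names (amount, email, ssn) are hypothetical, and a real pipeline would be far more defensive, but the shape of the work is the same.

```python
# A minimal sketch of source-level repair with pandas; the column names
# (amount, email, ssn) are hypothetical.
import pandas as pd

def profile_and_repair(df: pd.DataFrame) -> pd.DataFrame:
    # Statistical profiling: eyeball summary statistics against expectations
    print(df["amount"].describe())

    # Cleansing: drop rows whose email doesn't match a simple pattern
    valid_email = df["email"].str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", na=False)
    df = df[valid_email]

    # Deduplicating: naive entity resolution keyed on a normalized email
    df = df.assign(email=df["email"].str.lower().str.strip())
    df = df.drop_duplicates(subset=["email"])

    # Masking: redact all but the last four digits of the SSN
    df["ssn"] = df["ssn"].str.replace(r"^\d{3}-\d{2}", "***-**", regex=True)
    return df
```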
Those data sources might also present multiple opportunities for enrichment (again, a rough sketch follows this list). That could include:
- Extracting entities (e.g. which products are mentioned in a tweet)
- Labeling entries (e.g. how positive or negative is a user’s comment on a Slack thread)
- Extrapolation (e.g. can we extrapolate the state from an incomplete form submission)
- Normalization (e.g. are sessions with a duration over an hour reflective of reality or were they the result of a system issue)
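The enrichment side can start just as simply, as a handful of explicit rules. In this rough sketch, the zip-to-state lookup, the one-hour threshold and the keyword list are purely illustrative.

```python
# A rough sketch of simple enrichment rules; the lookup table, threshold
# and keywords are purely illustrative.
import pandas as pd

ZIP_TO_STATE = {"10001": "NY", "94105": "CA"}  # hypothetical lookup table

def enrich(df: pd.DataFrame) -> pd.DataFrame:
    # Extrapolation: fill in a missing state from the zip code when we can
    df["state"] = df["state"].fillna(df["zip_code"].map(ZIP_TO_STATE))

    # Normalization: null out implausible session durations (over an hour)
    df.loc[df["session_seconds"] > 3600, "session_seconds"] = pd.NA

    # Labeling: a naive positive/negative flag for free-text comments
    positive = df["comment"].str.contains(r"\b(?:great|love|thanks)\b",
                                          case=False, na=False)
    df["sentiment"] = positive.map({True: "positive", False: "negative_or_neutral"})
    return df
```

The real value, of course, comes when these rules are suggested or learned rather than hand-coded, which is where the tooling discussion later in this post comes in.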
The bottom line is that your source data usually needs more massaging than just renaming fields and explicitly casting them to the right type.
Sample Data Preparation Pipeline
I want to take a real-world project as an example of what data preparation means while building a data product.
The discursus project is a data product that continuously delivers data on protest movements throughout North America. Here’s a diagram of its data asset production flow.
The architecture looks like this.
It uses GDELT as its main source of data. It is a massive, real-time database of political events depicted in news media. It is definitely an immensely rich data source, but it does come with its own set of challenges.
To make that data meaningful for this project, I have data preparation tasks to (a sketch of the second task follows this list):
- Classify the relevancy of entries with regard to our data product’s domain
- Enrich media URLs with the article’s metadata and content
- Extract entities from the articles, such as the number of participants at a protest event
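As an illustration of the second task, here is roughly what enriching a media URL with article metadata can look like using requests and BeautifulSoup. The field names are illustrative, not the actual discursus schema.

```python
# A rough sketch of the media-URL enrichment step; field names are
# illustrative and error handling is kept to a minimum.
import requests
from bs4 import BeautifulSoup

def enrich_article(url: str) -> dict:
    # Fetch the article and pull basic metadata out of the HTML head
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    title = soup.title.string.strip() if soup.title and soup.title.string else None
    description_tag = soup.find("meta", attrs={"name": "description"})
    description = description_tag.get("content") if description_tag else None

    return {"url": url, "title": title, "description": description}
```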
Those are just a few of the data preparation tasks I’m currently doing, but there are many more on the roadmap. That includes coming up with my own values for fields that are already provided by GDELT but that are somewhat off, or not accurate enough in my opinion.
Just dealing with that single source of data made me aware of how important data preparation is before loading that data into a cloud data warehouse.
Analytics Engineers Doing Data Preparation
Before dbt, putting together a data warehouse could feel like stitching together a few pipelines using stored procedures, Python scripts and cron jobs. It was painful to say the least. We were, or at least I was, spending way more time fixing issues than delivering new analytical value.
dbt came in and packaged the whole DAG (directed acyclic graph) of data processing in an elegant library that allowed the soon-to-be-named analytics engineers to build new data assets and not just troubleshoot failing pipelines. Amen!
From that revolutionary core, as well as the advent of cloud data warehouses, a modern data stack (MDS from now on) world emerged. And although we gained a lot from this revolution, some functionalities fell through the cracks.
The MDS paradigm has been built around the core functionalities of loading data in your cloud data warehouse and transforming it. But a data platform is broader than just those layers. Data preparation seems to have been one such task that has been “forgotten” as it didn’t fit nicely within that MDS paradigm.
But the need is still there. So how do analytics engineers prepare their data at the moment?
Reading through Slack channels, I stumbled upon a few conversations such as this one that gives an idea of how eclectic the approaches are.
Besides all the various tools and needs described in that thread, one particular use case caught my attention. A user talks about how their team needs to pre-process over 40k tables as they are sourcing system files from manufacturers around the world… and none of those source files are the same. That, to me, is a perfect example of why data preparation is still essential in an MDS world.
Your Options in the Modern Data Stack
So how do you do data preparation nowadays? Here’s a rough and incomplete layout of some of your options.
The axes used here are:
- The horizontal axis is how close to the MDS principles those solutions are. Are they cloud agnostic? Are they modular and interchangeable? Or do they live, and only play nicely, within a vendor’s closed ecosystem?
- The vertical axis is how specific that solution is to data preparation. Some solutions allow us to do data preparation but are not specifically built for that purpose, while others are all about preparing data sources before moving data further down the stack.
As we can see, there are a few options at hand:
- Enterprise vendor tools such as Oracle Cloud Data Preparation Service
- Cloud vendor tools such as AWS Glue or Google Cloud Dataprep.
- BI vendor UI-driven solutions such as Tableau Prep.
- MDS vendors that offer capabilities to externalize data processing, such as Snowflake Snowpark or dbt’s upcoming Python models (a sketch of the latter follows this list).
- And then there is a world of open source libraries.
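To give an idea of the externalized-processing route, here is a minimal, hypothetical sketch of what a dbt Python model could look like for a light preparation step on Snowflake/Snowpark. The ref name and columns are made up, and since Python models are still new, the exact contract may evolve.

```python
# A hypothetical dbt Python model doing light source preparation on Snowpark.
# The ref name and column names are made up for illustration.
import snowflake.snowpark.functions as F

def model(dbt, session):
    dbt.config(materialized="table")

    # On Snowflake, dbt.ref returns a Snowpark DataFrame
    events = dbt.ref("raw_protest_events")

    # Drop rows without a usable URL and normalize the country code
    prepared = (
        events
        .filter(F.col("SOURCE_URL").is_not_null())
        .with_column("COUNTRY_CODE", F.upper(F.col("ACTION_GEO_COUNTRY_CODE")))
    )
    return prepared
```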
But as modern data stack practitioners, we want to stay true to those principles of modularity, interchangeability, “do one thing, and do it well”, being code-driven, cloud agnosticism, etc.
What if you want data preparation to happen within your stack, within the boundaries of that data platform you are building and that is mostly driven by code and can be deployed, monitored and continuously improved using DataOps’ best practices?
Then I guess there are a few DIY solutions at your disposal. I mean…
“The libraries exist… but you need a team to implement them and keep them running. Tools like Novacene aim to free up the data scientists to focus less on maintenance and more on innovation.”
Marcelo Bursztein, CEO of NovaceneAI
A New Layer
Acronyms are great, because they can convey complex ideas in simple-to-grasp buzzwords. The MDS is how we went from an ETL (Extract, Transform, Load) paradigm to ELT: data is loaded into your cloud data warehouse, and only then do we transform it into the meaningful entities that will be consumed by your data apps.
But acronyms can be limiting if we let them stop us from expanding beyond their immediate meaning. Your data doesn’t really care whether its lifecycle can all be contained in the warehousing phase. The ELT paradigm is great, but let’s not lose sight of how broad your data platform really is, and let’s not drop all that was good in the past either.
If we refer to a16z’s crowded mapping of today’s data platform architectures, we can see how that ELT paradigm fits in there. But it’s way broader than this. And even within that overly complex architecture, I still don’t see where the preparation of source data is supposed to take place.
Let’s switch decades and take a look at a slide taken from an “old” Oracle deck on their own data preparation solution. Looking at their selection of data sources, it is more obvious that a data platform deals with more varied sources and that there is a need to repair (prepare) and enrich your data before you do any kind of transformation.
Those data sources are sometimes unstructured, sometimes incomplete, sometimes not uniform from version to version, or full of bad values, etc. Whenever we start ingesting data that doesn’t come from the usual SaaS APIs supported by data loader services, we’re bound to face escalating data issues.
That amounts to quality issues, lost opportunities with unstructured data that we can’t easily ingest, and an exponentially more complex architecture to monitor, support and extend.
Not so long ago I would have put data preparation in the same bucket as staging source data in a dbt project. But now that we have mature orchestrators such as Dagster and Prefect at our disposal, there is an opportunity for the right person to do data preparation at the right time, with the right tool.
For discursus, I’ve added a new data preparation layer to my stack, which is serviced by a bundle of services and Python scripts.
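To give a sense of what that looks like in practice, here is a minimal sketch of how a preparation step can live as Dagster software-defined assets next to the rest of the platform. The asset names and the toy logic are mine for illustration, not the actual discursus code.

```python
# A minimal sketch of a preparation layer expressed as Dagster assets;
# the asset names and logic are illustrative, not the discursus code.
from dagster import asset

@asset
def raw_gdelt_events():
    # Stand-in for the real extraction step (details omitted)
    return [{"source_url": "https://example.com/article", "event_code": "014"}]

@asset
def prepared_gdelt_events(raw_gdelt_events):
    # The preparation step becomes just another asset in the DAG, so it gets
    # the same scheduling, monitoring and lineage as everything else
    return [event for event in raw_gdelt_events if event.get("source_url")]
```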
It works well and it somewhat follows the core MDS principles. I at least now have it as a modular task in my data platform, and I have more visibility into its performance and output.
But I do have to code those transformations individually, package them and automate their flow. Also, as an analytics engineer, I may not be the best person to own that preparation layer. It could be that someone more knowledgeable, closer to the business domain, is the best person to come up with the transformation specifications, while as a data engineer I only plug them into the data platform and monitor their execution.
Servicing that Layer
What I’m really interested in are solutions that provide mechanisms and recommendations to repair and enrich my data sources.
All the tasks above should be driven by an AI-assisted preparation tool that (a toy sketch follows this list):
- Identifies opportunities to repair and enrich the data
- Takes user input to accept or decline suggestions
- Integrates user input for new rules to follow when identifying future data preparation opportunities
- Automates the application of some or all rules
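In a deliberately naive form, that loop could boil down to something like the following sketch: detect candidate rules, let a human accept or decline them, then apply only the accepted ones on every run. Everything here is hypothetical.

```python
# A toy sketch of the suggest/accept/apply loop; the rule detection is
# deliberately naive and all names are hypothetical.
import pandas as pd

def suggest_rules(df: pd.DataFrame) -> list:
    # Identify simple repair opportunities, e.g. text columns that are
    # almost entirely numeric and could be cast
    suggestions = []
    for column in df.columns:
        if df[column].dtype == "object":
            numeric = pd.to_numeric(df[column], errors="coerce")
            if numeric.notna().mean() > 0.95:
                suggestions.append({"column": column, "rule": "cast_to_numeric"})
    return suggestions

def apply_accepted_rules(df: pd.DataFrame, accepted: list) -> pd.DataFrame:
    # Apply only the rules a domain expert accepted; persist them so future
    # runs can apply them automatically
    for rule in accepted:
        if rule["rule"] == "cast_to_numeric":
            df[rule["column"]] = pd.to_numeric(df[rule["column"]], errors="coerce")
    return df
```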
Maybe what we’re looking for is a mix of data prep UI tools, used by domain experts to transform the data first, combined with a code-output runner that can be orchestrated to run in the data preparation layer.
And maybe that solution already exists. For example, Google Cloud Dataprep is a UI tool that lets a domain expert transform source data into high-quality, actionable datasets. The output is Apache Beam code that can be invoked through a runner such as Dataflow. And maybe that runner can be orchestrated by Dagster, for example, so that we always have a single point of view and control over our whole data platform and its data assets.
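For illustration, here is a hypothetical sketch of what the orchestration end of that flow could look like: a Dagster asset that launches a Dataflow template exported by Dataprep, using the Google API client. The project, region, bucket and template paths are placeholders, not a working configuration.

```python
# A hypothetical sketch of kicking off a Dataprep-generated Dataflow template
# from Dagster; project, region and paths are placeholders.
from dagster import asset
from googleapiclient.discovery import build

@asset
def prepared_protest_articles():
    # Launch the Dataflow template that Cloud Dataprep exported for the recipe
    dataflow = build("dataflow", "v1b3")
    request = dataflow.projects().locations().templates().launch(
        projectId="my-gcp-project",
        location="us-central1",
        gcsPath="gs://my-bucket/templates/prepare_articles",  # Dataprep export
        body={"jobName": "prepare-protest-articles"},
    )
    # The job metadata is returned so downstream assets can depend on it
    return request.execute()
```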
We could then end up with a flow that looks somewhat like this?
Of course, that solution design is all hypothetical at this point and will be further explored in a future post. But the main takeaway here should be that there is an opportunity for data preparation to be done at the proper moment in our data platforms and productionized using the MDS core principles and within the bounds of the tools that allow us to produce with confidence the high-quality datasets our users expect.
Unleash the Data
The data preparation world today feels like the pre-dbt world did back then. There’s no lack of tools, open source libraries or architectures to support any kind of data preparation we might think of. But wouldn’t it be great if we could rely on a dbt-like package/service that would fit nicely in our stacks and take care of preparing all data sources before they even got to our data warehouses?
Think of all that unstructured data you’ve got lying around in your data lake that is not bringing any value. Think of all the small repairs that you are doing in dbt at the moment, but that should probably have been done prior to staging. Think of all the new fields you’re creating (“case… when…” anyone?) again in dbt, but that should also have been done earlier.
There’s a vast world of untapped data out there. And although ingestion SaaS providers are doing a great job for the most common sources, we’re ignoring all the other ones and leaving a ton of untapped insight on the table.
The modern data stack is awesome, no arguing about that. But as our approach to data platform development expands, there’s an opportunity to rediscover tasks that were taken care of in the older enterprise world. The data platform landscape is wide and has yet to be fully captured by our MDS principles and best practices.
Here’s to hoping that analytics engineering practitioners and new companies take note, embrace this “boring” problem and unleash the potential value of all the data out there.