Dagster is just better
Asset functions have made Dagster my go-to for data engineering automation
A recent pairing reminded me why I just really like Dagster: it makes local testing and troubleshooting a breeze.
I don’t need a connection to the data warehouse, cloud storage, a compute platform, etc. Well-composed assets are just simple functions that I can run in the Python interpreter.
Dagster differs from Airflow, Prefect, and other orchestrators by putting the focus on the data assets being produced instead of the operations that produce them. Specifically, Dagster requires a developer to write the business logic as a function that takes in data and emits data.
It felt like a minor difference to me when Dagster showed up in 2019. But it just feels right. Or at least better. Focusing on data inputs and outputs moves interfaces with external systems outside the asset. The functions that define assets can run on my machine, even without internet.
An example
My team was working on a data pipeline that stores inference metadata in our data warehouse. The asset looks something like this:
from dagster import asset
from pandas import DataFrame


@asset
def inference_log(features: DataFrame, scores: DataFrame) -> DataFrame:
    inferences = features.join(scores)
    # ... a bunch of transformations and validations
    return inferences
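One detail that makes this work: Dagster builds the dependency graph from the function signature, so the features and scores parameters resolve to upstream assets with those names. The stubs below are hypothetical, just to show the shape; the real upstream assets were produced elsewhere in our pipeline.

from dagster import asset
from pandas import DataFrame


# Hypothetical stand-ins for the real upstream assets. Dagster matches the
# parameter names of inference_log (features, scores) to assets with the
# same names when it wires up the graph.
@asset
def features() -> DataFrame:
    return DataFrame({"f1": [0.1, 0.2]})


@asset
def scores() -> DataFrame:
    return DataFrame({"score": [0.9, 0.3]})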
To figure out what was happening, we did this in a Python REPL:
>>> import pandas as pd
>>> from assets import inference_log
>>> features = pd.read_parquet("features.parquet")
>>> scores = pd.read_parquet("scores.parquet")
>>> inferences = inference_log(features, scores)
That’s it. We didn’t have to configure connections to other systems (in this case, Databricks). We didn’t have to comment out a bunch of code that was touching other systems. We just called the function that defines the asset.
Compared to Airflow
In Airflow, we probably would have done something like this:
from airflow.sdk import task
from databricks import sql
import pandas as pd


@task()
def write_inference_log():
    connection = sql.connect()
    features = pd.read_parquet("s3://bucket/features.parquet")
    scores = pd.read_parquet("s3://bucket/scores.parquet")
    inferences = features.join(scores)
    # ... a bunch of transformations and validations
    inferences.to_sql("ml_logs.inference", connection)
Running this function requires a full environment. I need connections to S3 and Databricks. Pre-configured remote environments sound great, but how do I work on a plane or in a cabin in the woods? Or, more likely, from the cafe down the street with brutally slow internet?
Dagster’s pattern
Of course, I can be careful to define my Airflow tasks in a way that makes them easy to run and troubleshoot locally. But I’m not that smart. I spent 5 years working in Airflow, and I never saw it done, nor did I think to do it myself.
Instead, with Dagster, if you just follow the default pattern, you’re golden:
- IO managers (i.e., subclasses of ConfigurableIOManager) handle interactions with the outside world
- Assets (i.e., functions decorated with @asset) transform parameters into results
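To make that split concrete, here is a minimal sketch. The ParquetIOManager below is hypothetical (our real pipeline talked to Databricks); the point is that every read and write lives in the IO manager, and the asset function never sees any of it.

import pandas as pd

from dagster import ConfigurableIOManager, Definitions, InputContext, OutputContext

from assets import inference_log  # the asset from earlier


class ParquetIOManager(ConfigurableIOManager):
    """Hypothetical IO manager that stores each asset as a parquet file."""

    base_path: str

    def _path(self, context) -> str:
        return f"{self.base_path}/{context.asset_key.to_python_identifier()}.parquet"

    def handle_output(self, context: OutputContext, obj: pd.DataFrame) -> None:
        # Runs after the asset function returns; all writing happens here.
        obj.to_parquet(self._path(context))

    def load_input(self, context: InputContext) -> pd.DataFrame:
        # Runs to build the asset function's parameters; all reading happens here.
        return pd.read_parquet(self._path(context))


defs = Definitions(
    assets=[inference_log],
    resources={"io_manager": ParquetIOManager(base_path="s3://bucket/warehouse")},
)

Swap in a Databricks-backed IO manager in production and a local-filesystem one on your laptop, and the asset code doesn’t change.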
When the task or asset doesn’t do anything with the outside world, a bunch of things are obvious and easy:
- You can create unit tests for assets (see the sketch after this list)
- You can run assets locally, without an internet connection, with breakpoints, profilers, anything you like
- You can tweak inputs locally and re-create edge cases
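For example, a unit test for the simplified inference_log above needs nothing but pandas and the asset function (the assertions assume the simplified body shown earlier, not the full set of transformations):

import pandas as pd

from assets import inference_log


def test_inference_log_joins_features_and_scores():
    # Tiny hand-built inputs stand in for the real warehouse tables.
    features = pd.DataFrame({"f1": [0.1, 0.2]})
    scores = pd.DataFrame({"score": [0.9, 0.3]})

    inferences = inference_log(features, scores)

    # Both sides of the join should show up in the result.
    assert {"f1", "score"} <= set(inferences.columns)
    assert len(inferences) == 2

Re-creating an edge case is the same move: build the weird input by hand and call the function.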
Yeah, I could probably figure out how to do this with any framework. But then what’s the point of the framework? In Dagster’s case, the clever thinking is done for me, leaving me to do what I’m best at: cranking out kind-of-okay data pipelines.