Databricks DSL: How we automated job orchestration

In 2024, we started building our data pipelines and workflows in Databricks. As our usage of the platform has grown, one of the challenges we've run into is orchestrating data jobs in an optimal way.

Initially, we’d split our jobs into 3 pipelines - bronze, silver, and gold - representing the 3 stages of the medallion architecture. Whilst conceptually this may have made some sense, it meant we had to wait for all of the bronze jobs to finish before kicking off the silver pipeline, and likewise for the gold pipeline. This was suboptimal: silver jobs would sit waiting for unrelated jobs in the bronze pipeline to finish.

We thought about instead splitting the pipelines by domain, so that, for example, all jobs associated with the domain of “meter readings” would belong to their own pipeline, further split into bronze, silver, and gold. However, due to the nature of our data, a large number of jobs depend on a small subset of jobs in some way. To accommodate this we would have had to orchestrate those dependencies manually, which wouldn’t scale with the data demands we have.

Our DSL and Parser

Our solution has been to create our own domain-specific language (DSL) and write a custom parser for it to automate the creation of a dependency graph. The syntax for our DSL is as follows:

target <- <dependency> ... <dependency>
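For example (the job names here are hypothetical, for illustration only), a gold aggregation that depends on two silver jobs, one of which depends on a bronze job, would be declared as:

```
silver_meter_readings <- bronze_meter_readings
gold_daily_consumption <- silver_meter_readings silver_customer_accounts
```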

From this we construct the following graph type:

pub struct WorkflowEdge(pub NodeId, pub NodeId);

A collection of WorkflowEdge values forms a directed acyclic graph (DAG), which is vital for us, as our pipelines cannot tolerate cyclic (looping) dependencies - those would lead to never-ending pipeline runs. Once our DSL has been parsed we validate each node in each edge to make sure the nodes exist, and then build the JSON representation of the pipeline that we send to Databricks.
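As a rough sketch of how the parse-and-validate step can work (this is not our exact implementation: `NodeId` is simplified to `String` here, and the cycle check uses Kahn's algorithm):

```rust
use std::collections::HashMap;

// Simplified stand-in for our NodeId type.
type NodeId = String;

pub struct WorkflowEdge(pub NodeId, pub NodeId);

// Parse "target <- dep ... dep" lines into one edge per dependency,
// directed from the dependency to the target.
fn parse(dsl: &str) -> Vec<WorkflowEdge> {
    dsl.lines()
        .filter_map(|line| line.split_once("<-"))
        .flat_map(|(target, deps)| {
            let target = target.trim().to_string();
            deps.split_whitespace()
                .map(|dep| WorkflowEdge(dep.to_string(), target.clone()))
                .collect::<Vec<_>>()
        })
        .collect()
}

// Kahn's algorithm: true if the edges form a DAG (no directed cycles).
fn is_acyclic(edges: &[WorkflowEdge]) -> bool {
    let mut indegree: HashMap<&str, usize> = HashMap::new();
    let mut successors: HashMap<&str, Vec<&str>> = HashMap::new();
    for WorkflowEdge(from, to) in edges {
        indegree.entry(from).or_insert(0);
        *indegree.entry(to).or_insert(0) += 1;
        successors.entry(from).or_default().push(to);
    }
    // Start with every node that has no dependencies.
    let mut ready: Vec<&str> = indegree
        .iter()
        .filter(|(_, d)| **d == 0)
        .map(|(node, _)| *node)
        .collect();
    let mut visited = 0;
    while let Some(node) = ready.pop() {
        visited += 1;
        if let Some(next) = successors.get(node) {
            for &succ in next {
                let d = indegree.get_mut(succ).unwrap();
                *d -= 1;
                if *d == 0 {
                    ready.push(succ);
                }
            }
        }
    }
    // If a cycle exists, its members never reach indegree 0.
    visited == indegree.len()
}
```

The same topological ordering that proves the graph is acyclic also tells the scheduler which jobs are ready to run at any point.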

Because we know the dependencies of every job, we can optimise things like when jobs run, how many run in parallel, and what clusters they run on. This means we never have a gold job waiting to run because an unrelated long-running bronze or silver job is still running. In practice this has reduced our end-to-end pipeline duration from ~12 hours to ~1.5 hours.
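To give a flavour of the output (task names hypothetical, fields abbreviated), the generated payload expresses each edge as a per-task `depends_on` entry in the Databricks Jobs API, so every task starts as soon as its own dependencies finish rather than waiting for a whole stage:

```json
{
  "name": "workflow",
  "tasks": [
    { "task_key": "bronze_meter_readings" },
    {
      "task_key": "silver_meter_readings",
      "depends_on": [{ "task_key": "bronze_meter_readings" }]
    },
    {
      "task_key": "gold_daily_consumption",
      "depends_on": [
        { "task_key": "silver_meter_readings" },
        { "task_key": "silver_customer_accounts" }
      ]
    }
  ]
}
```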