Recurring batch data pipelines are a staple of the modern enterprise-scale data warehouse. As a data warehouse scales to support more products and services, a growing number of interdependent pipelines running at various cadences can give rise to periodic resource bottlenecks for a cluster. This resource contention results in pipelines starting at unpredictable times each day and, consequently, variable landing times for the data artifacts they produce. The variability is compounded by the dependency structure of the workload, and the resulting unpredictability can disrupt the project workstreams that consume this data. We present Clockwork, a delay-based global scheduling framework for data pipelines that improves landing time stability by spreading tasks out throughout the day. Whereas most scheduling algorithms optimize for makespan or average job completion time, Clockwork’s execution plan optimizes for stability in task completion times while also targeting predefined pipeline. Online experiments comparing this novel scheduling algorithm with a previously proposed greedy procrastinating heuristic show that tasks complete almost an hour earlier on average, while exhibiting lower landing time variance and producing significantly less competition for resources in a target cluster.