ARM, Automation, Azure, Data Factory, DevOps, Professional

Understanding Data Factory in a CI/CD world

October 16, 2020December 14, 2020 by John Folberth

For those that are unfamiliar Azure Data Factory is defined as:

…the cloud-based ETL and data integration service that allows you to create data-driven workflows for orchestrating data movement and transforming data at scale. Using Azure Data Factory, you can create and schedule data-driven workflows (called pipelines) that can ingest data from disparate data stores. You can build complex ETL processes that transform data visually with data flows or by using compute services such as Azure HDInsight Hadoop, Azure Databricks, and Azure SQL Database.

Essentially think of it as a next generation SSIS engine if you are familar with the Microsoft tech stack or think of it as an alternative to SAP Data Services or Informatica if you are familiar with those tools.

If you are familiar with those tools then you are aware that they way development is done differs from a traditional application development lifecycle. This stems from the fact that if we are traversing and transforming large amounts of data we can’t do that on our local machines nor can really do that on a VM. As such we usually push/orchestrate any changes directly in a dev/sandbox environment. This traditional works; however, things get kind of hairy when we start adding more developers. We start running into potential issues with resource constraints and issue such as conflicting schema definitions.

Full transparency Data Factory does have an answer for some of this; however, it still has its tradeoffs.

Azure Data Factory “offically” supports CI/CD; however, if you walk through this documentation take special note of this diagram:

Diagram of continuous integration with Azure Pipelines

Notice that the code flow would be:

master->adf_publish

This push is triggered by the Publish button in the Azure Data Factory Pipeline Editor. What it is really doing is consolidating ARM templates.

Here is what master would look like:

While the adf_publish branch looks like:

Wait what happened? My best guess is that the publish button runs some Powershell to consolidate ARM templates. Unfortunately what this translates into is that our adf_publish branch can’t merge back into master due to conflicts:

So if you have a development background this may seem very strange to you. Our master branch does not equal our publish branch? Nor can we take our publish branch and move it back into master?

By going this approach there are a few other downsides mainly:

More likely to hit ARM Template Limitations
- 256 parameters
- 256 variables
- 800 resources (including copy count)
- 64 output values
- 24,576 characters in a template expression
Deployment will be everything
Does not include additional Data Factory settings like Monitoring
Only one ARM parameter file is produced
Access to everything in the repository

Let me stress, this approach DOES WORK. As in many things development wise there is always more then one approach. At the end of day what we are doing is deploying ARM templates, so why not do them manually?

To accomplish this a structure of repos might be created and organized such as:

Pipelines
DataSets
Linked Services
Triggers
DataFactory and it’s dependencies

Each repository will contain just the ARM template sections that it relates to. To do this each repository can publish out of it’s main branch with it’s desired ARM templates and environment specific parameters. This may feel more natural to those who do applciation development. We can create branches/tags as needed to better evaluate and manage our releases. In addition we can also now control access at the repository level. Require special access for Linked Services, now you have it.

The trade off for this flexibility would mean:

Multiple Repositories
Managed Dependencies between repositories
Still all pipelines or none

Further more another potential option would be the introduction of a repository per Pipeline. This may look like:

Pipeline1
Pipeline2
Linked Services
DataSets
Triggers
DataFactory and it’s Dependencies

Again a lot of the same tradeoffs as mentioned before still exist; however, now we can scope everything down to the pipeline at the repository level. Not to mention now each pipeline can adhere to it’s own development life cycle.

However even this has it’s tradeoffs:

Even more repositories
More likely to hit 800 resource group deployments limit
Dependencies
“Technically” every deployment redeploys a shell data factory
- Updating Data Factory API could be problematic

Let me stress again ANY OF THESE WAYS WORK. It’s just a matter of how a team/organization/individual wants to maintain and organize their work. Hopefully this has given a good overview on just some of the options available to you for CI/CD with Data Factory.