Ihor Zahorodnii
DataOps for the modern data warehouse
This article describes how a fictional city planning office could use this solution. The solution provides an end-to-end data pipeline that follows the MDW architectural pattern, along with corresponding DevOps and DataOps processes, to assess parking use and make more informed business decisions.
Architecture
The following diagram shows the overall architecture of the solution.
Dataflow
Azure Data Factory (ADF) orchestrates and Azure Data Lake Storage (ADLS) Gen2 stores the data:
The Contoso city parking web service API is available to transfer data from the parking spots.
There’s an ADF copy job that transfers the data into the Landing schema.
Next, Azure Databricks cleanses and standardizes the data. It takes the raw data and conditions it so data scientists can use it.
If validation reveals any bad data, it gets dumped into the Malformed schema.
Important
People have asked why the data isn’t validated before it’s stored in ADLS. The reason is that the validation might introduce a bug that could corrupt the dataset. If you introduce a bug at this step, you can fix the bug and replay your pipeline. If you dumped the bad data before you added it to ADLS, then the corrupted data is useless because you can’t replay your pipeline.
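To make this validate-after-landing approach concrete, here is a minimal PySpark sketch of the kind of Databricks code involved. The container paths, column names (bay_id, reading_ts), and the simple completeness rule are assumptions for illustration; the real pipeline's schema and rules would differ.

```python
# Minimal Databricks (PySpark) sketch: validate data *after* it lands in ADLS,
# so a buggy rule can be fixed and the pipeline replayed from the raw copy.
# All paths, column names, and rules below are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

base = "abfss://datalake@contosostorage.dfs.core.windows.net"  # hypothetical account
raw = spark.read.json(f"{base}/landing/parking")

# Example completeness rule: a record must carry a bay ID and a timestamp.
is_valid = F.col("bay_id").isNotNull() & F.col("reading_ts").isNotNull()

# Good records feed the standardized (interim) layer; bad ones go to Malformed.
raw.filter(is_valid).write.mode("append").format("delta").save(f"{base}/interim/parking")
raw.filter(~is_valid).write.mode("append").format("delta").save(f"{base}/malformed/parking")
```

Because the raw data stays untouched in the landing path, a faulty rule only affects derived data, which can be rebuilt by replaying this step.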
There’s a second Azure Databricks transform step that converts the data into a format that you can store in the data warehouse.
Finally, the pipeline serves the data in two different ways:
Databricks makes the data available to the data scientist so they can train models.
PolyBase moves the data from the data lake to Azure Synapse Analytics, and Power BI accesses the data and presents it to the business user.
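As a hedged illustration of that second transform step, the following PySpark sketch shapes standardized readings into a warehouse-friendly fact table; the grain, column names, and occupancy logic are invented for this example.

```python
# Illustrative second Databricks transform: aggregate standardized readings
# into a fact-table layout that the warehouse load (PolyBase/COPY) can ingest.
# The grain, columns, and status values are assumptions for this sketch.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

base = "abfss://datalake@contosostorage.dfs.core.windows.net"  # hypothetical account
interim = spark.read.format("delta").load(f"{base}/interim/parking")

fact_parking = (
    interim
    .withColumn("reading_date", F.to_date("reading_ts"))
    .groupBy("reading_date", "bay_id")
    .agg(
        F.count("*").alias("reading_count"),
        F.sum(F.when(F.col("status") == "Occupied", 1).otherwise(0)).alias("occupied_count"),
    )
)

# Parquet in a dedicated folder gives PolyBase a clean external-table source.
fact_parking.write.mode("overwrite").parquet(f"{base}/dw/fact_parking")
```

From there, Azure Synapse Analytics can load the folder through PolyBase, and Power BI queries the resulting warehouse tables.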
Components
The solution uses these components:
Azure Data Factory (ADF)
Azure Databricks
Azure Data Lake Storage (ADLS) Gen2
Azure Synapse Analytics
Azure Key Vault
Azure DevOps
Power BI
Scenario details
A modern data warehouse (MDW) lets you easily bring all of your data together at any scale. It doesn’t matter if it’s structured, unstructured, or semi-structured data. You can gain insights from an MDW through analytical dashboards, operational reports, or advanced analytics for all your users.
Setting up an MDW for both development (dev) and production (prod) environments is complex. Automating the process is key: it helps increase productivity while minimizing the risk of errors.
Solution requirements
Ability to collect data from different sources or systems.
Infrastructure as code: deploy new dev and staging (stg) environments in an automated manner.
Deploy application changes across different environments in an automated manner:
Implement continuous integration and continuous delivery (CI/CD) pipelines.
Use deployment gates for manual approvals.
Pipeline as code: ensure that the CI/CD pipeline definitions are in source control.
Carry out integration tests on changes using a sample data set (a test sketch appears at the end of this section).
Run pipelines on a scheduled basis.
Support future agile development, including the addition of data science workloads.
Support for both row-level and object-level security:
The security feature is available in SQL Database.
You can also find it in Azure Synapse Analytics, Azure Analysis Services (AAS), and Power BI.
Support for 10 concurrent dashboard users and 20 concurrent power users.
The data pipeline should carry out data validation and route malformed records to a specified store.
Support monitoring.
Centralized configuration in a secure store such as Azure Key Vault (see the sketch that follows).
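As a minimal sketch of that last requirement, a job or notebook can resolve its settings from Key Vault at run time by using the azure-identity and azure-keyvault-secrets packages; the vault URL and secret name below are hypothetical placeholders.

```python
# Sketch: read pipeline configuration from Azure Key Vault instead of
# hard-coding it. Vault URL and secret name are hypothetical placeholders.
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# DefaultAzureCredential resolves a credential from the environment: developer
# sign-in locally, managed identity or a service principal on build agents.
credential = DefaultAzureCredential()
client = SecretClient(
    vault_url="https://contoso-mdw-kv.vault.azure.net",
    credential=credential,
)

# Fetch a connection string at run time; nothing sensitive lives in source control.
sql_connection_string = client.get_secret("synapse-connection-string").value
```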
For more details, see the full article: https://learn.microsoft.com/en-us/azure/architecture/databases/architecture/dataops-mdw
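Finally, to illustrate the integration-test requirement above, here is a pytest-style sketch that exercises the same illustrative validation rule from the dataflow against a tiny in-memory sample data set. The helper and its rule are assumptions carried over from the earlier validation sketch, not the project's actual tests.

```python
# Pytest-style sketch: run the illustrative validation rule against a small
# sample data set and assert on the valid/malformed split.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

def split_valid(df):
    """Same hypothetical completeness rule as the validation sketch above."""
    is_valid = F.col("bay_id").isNotNull() & F.col("reading_ts").isNotNull()
    return df.filter(is_valid), df.filter(~is_valid)

def test_malformed_records_are_routed_away():
    spark = SparkSession.builder.master("local[1]").getOrCreate()
    sample = spark.createDataFrame(
        [
            ("bay-1", "2024-01-01T08:00:00"),  # complete -> valid
            (None, "2024-01-01T08:05:00"),     # missing bay ID -> malformed
            ("bay-2", None),                   # missing timestamp -> malformed
        ],
        ["bay_id", "reading_ts"],
    )
    valid, malformed = split_valid(sample)
    assert valid.count() == 1
    assert malformed.count() == 2
```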