Azure Data Factory – Modern ETL On Cloud – Data Migration Use Case | Azure Data Factory

Introduction

ETL is one of the major tasks for any data engineer, and we have many solutions either on-premise or cloud solutions available in the market to implement this concept, in Microsoft Azure, Azure Data Factory is the ETL solution to implement data pipelines using data from the cloud source or data from on-premise sources, Azure Data Factory is very flexible, GUI ETL solution which provides different types of connectors to on-premise and cloud sources, in this article, we will go through the basic components of Azure Data Factory and we will implement a Full Use Case step by step.

Data Factory Components

Dataflow

One of the main building blocks of data factory which represents simply the ETL job. Any ETL job has three components Source Dataset(s), Transformations, Target Dataset(s). In Data Factory we call ETL job as a Dataflow

Activity

Activity is one task or a collection of tasks that can be executed sequentially or in parallel, we have different types of tasks we can execute such as a Dataflow task which we already discussed in previous point, Copy Data task, Databricks task, and so on.

Linked Service

Linked Service is the connection details for a specific source, to connect to any data source or an kind of services in Data Factory we need to create a Linked Service for this source or service, you can think of it as a connection statement with all required details to connect to a specific source or service.

Pipeline

The pipeline is a workflow to orchestrate a group of activities together in sequential or parallel order. A common use case is when you have a group of activities that are related to the same use case and affect the output data you will need to create a pipeline to control the flow of data and dependencies between activities. The conclusion is that the Pipeline is a logical grouping of activities.

Use Case

In this use case, we will use Copy Data activity to migrate data from Blob Storage to SQL Server database, Objective from this use case is to walk you through the development process of Data Factory, and see in action Data Factory components.

1- Login to Azure portal

2- Click on Create a Resource

3- Search for Data Factory and click on it

4- Click on Create

5- Choose Azure Subscription for your billing, Resource group, region, name of your data factory, in our case we will name our Data Factory as dvdevelopmentdf

6- You can link your data factory development with a Git repository either Github or Azure DevOps Repository , click on Configure Git Later

7- If needed add Tag to your Data Factory, Tags enables you track expenses, for example, we added team tag to track all resources that has the same team tag value

8- Review settings and click Create

9- Wait till resource is created successfully then click on Go to Resource

10- Click on Author & Monitor

11- Next, we need to create Linked Services (Connections) to our source and target for our Copy Data Activity, to do that click on Manage

12- These two demos will show you the process of creating a Linked Service for SQL and Storage services

SQL Connection

Storage Connection

13- Now, we are ready to create a Copy Data Activity, go back to the main Data Factory window, click on Data Factory, then click on Copy Data

14- Enter Task details, Task Name –> Copy Data Task name, then choose Task schedule to run it once or run it based on schedule

15- Next, we will enter source details, choose source connection

Choose the path to your file

Then choose file configurations, like file format, column delimiters, row delimiters, then check on First Row Header in case the first row has the column names, and check sample of data to make sure data is readed successfully

16- Choose Target Connection, in our case it is SQL connection

17- We can create target table dynamically, or choose an existing table from the target connection, in our case we will create the table dynamically using the source file schema

18- Next, we will choose column mapping between source and target columns, also we can change data types for the output columns

19- The last step is to configure general settings for our task, check Data Consistency verification to let data factory run validation tests between source and target data, choose from the list of fault tolerance approaches, here we will choose to skip incompatible rows, Enable Logging to log task events to a destination, here we selected storage connection, then we clicked on Browse and select a directory we already created for the logs, click Next

20- Review Summary, then click Next

21- The last step is deployment step, data factory will validate our task, create a target dataset, create a pipeline for our activity, then will execute our pipeline. Note, pipeline that will be created will contain only one activity which is our copy data activity

Go to Author window, and you can see we have a pipeline with our activity

22- Check Monitor window, you can see our pipeline executed successfully

Now let’s check the data in the target table to validate our pipeline

Congratulations, you have your first pipeline executed successfully, now you have a practical sense of Data Factory process end to end and of the Data Factory components.


Check our latest articles

Ahmed Ibrahem

Ahmed Ibrahem is working as Data Engineering Team lead, with a wide experience in data management projects and technical implementation using different technologies, and delivering end to end projects starting from business analysis to Data Warehouse modeling and implementation to BI design to customers in different industries.

guest
0 Comments
Inline Feedbacks
View all comments