Building a data pipeline using Dataflow | GCP Dataflow

Data uncover deep insights, support informed decisions, and enhances efficient processes. But when data coming from various sources, in varying formats, and stored across different infrastructures, so here are data pipelines are coming as the first step to centralizing data for reliable business intelligence, operational insights, and analytics. By contrast, the data pipeline is a broader term that encompasses ETL (extract, transform, load) as a subset.

ETL has historically been used for batch workloads, but a new breed of streaming ETL tools is emerging as part of the pipeline for real-time streaming event data.

In both cases, either dealing with a stream or batch data, a unified data processing that’s serverless, fast, and cost-effective is really needed. It’s time to introduce you to a key serverless tool that should be in your data engineering tool kit, Cloud Dataflow, one of the GCP products that help you address key challenges in batch and stream data processing and analytics.

What’s Cloud Dataflow

Cloud Dataflow is the serverless execution service for data processing pipelines written using the Apache beam.

Apache Beam is open-source. You author your pipeline and then give it to a runner. Beam supports multiple runners like Flink and Spark and you can run your beam pipeline on-prem or in Cloud which means your pipeline code is portable.

Why use Cloud Dataflow

  • Fully managed data processing service
    Our first see that dataflow is fully managed and auto-configured. You just deploy your pipeline.
  • OSS community-driven innovation with Apache Beam SDK
    What is pretty to know about Dataflow that it doesn’t just execute Apache beam transforms as is, it optimizes the graph fusing operations efficiently. Also, it doesn’t wait for a previous step to finish before starting a new step as long as there is no dependency.
  • Horizontal autoscaling of worker resources to maximize resource utilization
    Auto-scaling happens step by step in the middle of a job as a job needs more resources it receives more resources automatically.

How does Cloud Dataflow work?

As we mentioned earlier, Data Flow is a unified stream and batch data processing. But how did these distinct processes get combined in one service?

This is the genius of the Apache Beam. It provides abstractions that unified traditional batch programming concepts and traditional data processing concepts.

We have four main concepts:

1. PCollections (Parallel Collections)
— The data is held in a distributed data abstraction called a PCollection.
— The PCollections represent both batch data and streaming data, and there’s no limit size for PCollections.
— It is immutable, which means that any change happens in a pipeline in one PCollection it just creates a new one without making any change on the incoming PCollection.

2. PTransforms
— The actions are code that contained in an abstraction called a PTransform.
— The PTransform handles input, transformations, and output of the data.
— The data in PCollection is passed along the graph from one PTransform to another.

3. Pipelines
— A pipeline identifies the data to be processed and the actions to be taken on the data.
— A pipeline is not actually just a single linear progression, but rather a directed graph with branches and aggregations.

4. Pipeline Runners
Pipeline Runners are analogous to container host such as Kubernetes engine.
— The Identical pipeline can be run on a local computer, data center VMs, or on a service such as Cloud Dataflow in the cloud.
— The service is the runner uses to execute the code is called a back-end system.

The pipeline typically consists of reading data from one or more sources, applying processing to the data, and writing it to one or more sinks.

  • Graph optimization
    In order to execute the pipeline, the dataflow service first optimizes the graph of execution in various ways, for example, fusing transforms together
  • Work Schedule
    Dataflow breaks the jobs into units of work and schedules them to various workers.
  • Intelligent watermarking
    What about if data arrives lately into your streaming pipeline? Cloud Dataflow allows you the flexibility to handle lagged data with intelligent watermarking.
  • Auto-healing
    If some of the VMs failed during the course of the job, Dataflow will auto-heal and distribute the load to the active workers without any need for your intervention.
  • Monitoring
    In addition, you can monitor the service by looking at the logs which Cloud Dataflow will output during and after the processing as well.
  • Dynamic work rebalancer
    One of the cool things about Dataflow is that the optimization is always ongoing and that units of work are continually rebalanced dynamically. What that means is if a certain VM is taking too long for processing task state of flow service can rebalance mid-job.

For example, say you had 15 workers processing and all of them, except three, finished their jobs. You don’t have to wait for these busy workers. Cloud Dataflow will automatically balance out the workload to the idle workers. The overall job finishes faster and Dataflow is using the collections of VMs so it has more efficiently.

USE CASE: ETL Processing on Google Cloud Using Dataflow

In Google Cloud Platform, we use BigQuery as a data warehouse replaces the typical hardware setup for a traditional data warehouse. BigQuery organizes data tables into units called datasets. These datasets are scoped to your Google Cloud project.

project.dataset.table

Using Dataflow, you can build different data pipelines that ingest data from an available dataset into BigQuery. For this you are going to use these Google Cloud services:

  • Cloud Storage
  • Dataflow
  • BigQuery

Before data can be loaded into BigQuery for analytical workloads, it is typically stored in a Cloud Storage product and in a format that is native to its origin. As you expand your footprint in Google Cloud, you will probably capture your source data directly in Bigtable, Datastore, or Cloud Spanner and use Dataflow to ETL data into BigQuery in batch or streams.

Dataflow SQL

Dataflow SQL is a functionality of Dataflow that allows you to create pipelines directly from a BigQuery query. Let’s see how to create a Dataflow SQL job, write and run a Dataflow SQL query.

Using the Dataflow SQL UI

1- Go to the BigQuery web UI.

2- Switch to the Cloud Dataflow engine.

a. Click the More drop-down menu and select Query settings.

b. In the Query settings menu, select Dataflow engine.

c. In the prompt that appears if the Dataflow and Data Catalog APIs are not enabled, click Enable APIs.

d. Click Save.

3- After that, you can click in Create Cloud Dataflow job and your query will become a job in Dataflow.

— Dataflow SQL queries are similar to BigQuery standard SQL.
Currently Dataflow supports three different sources:

  • Pub/Sub topic
  • BigQuery table
  • Cloud Storage filesets

And supports two targets:

  • Pub/Sub topic
  • BigQuery table

From this you can see that you can use Dataflow SQL to join a stream of data from Pub/Sub with data from a BigQuery table.

Dataflow automatically chooses the settings that are optimal for your Dataflow SQL job.

Finally, data flow functions as the glue that ties together many of the services on gcp. It’s interoperable. Do you need to read data from bigquery and right to Big Table use data flow? She needs to read from pubsub and right to Cloud SQL. Yep, use data flow.

Alia Amr

Alia Amr is a data engineer who has practical experience in the fields of designing and building various data solutions. She enjoys optimizing data flow and data pipeline architecture and building them from the ground up, also, improve data reliability and quality.

guest
1 Comment
Inline Feedbacks
View all comments
Mohamed Hisham
Mohamed Hisham
October 14, 2020 11:45 pm

Good Article 👌, Specially its topic related to building Data engine using GCP
Go on 👍