Introduction to Apache Flink

Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink has been designed to run in all common cluster environments, perform computations at in-memory speed and at any scale.
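To make "stateful computation over a stream" concrete, here is a minimal sketch in plain Python. It is not Flink's API; the function name `running_counts` and the sample events are hypothetical and chosen only for illustration. The sketch keeps one counter per key as state and emits an updated result for every incoming event, and it behaves the same way on a bounded list and on an unbounded generator.

```python
from collections import defaultdict
from itertools import count, islice


def running_counts(events):
    """Yield (key, count_so_far) for each event as it arrives.

    The per-key counter is the "state" that survives between events.
    Illustrative stand-in only, not Flink's actual API.
    """
    state = defaultdict(int)  # key -> number of events seen so far
    for key in events:
        state[key] += 1
        yield key, state[key]


# Bounded stream: a finite list of events.
bounded = ["click", "view", "click"]
bounded_result = list(running_counts(bounded))

# Unbounded stream: an endless generator; only the first 4 results are taken.
unbounded = ("even" if n % 2 == 0 else "odd" for n in count())
unbounded_result = list(islice(running_counts(unbounded), 4))

print(bounded_result)
print(unbounded_result)
```

In a real engine the state would also be partitioned across machines and checkpointed for fault tolerance; this sketch deliberately leaves both out.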

Apache Flink is a powerful open source engine that provides:

  • Batch Processing
  • Interactive Processing
  • Real-time (Streaming) Processing
  • Graph Processing
  • Iterative Processing
  • In-memory Processing

Flink can handle all of these requirements, which together cover the main kinds of data processing needed in industry. Previously, we had to combine multiple frameworks, for example MapReduce for batch and Storm for streaming, and that was very complex. Flink offers a single unified platform that addresses all of these requirements, with lightning-fast speed, ease of use, and sophisticated analytics.

Advantages of Flink:

  • Flink is a true streaming engine. It does not cut the stream into micro-batches as Spark does; it processes the data as soon as it arrives
  • Flink’s core is a streaming dataflow engine that provides distribution, communication, and fault tolerance for distributed computations
  • Flink is a general-purpose framework that aims to unify different data workloads. There is no need for several specialized engines; a single unified platform, Apache Flink, covers all of these requirements
  • It processes events at a consistently high rate with latency as low as milliseconds
  • Flink is an open source platform for distributed stream and batch processing
  • It supports large-scale data processing
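The difference between per-event processing and micro-batching in the first point can be sketched with a toy latency model in plain Python. This is an assumption-laden illustration, not a benchmark of Flink or Spark: it ignores processing time entirely, and the function names and the 500 ms batch interval are hypothetical.

```python
def per_event_latencies(arrival_times_ms):
    """Per-event model: each event is handled the moment it arrives,
    so the modelled waiting time is 0 for every event."""
    return [0 for _ in arrival_times_ms]


def micro_batch_latencies(arrival_times_ms, batch_interval_ms):
    """Micro-batch model: an event waits until the end of the batch
    window [k*B, (k+1)*B) that it falls into."""
    latencies = []
    for t in arrival_times_ms:
        window_end = (t // batch_interval_ms + 1) * batch_interval_ms
        latencies.append(window_end - t)
    return latencies


arrivals = [5, 120, 480, 510, 990]  # hypothetical arrival times in ms

streaming = per_event_latencies(arrivals)
batched = micro_batch_latencies(arrivals, batch_interval_ms=500)

print(streaming)  # [0, 0, 0, 0, 0]
print(batched)    # [495, 380, 20, 490, 10]
```

In this model an event that arrives just after a window opens waits almost the full batch interval, which is the latency that per-event processing avoids.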

Ecosystem:

The following diagram shows the Flink core layer and the other layers available on top of it.

Runtime (Kernel):

At the heart of Apache Flink is the runtime, also known as the kernel of Flink. It is a distributed streaming dataflow engine.
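As a rough picture of what "streaming dataflow" means, the sketch below chains a source, two operators, and a sink using plain Python generators, so each record flows through the whole chain one at a time. All names (`source`, `parse`, `only_hot`, `run_pipeline`) and the sensor data are hypothetical, and the sketch runs in a single process; the distribution and fault tolerance that Flink's runtime provides are not modelled.

```python
def source(lines):
    """Source operator: emit raw records one at a time."""
    for line in lines:
        yield line


def parse(records):
    """Map operator: turn 'sensor,value' text into (sensor, float) pairs."""
    for record in records:
        sensor, value = record.split(",")
        yield sensor, float(value)


def only_hot(readings, threshold):
    """Filter operator: keep readings strictly above the threshold."""
    for sensor, value in readings:
        if value > threshold:
            yield sensor, value


def run_pipeline(lines, threshold=30.0):
    """Wire source -> parse -> filter -> sink and collect the sink's contents."""
    sink = []
    for item in only_hot(parse(source(lines)), threshold):
        sink.append(item)
    return sink


alerts = run_pipeline(["s1,21.5", "s2,35.0", "s1,31.2"])
print(alerts)  # [('s2', 35.0), ('s1', 31.2)]
```

Because generators are lazy, no operator waits for the full input before passing a record on, which is the property the dataflow model relies on.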

APIs and Libraries:

On top of the runtime layer, several APIs and libraries are available. Broadly, they fall into two categories:

  1. DataSet API, for batch processing
    • ML for machine learning
    • Gelly for graph processing
    • Table for SQL processing
  2. DataStream API, for stream processing
    • Table

Note that this ecosystem is only a processing engine: there is no storage layer, so Flink depends on third-party storage systems.

Deploy:

Flink can be deployed in one of the following modes:

  • Local mode, used for development and testing; deploys on
    • Local machine (single JVM)
  • Cluster mode; deploys on
    • Standalone
    • YARN (commonly used)
    • Mesos
    • Tez
  • Cloud mode; deploys on
    • Google Compute Engine (GCE)
    • Amazon EC2

Storage:

Flink can read data from various storage systems, such as:

  1. File systems
    • Local file system
    • HDFS
    • S3
  2. Databases
    • MongoDB
    • HBase
    • Relational databases
  3. Streams
    • RabbitMQ
    • Kafka
    • Flume

Ahmed Hesham

Ahmed Hesham is a Data Engineer with a great passion for designing and developing data engineering solutions. He specializes in designing and implementing them using a variety of data engineering platforms and tools.
