PyFlink Series (Intro)


So you need to add stream processing to your stack.

Maybe you need to:

  • Provide near-real-time (NRT) capabilities
  • Orchestrate ML/AI training and augmentation pipelines
  • Support an event-driven architecture
  • Stream your data and process it in flight
  • Reduce the size and cost of traditional data lake patterns
  • Modernize your data engineering efforts

There are of course many other reasons to stream.

One I see a lot in the wild is when organizations are trying to reduce latency in their data delivery so that the data is more actionable.

Maybe they are using Spark or dbt and working to get their scheduled tasks as small as possible. You’ll hear terms like “micro-batching”, and what folks are really trying to get to is streaming.

You can batch with streaming; after all, a batch is just a bounded stream of size n.
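
As a minimal sketch (assuming a local `pip install apache-flink`; the element values are purely illustrative), the same DataStream API program can run as a batch job simply by giving it a bounded source and the BATCH runtime mode:

```python
# Minimal sketch: a bounded (finite) source run under Flink's BATCH runtime mode.
# Assumes a local `pip install apache-flink`; the elements below are illustrative.
from pyflink.common import Types
from pyflink.datastream import StreamExecutionEnvironment, RuntimeExecutionMode

env = StreamExecutionEnvironment.get_execution_environment()
# With bounded sources, the same DataStream program can execute as a batch job.
env.set_runtime_mode(RuntimeExecutionMode.BATCH)

# A finite collection is a bounded stream -- effectively a "batch" of size n.
env.from_collection([1, 2, 3, 4, 5], type_info=Types.INT()) \
   .map(lambda x: x * 2, output_type=Types.INT()) \
   .print()

env.execute("bounded_stream_as_batch")
```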

However, no matter how “micro” you make your batches in a batch-oriented system, you’ll never quite achieve streaming.

This is especially true when you consider higher-order events.


PyFlink

Maybe you’re on a team of elite Pythonistas.

Perhaps Python’s vast ecosystem of packages offers the best fit for your use case (ML, anyone?).

You may have use cases that can’t be easily resolved with Flink’s pure SQL interface, but organizational issues prevent the use of JVM languages directly.

Fortunately, the Apache Flink ecosystem has us covered with PyFlink: Pythonic bindings to the underlying JVM constructs via Py4J.

Using the Py4J bridge, you can directly interact with “real” Flink JVM objects from Python, and Flink’s various APIs can call back to your Python objects as well.
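
To make the bridge concrete, here’s a rough sketch. The underscore-prefixed attribute below is a PyFlink internal rather than public API, shown only to illustrate that a live JVM object sits behind each Python wrapper:

```python
# Rough sketch of the Py4J bridge (assumes a local `pip install apache-flink`).
# `_j_stream_execution_environment` and `get_gateway()` are PyFlink internals,
# used here only to peek at the JVM objects behind the Python wrappers.
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.java_gateway import get_gateway

env = StreamExecutionEnvironment.get_execution_environment()

# The Python wrapper delegates to a real org.apache.flink StreamExecutionEnvironment
# instance living in the JVM, reachable as a Py4J proxy object.
j_env = env._j_stream_execution_environment
print(type(j_env))  # a py4j JavaObject proxy

# The gateway also exposes the JVM itself, so arbitrary Flink classes are reachable.
gateway = get_gateway()
print(gateway.jvm.org.apache.flink.api.common.RuntimeExecutionMode.BATCH)
```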

Note: some (very) high-throughput use cases may see performance impacts when using PyFlink versus the JVM directly, but Flink has escape hatches.


Let’s dive in!

The code used in the following examples can be found over at github.com

We’re not going to get into any specific processing goals in this post; instead, we’ll focus on getting a local setup running that can then be deployed to AWS MAF (Amazon Managed Service for Apache Flink).

Setting up a job

After cloning the repo, set up your virtual environment by running:

make fresh

This will create a virtual environment, activate it, and install the required dependencies.

Make sure you have Python 3.8 or greater installed, as well as the virtualenv package.

Up next: Part 1 (Basic DataStream API Job)