Transforming Batch Scripts to Streaming with Apache Beam

Introduction to Apache Beam

Can a batch processing script be converted to a streaming one without much trouble? Yes. With Apache Beam, the conversion takes only a handful of changes.

Data Pipeline Basics

To start, I wrote a basic script that uses Apache Beam’s Direct Runner, which executes the pipeline locally instead of on an engine such as Spark or Dataflow. The dataset contains flight information: flight number, origin, destination, and the departure and arrival delays.

Filtering Flight Data

The script keeps only the records whose arrival delay, stored in column 8 (counting from index 0), is positive, and extracts the associated airport from column 4. The script is as follows:

# Sample script code here

Let’s break down the script:

Lines 5 and 6: Establish the pipeline as P1.
Lines 8 and 9: Set the key.json file for Google authentication.
Line 13: Read the input file from a Google Cloud Platform (GCP) bucket.
Line 14: Split the records by commas.
Line 15: Filter for delays greater than 0 in column 8.
Line 16: Select relevant columns for flight ID and delay.
Line 17: Output the results.
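
Since the listing itself is not reproduced here, the steps above can be sketched roughly as follows. This is a minimal illustration, not the original script: the function names, the `input_path` parameter, and the assumption that the file has no header row and that column 8 always holds a number are mine. Column 4 is taken to be the airport, as described earlier.

```python
import os

ARRIVAL_DELAY_COL = 8  # arrival delay, counting from 0
AIRPORT_COL = 4        # airport, counting from 0


def split_record(line):
    """Split one comma-separated line into its fields."""
    return line.split(",")


def is_delayed(fields):
    """Keep only records whose arrival delay is greater than 0."""
    return float(fields[ARRIVAL_DELAY_COL]) > 0


def airport_and_delay(fields):
    """Reduce a record to (airport, arrival delay)."""
    return fields[AIRPORT_COL], float(fields[ARRIVAL_DELAY_COL])


def run_batch(input_path, key_file="key.json"):
    """Hypothetical entry point: read a file, filter it, print the result."""
    # Imported here so the helpers above work without Beam installed.
    import apache_beam as beam

    # Point the Google client libraries at the service-account key.
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = key_file

    # With no runner specified, Beam falls back to the Direct Runner.
    with beam.Pipeline() as p1:
        (
            p1
            | "Read file" >> beam.io.ReadFromText(input_path)
            | "Split on commas" >> beam.Map(split_record)
            | "Delayed only" >> beam.Filter(is_delayed)
            | "Airport and delay" >> beam.Map(airport_and_delay)
            | "Print" >> beam.Map(print)
        )
```

Leaving the `with` block runs the pipeline once and waits for it to finish, which is the batch behavior. `input_path` can be a local file or a `gs://` path to a bucket object.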

Sounds simple, right? But what if I need to view this data in real-time?

Transitioning to Real-Time Data

To get results in real time, I wrote a Python routine that sends lines of the dataset to PubSub at regular intervals. (The details of that routine are not covered here.) I then modified the script for streaming:

# Adjusted script code here

Here’s what I changed:

ADDED Line 6: Marks the pipeline as a streaming pipeline.
CHANGED Line 14: The data source is now a PubSub subscription, which delivers messages in real time, instead of a file in a GCP bucket.
CHANGED Line 15: PubSub messages arrive as bytes, so each one is decoded before it is split.
ADDED Line 22: The pipeline now keeps running continuously instead of running once and exiting.
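
The four changes listed above might look like this in code. Again, this is a hedged sketch rather than the original listing: the function names and the `subscription` parameter are illustrative, and using `wait_until_finish()` to keep the pipeline running is my assumption about what the added last line does.

```python
ARRIVAL_DELAY_COL = 8  # arrival delay, counting from 0
AIRPORT_COL = 4        # airport, counting from 0


def parse_message(message):
    """PubSub payloads arrive as bytes: decode first, then split."""
    return message.decode("utf-8").strip().split(",")


def is_delayed(fields):
    """Keep only records whose arrival delay is greater than 0."""
    return float(fields[ARRIVAL_DELAY_COL]) > 0


def airport_and_delay(fields):
    """Reduce a record to (airport, arrival delay)."""
    return fields[AIRPORT_COL], float(fields[ARRIVAL_DELAY_COL])


def run_streaming(subscription):
    """Hypothetical entry point; `subscription` has the form
    projects/<project>/subscriptions/<name>."""
    # Imported here so the helpers above work without Beam installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import (
        PipelineOptions,
        StandardOptions,
    )

    options = PipelineOptions()
    # ADDED: tell Beam this is a streaming pipeline.
    options.view_as(StandardOptions).streaming = True

    p1 = beam.Pipeline(options=options)
    (
        p1
        # CHANGED: read from a subscription instead of a bucket file.
        | "Read messages" >> beam.io.ReadFromPubSub(subscription=subscription)
        # CHANGED: decode the bytes before splitting.
        | "Decode and split" >> beam.Map(parse_message)
        | "Delayed only" >> beam.Filter(is_delayed)
        | "Airport and delay" >> beam.Map(airport_and_delay)
        | "Print" >> beam.Map(print)
    )

    # ADDED: start the pipeline and block while it keeps running.
    result = p1.run()
    result.wait_until_finish()
```

The filter and the column selection are the same as in the batch case. Only the options, the source, the decoding step, and the way the pipeline is run differ.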

And that’s it!

The example is simple, but it shows how little separates a batch script from a streaming one in Apache Beam when both run on the Direct Runner.
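
To try the streaming version, something has to feed the subscription. The sender routine is not shown in this post, so the following is only a sketch of what such a routine could look like. `publish_lines`, `make_pubsub_publish`, and their parameters are hypothetical names, and the second function assumes the `google-cloud-pubsub` package is installed.

```python
import time


def publish_lines(lines, publish, interval_seconds=1.0, sleep=time.sleep):
    """Send each non-empty line as a UTF-8 payload, pausing between sends.

    `publish` is any callable that accepts bytes. Returns the number of
    lines that were sent.
    """
    sent = 0
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines
        publish(text.encode("utf-8"))
        sent += 1
        sleep(interval_seconds)
    return sent


def make_pubsub_publish(project_id, topic_id):
    """Build a publish(data) callable backed by google-cloud-pubsub."""
    # Imported here so publish_lines can be used without the package.
    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(project_id, topic_id)

    def publish(data):
        # publish() returns a future; result() waits for the server to accept it.
        publisher.publish(topic_path, data).result()

    return publish
```

With a data file open as `f`, a call such as `publish_lines(f, make_pubsub_publish(project_id, topic_id))` would send one line per second. Passing the publish function in as a parameter keeps the timing loop easy to test with a stand-in such as `list.append`.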

Further Learning Resources

If you're keen to dive deeper, I offer an introductory course on this technology on Udemy. Note that the English version does not cover streaming due to technical issues.

Additional Insights for Data Engineers

Key concepts every Data Engineer should be familiar with include:

Data Modeling, CDC, Idempotency, ETL vs. ELT, Kappa vs. Lambda Architectures, Slowly Changing Dimensions (SCD), and more.

Software Engineering with Python:

Foundations, Modules, Classes, and Maintainability.

Efficiency in Python:

Starting with the Basics, Tools for Code Evaluation, Performance Enhancement, and Pandas Optimization.

You can access and enroll in the courses here.

Chapter 1: Introduction to Streaming with Apache Beam

This video titled "How to Write Batch or Streaming Data Pipelines with Apache Beam in 15 mins with James Malone" provides a concise guide to creating data pipelines.

Chapter 2: Unified Batch and Streaming Pipelines

In this video, "How to build unified Batch & Streaming Pipelines with Apache Beam and Dataflow," you'll learn how to effectively manage both types of data processing.

didismusings.com
