didismusings.com

Transforming Batch Scripts to Streaming with Apache Beam

Written on

Introduction to Apache Beam

Is it feasible to convert a batch processing script to a streaming one without complications? Absolutely! With Apache Beam, this transformation is quite straightforward.

Apache Beam Overview

Data Pipeline Basics

To kick things off, I crafted a basic script using Apache Beam’s Direct Runner, enabling local execution rather than on a Spark or Dataflow engine. The dataset utilized consists of flight information, encompassing details such as flight number, origin, destination, and delays in both departure and arrival.

Flight Data Sample

Filtering Flight Data

I developed a routine that filters records with a positive arrival delay from column 8 (starting at index 0) and retrieves the associated airport information from column 4. The script is as follows:

# Sample script code here

Let’s break down the script:

  • Lines 5 and 6: Establish the pipeline as P1.
  • Lines 8 and 9: Set the key.json file for Google authentication.
  • Line 13: Import the bucket file into Google Cloud Platform (GCP).
  • Line 14: Split the records by commas.
  • Line 15: Filter for delays greater than 0 in column 8.
  • Line 16: Select relevant columns for flight ID and delay.
  • Line 17: Output the results.

Sounds simple, right? But what if I need to view this data in real-time?

Transitioning to Real-Time Data

To achieve real-time insights, I implemented a Python routine that sends data lines to PubSub at regular intervals. (Details on this process will not be covered here.) Consequently, I modified the script for streaming:

# Adjusted script code here

Here’s what I changed:

  • ADDED Line 6: This indicates the pipeline is now for streaming.
  • CHANGED Line 14: The data source transitioned from a GCP bucket file to a PubSub subscription for real-time message handling.
  • CHANGED Line 15: I decoded PubSub messages before splitting since they are transmitted in bytes.
  • ADDED Line 22: I adjusted the pipeline to run continuously rather than just once.

And that’s it!

While the example is straightforward, it illustrates the differences between executing a script in Apache Beam for batch processing and streaming using the Direct Runner.

Further Learning Resources

If you're keen to dive deeper, I offer an introductory course on this technology on Udemy. Note that the English version does not cover streaming due to technical issues.

Udemy Course on Data Engineering

Links to Courses:

Additional Insights for Data Engineers

Key concepts every Data Engineer should be familiar with include:

  • Data Modeling, CDC, Idempotency, ETL vs. ELT, Kappa vs. Lambda Architectures, Slowly Changing Dimensions (SCD), and more.

Software Engineering with Python:

  • Foundations, Modules, Classes, and Maintainability.

Efficiency in Python:

  • Starting with the Basics, Tools for Code Evaluation, Performance Enhancement, and Pandas Optimization.
Python Efficiency Techniques

You can access and enroll in the courses here.

Chapter 1: Introduction to Streaming with Apache Beam

This video titled "How to Write Batch or Streaming Data Pipelines with Apache Beam in 15 mins with James Malone" provides a concise guide to creating data pipelines.

Chapter 2: Unified Batch and Streaming Pipelines

In this video, "How to build unified Batch & Streaming Pipelines with Apache Beam and Dataflow," you'll learn how to effectively manage both types of data processing.

Share the page:

Twitter Facebook Reddit LinkIn

-----------------------

Recent Post:

# Understanding the Cycle of Narcissistic Relationships

Explore the tumultuous nature of narcissistic relationships, highlighting the cycle of love, manipulation, and despair.

Is Writing Your Own Story Easier? A Personal Reflection

A personal exploration of the challenges and ease of writing one's own stories versus others.

The Bizarre Incident of Meat Falling from the Sky

Explore the strange 19th-century event when meat inexplicably fell from the sky in Kentucky, and the theories surrounding it.