Dynamic Broadcast Variable Update - nipunarora/spark GitHub Wiki

Introduction

In this document, we describe the design of a dynamic broadcast variable update patch for Spark Streaming. Because broadcast variables are created as immutable values before the streaming context is started, developers cannot update them afterward. Our patch allows developers to update broadcast variables without shutting down or interrupting the streaming process.

Key points

  1. Dynamic broadcast variable update, with zero downtime
  2. Compact design (requires no external in-memory database, which keeps the ecosystem small)

Use case

This section describes a use case for dynamic broadcast updates. We have used this feature for the past year in a streaming log analysis solution that generates models from log sources.

The analysis is divided into two stages: an offline (non-streaming) stage, which runs periodically to generate updated models, and an online (streaming) stage, which processes incoming logs in real time based on the models learned offline. The models are stored in broadcast variables and used during the streaming process.
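
As a rough illustration of the online stage in stock Spark Streaming (without the patch), the sketch below broadcasts a model before the streaming context starts and reads it inside a stream transformation. The model type, the `isAnomalous` helper, and the socket source on `localhost:9999` are hypothetical placeholders chosen for this example and are not taken from our solution.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object OnlineStageSketch {
  // Hypothetical model: the set of log templates learned by the offline stage.
  type Model = Set[String]

  // Hypothetical scoring rule: a line is anomalous if it matches no known template.
  def isAnomalous(model: Model, line: String): Boolean =
    !model.exists(template => line.contains(template))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("online-stage-sketch")
    val ssc = new StreamingContext(conf, Seconds(5))

    // The model is broadcast once, before the streaming context is started.
    val model: Model = Set("connection accepted", "heartbeat ok")
    val b = ssc.sparkContext.broadcast(model)

    // Assumed input: a text socket stream standing in for the real log source.
    val logs = ssc.socketTextStream("localhost", 9999)

    // Executors read the broadcast value while processing each batch.
    val anomalies = logs.filter(line => isAnomalous(b.value, line))
    anomalies.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

In this form, the value behind `b` stays fixed for the lifetime of the job, so a newer model produced by the offline stage cannot reach the executors without restarting the stream. That gap is what the patch is meant to close.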

Design

When a broadcast variable is added at the driver:

var b = ssc.sparkContext.broadcast(i)

  • Internally, Spark allocates a block of memory at the driver node, where each broadcast variable is stored under a Spark-defined ID.