Home ‐ general - puneet3663/databricks GitHub Wiki

A watermark in Spark Structured Streaming is a mechanism for handling late-arriving data: it sets a threshold for how long Spark keeps waiting for late events during windowed aggregations and other stateful streaming operations. When you call .withWatermark("event_time_column", "delay"), Spark tracks the maximum event time it has seen and computes the watermark as that maximum minus the delay. State that falls behind the watermark, such as a window whose end time is older than it, is finalized and freed, and events that arrive for that state afterwards are dropped.

For example, with a watermark of 10 minutes, a window keeps accepting late data until the maximum event time Spark has seen is more than 10 minutes past the window's end; only then is the window considered complete. The delay is measured in event time, not wall-clock time. This gives more accurate results when data can arrive late, and it also prevents unbounded memory growth by allowing old state to be discarded.
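The 10-minute example can be traced by hand with a small plain-Python simulation. This is a simplified model and not Spark's implementation: it updates the watermark after every event, whereas Spark advances it between micro-batches, and the function name `simulate_watermark` and the sample timestamps are invented for this sketch.

```python
from datetime import datetime, timedelta


def simulate_watermark(events, window_size, delay):
    """Count events per tumbling window, dropping events that arrive too late.

    `events` is a list of event-time datetimes in arrival order.
    Returns (counts, dropped): counts maps window start -> number of events,
    dropped lists the events whose window had already been finalized.
    """
    epoch = datetime(1970, 1, 1)
    max_event_time = None
    counts = {}
    dropped = []
    for event_time in events:
        # Tumbling window that contains this event.
        window_start = event_time - (event_time - epoch) % window_size
        window_end = window_start + window_size
        # Watermark = maximum event time seen so far minus the allowed delay.
        if max_event_time is not None and window_end <= max_event_time - delay:
            dropped.append(event_time)  # the window is already behind the watermark
        else:
            counts[window_start] = counts.get(window_start, 0) + 1
        if max_event_time is None or event_time > max_event_time:
            max_event_time = event_time
    return counts, dropped


def at(hour, minute):
    """Build a timestamp on an arbitrary sample day."""
    return datetime(2024, 1, 1, hour, minute)


arrivals = [
    at(12, 1),
    at(12, 4),
    at(12, 14),
    at(12, 3),   # late, but the watermark is 12:04 and the window ends at 12:05
    at(12, 16),  # moves the watermark to 12:06
    at(12, 2),   # too late: the 12:00-12:05 window is now behind the watermark
]
counts, dropped = simulate_watermark(
    arrivals, window_size=timedelta(minutes=5), delay=timedelta(minutes=10)
)
for start in sorted(counts):
    print(start.time(), counts[start])
print("dropped:", [t.time().isoformat() for t in dropped])
```

In this run the 12:03 event is still counted in the 12:00 window because the watermark (12:04) has not yet passed the window's end (12:05), while the 12:02 event is dropped once the 12:16 event has pushed the watermark to 12:06.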
