A side-by-side comparability of Apache Spark and Apache Flink for frequent streaming use circumstances

Spread the love

Apache Flink and Apache Spark are each open-source, distributed information processing frameworks used broadly for large information processing and analytics. Spark is understood for its ease of use, high-level APIs, and the flexibility to course of giant quantities of information. Flink shines in its capability to deal with processing of information streams in real-time and low-latency stateful computations. Each assist quite a lot of programming languages, scalable options for dealing with giant quantities of information, and a variety of connectors. Traditionally, Spark began out as a batch-first framework and Flink started as a streaming-first framework.

On this put up, we share a comparative research of streaming patterns which might be generally used to construct stream processing purposes, how they are often solved utilizing Spark (primarily Spark Structured Streaming) and Flink, and the minor variations of their method. Examples cowl code snippets in Python and SQL for each frameworks throughout three main themes: information preparation, information processing, and information enrichment. In case you are a Spark consumer seeking to remedy your stream processing use circumstances utilizing Flink, this put up is for you. We don’t intend to cowl the selection of know-how between Spark and Flink as a result of it’s necessary to guage each frameworks to your particular workload and the way the selection matches in your structure; quite, this put up highlights key variations to be used circumstances that each these applied sciences are generally thought-about for.

Apache Flink provides layered APIs that provide totally different ranges of expressiveness and management and are designed to focus on various kinds of use circumstances. The three layers of API are Course of Features (often known as the Stateful Stream Processing API), DataStream, and Desk and SQL. The Stateful Stream Processing API requires writing verbose code however provides probably the most management over time and state, that are core ideas in stateful stream processing. The DataStream API helps Java, Scala, and Python and provides primitives for a lot of frequent stream processing operations, in addition to a stability between code verbosity or expressiveness and management. The Desk and SQL APIs are relational APIs that provide assist for Java, Scala, Python, and SQL. They provide the very best abstraction and intuitive, SQL-like declarative management over information streams. Flink additionally permits seamless transition and switching throughout these APIs. To study extra about Flink’s layered APIs, discuss with layered APIs.

Apache Spark Structured Streaming provides the Dataset and DataFrames APIs, which offer high-level declarative streaming APIs to characterize static, bounded information in addition to streaming, unbounded information. Operations are supported in Scala, Java, Python, and R. Spark has a wealthy operate set and syntax with easy constructs for choice, aggregation, windowing, joins, and extra. You may also use the Streaming Desk API to learn tables as streaming DataFrames as an extension to the DataFrames API. Though it’s onerous to attract direct parallels between Flink and Spark throughout all stream processing constructs, at a really excessive stage, let’s imagine Spark Structured Streaming APIs are equal to Flink’s Desk and SQL APIs. Spark Structured Streaming, nonetheless, doesn’t but (on the time of this writing) supply an equal to the lower-level APIs in Flink that provide granular management of time and state.

Each Flink and Spark Structured Streaming (referenced as Spark henceforth) are evolving initiatives. The next desk supplies a easy comparability of Flink and Spark capabilities for frequent streaming primitives (as of this writing).

. Flink Spark
Row-based processing Sure Sure
Person-defined capabilities Sure Sure
Fantastic-grained entry to state Sure, through DataStream and low-level APIs No
Management when state eviction happens Sure, through DataStream and low-level APIs No
Versatile information constructions for state storage and querying Sure, through DataStream and low-level APIs No
Timers for processing and stateful operations Sure, through low stage APIs No

Within the following sections, we cowl the best frequent elements in order that we are able to showcase how Spark customers can relate to Flink and vice versa. To study extra about Flink’s low-level APIs, discuss with Course of Perform. For the sake of simplicity, we cowl the 4 use circumstances on this put up utilizing the Flink Desk API. We use a mix of Python and SQL for an apples-to-apples comparability with Spark.

Information preparation

On this part, we examine information preparation strategies for Spark and Flink.

Studying information

We first take a look at the best methods to learn information from an information stream. The next sections assume the next schema for messages:

image: string,
worth: int,
timestamp: timestamp,
    title: string,
    employees_count: int

Studying information from a supply in Spark Structured Streaming

In Spark Structured Streaming, we use a streaming DataFrame in Python that immediately reads the information in JSON format:

spark = ...  # spark session

# specify schema
stock_ticker_schema = ...

# Create a streaming DataFrame
df = spark.readStream 
    .choice("kafka.bootstrap.servers", "mybroker1:port") 
    .choice("matter", "stock_ticker") 
    .choose(from_json(col("worth"), stock_ticker_schema).alias("ticker_data")) 

Observe that now we have to provide a schema object that captures our inventory ticker schema (stock_ticker_schema). Evaluate this to the method for Flink within the subsequent part.

Studying information from a supply utilizing Flink Desk API

For Flink, we use the SQL DDL assertion CREATE TABLE. You possibly can specify the schema of the stream identical to you’d any SQL desk. The WITH clause permits us to specify the connector to the information stream (Kafka on this case), the related properties for the connector, and information format specs. See the next code:

# Create desk utilizing DDL

CREATE TABLE stock_ticker (
  image string,
  worth INT,
  timestamp TIMESTAMP(3),
  company_info STRING,
  WATERMARK FOR timestamp AS timestamp - INTERVAL '3' MINUTE
) WITH (
 'connector' = 'kafka',
 'matter' = 'stock_ticker',
 'properties.bootstrap.servers' = 'mybroker1:port',
 'properties.group.id' = 'testGroup',
 'format' = 'json',
 'json.fail-on-missing-field' = 'false',
 'json.ignore-parse-errors' = 'true'

JSON flattening

JSON flattening is the method of changing a nested or hierarchical JSON object right into a flat, single-level construction. This converts a number of ranges of nesting into an object the place all of the keys and values are on the similar stage. Keys are mixed utilizing a delimiter resembling a interval (.) or underscore (_) to indicate the unique hierarchy. JSON flattening is helpful when you have to work with a extra simplified format. In each Spark and Flink, nested JSONs will be sophisticated to work with and may have extra processing or user-defined capabilities to govern. Flattened JSONs can simplify processing and enhance efficiency on account of decreased computational overhead, particularly with operations like complicated joins, aggregations, and windowing. As well as, flattened JSONs might help in simpler debugging and troubleshooting information processing pipelines as a result of there are fewer ranges of nesting to navigate.

JSON flattening in Spark Structured Streaming

JSON flattening in Spark Structured Streaming requires you to make use of the choose technique and specify the schema that you simply want flattened. JSON flattening in Spark Structured Streaming entails specifying the nested discipline title that you simply’d like surfaced to the top-level record of fields. Within the following instance, company_info is a nested discipline and inside company_info, there’s a discipline referred to as company_name. With the next question, we’re flattening company_info.title to company_name:

stock_ticker_df = ...  # Streaming DataFrame w/ schema proven above

stock_ticker_df.choose("image", "timestamp", "worth", "company_info.title" as "company_name")

JSON flattening in Flink

In Flink SQL, you need to use the JSON_VALUE operate. Observe that you need to use this operate solely in Flink variations equal to or higher than 1.14. See the next code:

   JSON_VALUE(company_info, 'lax $.title' DEFAULT NULL ON EMPTY) AS company_name

The time period lax within the previous question has to do with JSON path expression dealing with in Flink SQL. For extra info, discuss with System (Constructed-in) Features.

Information processing

Now that you’ve learn the information, we are able to take a look at a number of frequent information processing patterns.


Information deduplication in stream processing is essential for sustaining information high quality and making certain consistency. It enhances effectivity by decreasing the pressure on the processing from duplicate information and helps with price financial savings on storage and processing.

Spark Streaming deduplication question

The next code snippet is expounded to a Spark Streaming DataFrame named stock_ticker. The code performs an operation to drop duplicate rows primarily based on the image column. The dropDuplicates technique is used to eradicate duplicate rows in a DataFrame primarily based on a number of columns.

stock_ticker = ...  # Streaming DataFrame w/ schema proven above


Flink deduplication question

The next code exhibits the Flink SQL equal to deduplicate information primarily based on the image column. The question retrieves the primary row for every distinct worth within the image column from the stock_ticker stream, primarily based on the ascending order of proctime:

SELECT image, timestamp, worth
    ROW_NUMBER() OVER (PARTITION BY image ORDER BY proctime ASC) AS row_num
  FROM stock_ticker)
WHERE row_num = 1


Windowing in streaming information is a basic assemble to course of information inside specs. Home windows generally have time bounds, variety of information, or different standards. These time bounds bucketize steady unbounded information streams into manageable chunks referred to as home windows for processing. Home windows assist in analyzing information and gaining insights in actual time whereas sustaining processing effectivity. Analyses or operations are carried out on continuously updating streaming information inside a window.

There are two frequent time-based home windows used each in Spark Streaming and Flink that we’ll element on this put up: tumbling and sliding home windows. A tumbling window is a time-based window that may be a fastened measurement and doesn’t have any overlapping intervals. A sliding window is a time-based window that may be a fastened measurement and strikes ahead in fastened intervals that may be overlapping.

Spark Streaming tumbling window question

The next is a Spark Streaming tumbling window question with a window measurement of 10 minutes:

stock_ticker = ...  # Streaming DataFrame w/ schema proven above

# Get max inventory worth in tumbling window
# of measurement 10 minutes
visitsByWindowAndUser = visits
   .withWatermark("timestamp", "3 minutes")
      window(stock_ticker.timestamp, "10 minutes"),

Flink Streaming tumbling window question

The next is an equal tumbling window question in Flink with a window measurement of 10 minutes:

SELECT image, MAX(worth)
    TUMBLE(TABLE stock_ticker, DESCRIPTOR(timestamp), INTERVAL '10' MINUTES))
  GROUP BY ticker;

Spark Streaming sliding window question

The next is a Spark Streaming sliding window question with a window measurement of 10 minutes and slide interval of 5 minutes:

stock_ticker = ...  # Streaming DataFrame w/ schema proven above

# Get max inventory worth in sliding window
# of measurement 10 minutes and slide interval of measurement
# 5 minutes

visitsByWindowAndUser = visits
   .withWatermark("timestamp", "3 minutes")
      window(stock_ticker.timestamp, "10 minutes", "5 minutes"),

Flink Streaming sliding window question

The next is a Flink sliding window question with a window measurement of 10 minutes and slide interval of 5 minutes:

SELECT image, MAX(worth)
    HOP(TABLE stock_ticker, DESCRIPTOR(timestamp), INTERVAL '5' MINUTES, INTERVAL '10' MINUTES))
  GROUP BY ticker;

Dealing with late information

Each Spark Structured Streaming and Flink assist occasion time processing, the place a discipline throughout the payload can be utilized for outlining time home windows as distinct from the wall clock time of the machines doing the processing. Each Flink and Spark use watermarking for this objective.

Watermarking is utilized in stream processing engines to deal with delays. A watermark is sort of a timer that units how lengthy the system can watch for late occasions. If an occasion arrives and is throughout the set time (watermark), the system will use it to replace a request. If it’s later than the watermark, the system will ignore it.

Within the previous windowing queries, you specify the lateness threshold in Spark utilizing the next code:

.withWatermark("timestamp", "3 minutes")

Because of this any information which might be 3 minutes late as tracked by the occasion time clock shall be discarded.

In distinction, with the Flink Desk API, you may specify an identical lateness threshold immediately within the DDL:

WATERMARK FOR timestamp AS timestamp - INTERVAL '3' MINUTE

Observe that Flink supplies extra constructs for specifying lateness throughout its varied APIs.

Information enrichment

On this part, we examine information enrichment strategies with Spark and Flink.

Calling an exterior API

Calling exterior APIs from user-defined capabilities (UDFs) is analogous in Spark and Flink. Observe that your UDF shall be referred to as for each report processed, which may end up in the API getting referred to as at a really excessive request fee. As well as, in manufacturing situations, your UDF code typically will get run in parallel throughout a number of nodes, additional amplifying the request fee.

For the next code snippets, let’s assume that the exterior API name entails calling the operate:

response = my_external_api(request)

Exterior API name in Spark UDF

The next code makes use of Spark:

class Predict(ScalarFunction):
def open(self, function_context):

with open("sources.zip/sources/mannequin.pkl", "rb") as f:
self.mannequin = pickle.load(f)

def eval(self, x):
return self.mannequin.predict(x)

Exterior API name in Flink UDF

For Flink, assume we outline the UDF callExternalAPIUDF, which takes as enter the ticker image image and returns enriched details about the image through a REST endpoint. We are able to then register and name the UDF as follows:

callExternalAPIUDF = udf(callExternalAPIUDF(), result_type=DataTypes.STRING())

    callExternalAPIUDF(image) as enriched_symbol
FROM stock_ticker;

Flink UDFs present an initialization technique that will get run one time (versus one time per report processed).

Observe that you need to use UDFs judiciously as an improperly carried out UDF may cause your job to decelerate, trigger backpressure, and finally stall your stream processing utility. It’s advisable to make use of UDFs asynchronously to keep up excessive throughput, particularly for I/O-bound use circumstances or when coping with exterior sources like databases or REST APIs. To study extra about how you need to use asynchronous I/O with Apache Flink, discuss with Enrich your information stream asynchronously utilizing Amazon Kinesis Information Analytics for Apache Flink.


Apache Flink and Apache Spark are each quickly evolving initiatives and supply a quick and environment friendly approach to course of huge information. This put up centered on the highest use circumstances we generally encountered when clients needed to see parallels between the 2 applied sciences for constructing real-time stream processing purposes. We’ve included samples that had been most continuously requested on the time of this writing. Tell us for those who’d like extra examples within the feedback part.

In regards to the creator

Deepthi Mohan is a Principal Product Supervisor on the Amazon Kinesis Information Analytics workforce.

Karthi Thyagarajan was a Principal Options Architect on the Amazon Kinesis workforce.

Leave a Reply

Your email address will not be published. Required fields are marked *