Introduction to the historical version of Flink (I)

If you don't understandFlinkWhat is it? You can check out my previous introduction article:Introduction to Flink

If you want to learn flink with me, please check the subscription column:Flink column

This article lists important features in each release of Flink, which shows how Flink has evolved to this day step by step.

Flink was predecessored by the Stratosphere project, which was originally developed by a research team at the Berlin University of Technology. Stratosphere project aims to achieve highperformanceofBig data processingand analysis.

0.6.0

Posted in 2014, this isApacheThe first version inside, named Flink. It has the following characteristics:

Definition: ApacheFlink is a general-purpose data processing engine for clusters.
Jobs are executed through Flink's runtime engine.
Data is stored in Hadoop HDFS
Supported explorers: standalone, Hadoop YARN
Supported programming languages: Java, Scala

0.7.0

Published in 2014, the following new features are introduced:

FlinkStreaming: Provides a Java API that can process streaming data sources (such as Apache Kafka, Apache Flume, etc.) in real time.
Streaming Scala API: Scala and Java now have the same syntax and Transformation and remain synchronized in later versions.

0.8.0

Published in 2015, the following new features are introduced:

Extended file system: All file systems supported by Hadoop are now available in Flink
Support windows: A new window API was introduced to create windows, delete window elements, etc.
hadoop 2.2.0+ as the default dependency

0.9.0

Published in 2015, the following new features are introduced:

Introducing Table API:/flink/flink-docs-release-0.9/libs/
Introduce Gelly for graph processing:/2015/08/24/introducing-gelly-graph-processing-with-apache-flink/
Introducing machine learning library:/flink/flink-docs-release-0.9/libs/ml/
Introducing the Akka framework as Flink's RPC system:/
Introduce exact-once semantic guarantees for stream processing.

0.10.0

Published in 2015, the following new features are introduced:

Supports stream processing of event-time, ingestion-time, and processing-time
Support stateful stream processing: provides interfaces for defining, updating, and querying operator state (operator state)
Support for high availability: Introduce Zookeeper, support setting high availability mode for standalone clusters and YARN
Improve the DataStream API, introduce stream partitioning and window operators, and window design isDataFlowModel inspired by the concepts of window assigners, triggers and evictors
Introducing a new Connector:ElasticSearchandApache Nifi
Introduce a new Web Dashboard: You can view the progress of running the job and display real-time statistics on the amount of processed data and record counts. In addition, it provides access to TaskManager's resource usage and JVM statistics, including JVM heap usage and garbage collection details.
Introducing off-heap managed memory: managed memory can be allocated from off-heap memory, which is conducive to reducing the startup time of TaskManager and reducing garbage collection pressure.
Added native support for left, right, and outer join for DataSet API.

1.0.0

Published in 2016, the following new features are introduced:

supportRocksDBAs an implementation of the state backend, it is used to store state. RocksDB is an embedded key/value storage database originally developed by Facebook. When using this backend, the active state in the streaming program can be far beyond memory. RocksDB files are stored in a distributed file system like HDFS or S3 for backup.
Introduced savepoints: savepoints are checkpoints of the running stream job state, which users can manually trigger when the job is running. savepoints solves some production challenges, including code upgrades (applications and frameworks), cluster maintenance and migration, A/B testing and assumption scenarios, as well as testing and debugging.
Introduce a complex event processing library (CEP) to detect special patterns of complex event streams:CEP on Flink
Enhanced monitoring function: (1) Allow job submission; (2) Introduce backpressure monitoring
kafka connector enhancement: Allows multiple topics to be subscribed in 1 source

1.1.0

Published in 2016, the following new features are introduced:

connectors：
- Supports file as Source
- Support Kinesis as Source and Sink
- Support Cassandra as Sink

Table API and SQL: Supports more scalar functions, and supports external source and sink.
DataStream API: supports session windows and supports processing of delayed elements.
CEP supports Scala API
Allows the collection and exposure of job operation metrics to external systems

1.2.0

Published in 2017, the following new features are introduced:

Parallelism: Supports configuration of maximum parallelism; supports configuration of parallelism between jobs and operators; supports modification of parallelism (start jobs with new parallelism from savepoint)
Extensible non-partitioned state: Added an extensible non-partitioned state for operators like KafkaConsumer that do not use keyed state but rather operator state. To achieve scalability (parallelism change), operator state needs to be reassigned between parallel instances.
ProcessFunction: An underlying stream processing operation that allows access to the underlying building blocks of the stream application: Events (stream elements), State (consistency, fault tolerance), Timers (event time, processing time)
Asynchronous IO: Introduces a dedicated Async I/O operator to issue blocking calls asynchronously in a checkpoint manner
Added a new deployment method: Supports running a highly available Flink cluster on Apache Mesos.
Supports authentication of external services using Kerberos, such as Zookeep, Kafka, HDFS, and YARN.
Queryable State: Allows users to query the current state of the operator.
Application upgrade: Allows users to restart jobs from savepoint generated by version 1.1.4 without losing status.
Table API and SQL functionality enhancements
- Streaming Table supports adding scrolling, swiping, and session group window aggregation
- Supports more built-in SQL functions and operations, such as EXISTS, VALUES, LIMIT, CURRENT_DATE, INITCAP, NULLIF

1.3.0

Published in 2017, the following new features are introduced:

Large state processing/recovery
- RocksDB state backend supports incremental checkpointing, which can speed up checkpointing and save disk space
- File system and memory state backend support asynchronous snapshots implemented using copy-on-write hashmap
- Serializer that allows upgraded state
- Allows recovery of job state on operator granularity
- In the event of job failure, you can restart only the affected subgraph in the execution graph instead of restarting the entire execution graph
DataStream API
- Introduced Side Outputs: Allows one operator to have multiple output streams. The window operator can now output delayed window elements to side outputs
- Introducing the Union Operator State API, which is used to send all states to all parallel instances, and is generally used to send broadcast state
- Introduce Per-window State, open window state and window keyed state independently
Deployment and Tools
- HistoryServer: Allows to view statistics on the status of completed jobs
- watermark monitoring: The web front-end can view watermark of each operator
- Network buffer configuration: No absolute value is used anymore, 10% of JVM heap memory is used by default, and the minimum and maximum percentages can also be configured.

1.4.0

Published in 2017, the following new features are introduced:

TwoPhaseCommitSinkFunction: implements a two-stage submission algorithm, making it possible to build end-to-end precise semantics applications.
Table API and SQL
- Flink SQL supports window join based on processing time and event time
- Flink SQL SupportINSERT INTO SELECT
- The Table API supports aggregation on stream tables, and previously only supports projection, selection, and union.
- Supports new source and sinks, including Kafka 0.11 and JDBC sink
- Flink SQL Using Apache Calcite 1.14
If Hadoop is not needed, you can no longer introduce Hadoop's dependencies.

1.5.0

Published in 2018, the following new features are introduced:

deploy
- Support for dynamic resource allocation and dynamic resource release for YARN and Mesos has been added to improve resource utilization, failure recovery, and dynamic scaling.
- Simplified deployment on container management infrastructure (such as Kubernates), and now requests to JobManager are REST, including job submission, cancellation, cancellation of job status, and obtaining save points.

Broadcast Status
- A broadcast state is a state that exists on all parallel instances. Typical usage of broadcast state involves two streams, one is a regular data stream and the other is a control/configuration stream, by broadcasting rules or patterns to all parallel instances, they can be applied to all events of the regular stream.
- The broadcast state can also be saved, modified and queried like other states, under the semantics that ensure accurate once.
network
- The performance of distributed streaming applications depends largely on the network connection components that transfer events from one operator to another. In the context of stream processing, latency and throughput are two very important metrics.
Task local status recovery
- Flink's checkpoint mechanism writes state to remote persistent storage and loads from remote storage in the event of a failure. This mechanism ensures that state is not lost when the application fails. However, in the event of a failure, it takes some time to load the status from the remote storage and restore the application.
- To improve the efficiency of failure recovery, based on the fact that failures usually occur in a single operator, TaskManager, or machine, it is proposedRecovery of the local status of the task. When writing the state of the operator to remote storage, Flink can now also keep a copy on each machine's local disk. In the event of a failure, the scheduler tries to reschedule the task to the previous machine and loads the state from the local disk instead of remote storage, thus speeding up recovery.

Table API and SQL extensions support for join
- Supports table join within a limited time range, time supports processing time and event time. The following example connects table d and table a. The arrival time field of table a can be in the table's depthTime two-hour window.

SELECT , , 
FROM Departures d LEFT OUTER JOIN Arrivals a
  ON  = 
  AND  BETWEEN 
       AND  + '2' HOURS

Supports non-window inner join, which is equivalent to standard SQL statements

SELECT , , , 
FROM Users u JOIN Orders o
  ON  =

SQL CLI client: Supports batch and streaming SQL queries
The application can scale without manually triggering savepoint. Flink will automatically create savepoints, stop the application, and then scale to a new parallelism.

1.6.0

Published in 2018, the following new features are introduced:

Improve status support
- Supported state TTL: allows setting the survival time (TTL) for the state. State that exceeds the survival time will be cleaned up. Operator state and keyed state will not grow wirelessly and will not be included in subsequent checkpoints.
- RocksDB-based timer state: After the timer state can be stored in RocksDB, a larger timer state can be supported.
- Improved Flink's internal timer data structure so that the deletion complexity is reduced from O(N) to O(log n)

Deployment Options for Extending Flink
- A cluster container entry point is provided to boot the start job cluster.
- The client now sends all job-related content to the server via a POST call.
SQL ClientEnhanced
- Support user-defined functions
- Support batch query
- Support INSERT INTO statements
Table API and SQL Enhancements
- Flink's Table & SQL API supports left, right and full external connections, allowing continuous result update queries.
- SQL aggregation functions support the DISTINCT keyword.
- For windowed and non-windowed aggregations, queries such as COUNT (DISTINCT columns) are supported.
- Both the SQL and Table APIs now include more built-in functions, such as MD5, SHA1, SHA2, LOG, and UNNEST for multiple sets.
DataStream API
- Support interval join: Events from different streams can now be joined together, with elements of one stream within a specified time interval relative to elements of the other stream.

1.7.0

Published in 2018, the following new features are introduced:

Fully support Scala 2.12
State Mode Evolution（state schema evolution）
- Add state mode evolution, allowing columns to be added or deleted in the user state schema to cope with changes in demand.
- State mode evolution is an out-of-the-box feature when using Avro-generated classes as user state.

Introduced in version 1.6.0StreamingFileSinkNow supportAccurately onceSemantic guaranteed writingS3 file system
Streaming SQLSupportMATCH_RECOGNIZE, used to perform pattern matching on data streams.
Temporal table (temporal table)andtemporal join：
- The temporal table provides a historical view of the changes of the table, which can return the contents of the table at a specific point in time.
- temporal joinAllow processing time or event time to useGeneral data flowandtemporal tableGo to join
- For example, there is a tense table containing historical currency exchange rates, which can obtain the exchange rate corresponding to any point in time. There is a data stream that joins the event time and temporal table in the data stream to convert the data in the data stream at the exchange rate at the time.
SQL Client supports defining views

1.8.0

Published in 2019, introducing the following new features:

State Mode Evolution

POJO data type supports state mode evolution
Upgrade all Flink serializers to use the new serialization compatibility abstraction
Provide predefined snapshot implementations for common serializers

Status Cleanup

Previously, state TTL was introduced in version 1.6.0, allowing keyed state to be cleaned after specified time.
Continuous cleanup of state is now introduced for heap-based state backends and RocksDB-based state backends

SQL Mode Detection

Allows user-defined functions and aggregation in SQL mode detection (MATCH_RECOGNIZE)

1.9.0

Published in 2019, the following new features are introduced: Flink’s goal is to develop a stream processing system to unify and support multiple forms of real-time, offline and event-driven applications. In this release, streaming and batching capabilities are integrated into a unified runtime.

Supports batch job recovery for specific regions

Before version 1.9.0, after the batch job failed, if you restore, you need to cancel all tasks and restart the job.

After version 1.9.0, after the batch job fails, you can configure to cancel and restart the failed task only:-strategy: region

State Processor API

After version 1.9.0, externalqueryable stateTo access the job status, version 1.9.0 introduces a batch processing that can be usedDataSet APIRead, write, and modify job statusState Processor API。

NewState Processor APISupports all snapshots:savepoint, full amountcheckpointand incrementcheckpoint。

Stop with Savepoint

Now you can use svaepoint to stop the job:bin/flink stop -p [:targetDirectory] :jobId

SQL API supports DDL

Previously, the SQL API only supported DML statements, and now it can support DDL statements, such asCREATE TABLE，DROP TABLE

Integration Hive

Hive inHadoopIt is widely used in the ecosystem to store and query a large amount of structured data, and users can query and process all data stored in Hive.

Supports UDFs for Hive in Table API or SQL queries.

Introducing the Python Table API

1.10.0

Published in 2020, introducing the following new features:

improveMemory managementand configuration

Introduce managed memory to solve the memory usage problem of RocksDB state backend. In order to unify the configuration of batch jobs and stream jobs, managed memory is located off-heap (off-heap)
Simplify RocksDB configuration: (1) The configuration is out of the box and no longer requires tuning (2) Allow local memory to be configured to avoid exceeding the total memory

Unified job submission logic

Prior to this release, job submission was the responsibility of the execution environment and was closely related to different deployment methods (such as Yarn, Kubernates, Mesos).
A common Executor interface is introduced for job submission, and a JobClient is introduced to obtain job execution results.

Integrate native Kubernetes

Not finished

Since each version has been published in 1.11.0, each version contains many features, and each version will be introduced separately, so stay tuned.