Data engineering with Databricks

Databricks provides Lakeflow, an end-to-end data engineering solution that empowers data engineers, software developers, SQL developers, analysts, and data scientists to deliver high-quality data for downstream analytics, AI, and operational applications. Lakeflow is a unified solution for ingestion, transformation, and orchestration of your data, and includes Lakeflow Connect, Lakeflow Declarative Pipelines, and Lakeflow Jobs.

Lakeflow Connect

Lakeflow Connect simplifies data ingestion with connectors to popular enterprise applications, databases, cloud storage, message buses, and local files. See Lakeflow Connect.

| Feature | Description |
| --- | --- |
| Managed connectors | Managed connectors provide a simple UI and a configuration-based ingestion service with minimal operational overhead, without requiring you to use the underlying Lakeflow Declarative Pipelines APIs and infrastructure. |
| Standard connectors | Standard connectors let you access data from a wider range of data sources from within your Lakeflow Declarative Pipelines or other queries. |
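
For example, a standard connector such as the Kafka source can be queried directly with the Structured Streaming API. The following is a minimal sketch, assuming a placeholder broker and topic and the `spark` session that Databricks notebooks provide:

```python
from pyspark.sql import functions as F

# Read from a message bus (Kafka) with Structured Streaming.
# The broker address and topic name are placeholders.
raw_events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker-1:9092")  # placeholder broker
    .option("subscribe", "orders")                        # placeholder topic
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers binary key/value columns; decode the payload to a string.
decoded = raw_events.select(
    F.col("key").cast("string").alias("key"),
    F.col("value").cast("string").alias("payload"),
    "timestamp",
)
```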

Lakeflow Declarative Pipelines

Lakeflow Declarative Pipelines is a declarative framework that lowers the complexity of building and managing efficient batch and streaming data pipelines. Lakeflow Declarative Pipelines runs on the performance-optimized Databricks Runtime. In addition, Lakeflow Declarative Pipelines automatically orchestrates the execution of flows, sinks, streaming tables, and materialized views by encapsulating and running them as a pipeline. See Lakeflow Declarative Pipelines.
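
For example, a pipeline can declare a streaming table fed by Auto Loader and a materialized view derived from it. This is a minimal sketch using the Python `dlt` module; the storage path, table names, and column names are placeholders, and the `spark` session is the one Databricks notebooks provide:

```python
import dlt
from pyspark.sql import functions as F

# Streaming table: incrementally ingests new JSON files from a cloud storage
# location (placeholder path) using Auto Loader.
@dlt.table(name="raw_orders", comment="Raw orders ingested incrementally")
def raw_orders():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/Volumes/main/default/orders_landing")  # placeholder volume path
    )

# Materialized view: a batch aggregation over the streaming table that the
# pipeline keeps up to date on each refresh.
@dlt.table(name="daily_order_totals", comment="Order counts per day")
def daily_order_totals():
    return (
        spark.read.table("raw_orders")
        .groupBy(F.to_date("order_ts").alias("order_date"))  # placeholder column
        .agg(F.count("*").alias("order_count"))
    )
```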

| Feature | Description |
| --- | --- |
| Flows | Flows process data in Lakeflow Declarative Pipelines. The flows API uses the same DataFrame API as Apache Spark and Structured Streaming. A flow can write to streaming tables and sinks, such as a Kafka topic, using streaming semantics, or to a materialized view using batch semantics (see the sketch after this table). |
| Streaming tables | A streaming table is a Delta table with additional support for streaming or incremental data processing. It acts as a target for one or more flows in Lakeflow Declarative Pipelines. |
| Materialized views | A materialized view is a view whose results are precomputed and cached for faster access. It acts as a target for Lakeflow Declarative Pipelines. |
| Sinks | Lakeflow Declarative Pipelines supports external data sinks as targets, including event streaming services such as Apache Kafka or Azure Event Hubs, and external tables managed by Unity Catalog. |
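
Multiple flows can write into the same streaming table target. The following is a minimal sketch with the `dlt` module, assuming two placeholder source tables and the `spark` session that Databricks notebooks provide:

```python
import dlt

# Streaming table that acts as the target for more than one flow.
dlt.create_streaming_table("all_events")

# Each append flow writes into the same target with streaming semantics.
# The source table names are placeholders.
@dlt.append_flow(target="all_events")
def web_events():
    return spark.readStream.table("main.default.web_events_raw")

@dlt.append_flow(target="all_events")
def mobile_events():
    return spark.readStream.table("main.default.mobile_events_raw")
```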

Lakeflow Jobs

Lakeflow Jobs provides reliable orchestration and production monitoring for any data and AI workload. A job can consist of one or more tasks that run notebooks, pipelines, managed connectors, SQL queries, machine learning training, and model deployment and inference. Jobs also support custom control flow logic, such as branching with if/else statements and looping with for each statements. See Lakeflow Jobs.

| Feature | Description |
| --- | --- |
| Jobs | Jobs are the primary resource for orchestration. They represent a process that you want to run on a schedule. |
| Tasks | A task is a specific unit of work within a job. A variety of task types gives you a range of options for the work a job can perform. |
| Control flow in jobs | Control flow tasks let you control whether other tasks run and the order in which they run. |
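
For example, a job with a pipeline task and a dependent notebook task can be created with the Databricks SDK for Python. This is a minimal sketch; the job name, pipeline ID, and notebook path are placeholders:

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()  # reads workspace authentication from the environment

# A job with two tasks: a pipeline refresh followed by a notebook task that
# depends on it. The pipeline ID and notebook path are placeholders.
job = w.jobs.create(
    name="nightly-orders",
    tasks=[
        jobs.Task(
            task_key="refresh_pipeline",
            pipeline_task=jobs.PipelineTask(pipeline_id="<pipeline-id>"),
        ),
        jobs.Task(
            task_key="publish_report",
            depends_on=[jobs.TaskDependency(task_key="refresh_pipeline")],
            notebook_task=jobs.NotebookTask(notebook_path="/Workspace/Reports/publish"),
        ),
    ],
)
print(f"Created job {job.job_id}")
```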

Databricks Runtime for Apache Spark

The Databricks Runtime is a reliable and performance-optimized compute environment for running Spark workloads, including batch and streaming. Databricks Runtime provides Photon, a high-performance Databricks-native vectorized query engine, and various infrastructure optimizations like autoscaling. You can run your Spark and Structured Streaming workloads on the Databricks Runtime by building your Spark programs as notebooks, JARs, or Python wheels. See Databricks Runtime for Apache Spark.

| Feature | Description |
| --- | --- |
| Apache Spark on Databricks | Apache Spark is the processing engine at the heart of the Databricks Data Intelligence Platform. |
| Structured Streaming | Structured Streaming is Spark's near-real-time processing engine for streaming data. |
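
For example, a Structured Streaming query can read from the built-in rate source and write incrementally to a Delta table. This is a minimal sketch, assuming placeholder checkpoint and table names and the `spark` session that Databricks notebooks provide:

```python
from pyspark.sql import functions as F

# The built-in rate source generates rows continuously for testing.
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Windowed aggregation with a watermark so late data is bounded.
counts = (
    events
    .withWatermark("timestamp", "1 minute")
    .groupBy(F.window("timestamp", "30 seconds"))
    .agg(F.count("*").alias("event_count"))
)

# Write the results incrementally to a Delta table.
# The checkpoint path and table name are placeholders.
query = (
    counts.writeStream
    .outputMode("append")
    .option("checkpointLocation", "/Volumes/main/default/checkpoints/rate_counts")
    .toTable("main.default.rate_counts")
)
```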

What happened to Delta Live Tables (DLT)?

The product formerly known as Delta Live Tables (DLT) is now Lakeflow Declarative Pipelines. There is no migration required to use Lakeflow Declarative Pipelines.

Note

There are still some references to the DLT name in Databricks. The classic SKUs for Lakeflow Declarative Pipelines still begin with DLT, and APIs with DLT in the name have not changed.
