When to Use MERGE INTO vs APPLY CHANGES INTO in Databricks CDC Pipelines

Janice Chi
2025-06-14T12:13:23.58+00:00

Background: In our CDC pipeline, we use Databricks to process Kafka CDC data (I/U/D events) into Delta tables. We’re evaluating whether to continue using MERGE INTO or shift to APPLY CHANGES INTO.


❓ Questions for Microsoft:

When should we prefer APPLY CHANGES INTO over MERGE INTO for CDC processing?

What are the key limitations of MERGE INTO that APPLY CHANGES INTO addresses (e.g., performance, concurrency, native CDC support)?

What are the main differences in capability and behavior between the two (streaming support, schema evolution, error handling, deduplication)?

Can APPLY CHANGES INTO fully replace MERGE INTO in structured streaming pipelines using CDC I/U/D logic?

Are there cost or performance advantages in large-scale Delta tables when using APPLY CHANGES INTO?

1 answer

Marcin Policht
    2025-06-14T12:45:21.86+00:00

    When should we prefer APPLY CHANGES INTO over MERGE INTO for CDC processing?

    Use APPLY CHANGES INTO when:

    • You're working with structured streaming CDC data and need declarative semantics for I/U/D (Insert/Update/Delete).
    • You want to reduce boilerplate code around deduplication, ordering, and CDC event interpretation.
    • You want better performance and scalability on large tables with native change data capture support.
    • You want native support for slowly changing dimension patterns (SCD Type 1 and Type 2).

    In general, APPLY CHANGES INTO is purpose-built for CDC pipelines and runs inside a Delta Live Tables pipeline, while MERGE INTO is a general-purpose Delta Lake command you can run from any notebook or job.
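
    For reference, here is a minimal Python sketch of the declarative pattern inside a Delta Live Tables pipeline. The table, view, and column names (customers, cdc_events, id, operation, event_ts) are illustrative assumptions, not names from your pipeline.

    ```python
    # Minimal sketch (assumed names) of the Python equivalent of APPLY CHANGES INTO,
    # intended to run inside a Delta Live Tables pipeline.
    import dlt
    from pyspark.sql.functions import expr

    # Target streaming table that the CDC feed keeps up to date.
    dlt.create_streaming_table("customers")

    dlt.apply_changes(
        target="customers",                        # Delta table maintained by DLT
        source="cdc_events",                       # streaming view/table of Kafka CDC rows
        keys=["id"],                               # primary key used for matching
        sequence_by="event_ts",                    # ordering column for out-of-order events
        apply_as_deletes=expr("operation = 'D'"),  # treat these rows as deletes
        except_column_list=["operation", "event_ts"],  # drop CDC metadata from the target
        stored_as_scd_type=1,                      # keep only the latest version per key
    )
    ```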

    What key limitations of MERGE INTO does APPLY CHANGES INTO address?

    Limitation of MERGE INTO | Addressed by APPLY CHANGES INTO
    Not natively optimized for CDC data (I/U/D semantics must be coded manually) | Natively understands and handles the CDC input schema
    Manual deduplication and ordering logic required | Automatically deduplicates based on the sequence column and primary key
    Slower performance on large tables with high-frequency updates | Optimized for streaming ingestion with a better write path
    Lacks built-in semantics for deletes and SCD patterns | Declaratively supports insert, update, and delete logic
    Complexity increases with schema evolution | Supports basic schema evolution out of the box
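
    To make the deduplication/ordering row concrete: with MERGE INTO in a streaming job, you typically write something like the sketch below yourself for every micro-batch. The paths and column names (bronze_cdc, /delta/customers, id, operation, event_ts) are assumptions for illustration, and it presumes a Databricks notebook where spark is already defined.

    ```python
    # Manual dedup + merge per micro-batch: the boilerplate APPLY CHANGES INTO removes.
    from delta.tables import DeltaTable
    from pyspark.sql import functions as F, Window

    def upsert_to_delta(batch_df, batch_id):
        # Keep only the latest CDC event per key within this micro-batch.
        w = Window.partitionBy("id").orderBy(F.col("event_ts").desc())
        latest = (batch_df
                  .withColumn("rn", F.row_number().over(w))
                  .filter("rn = 1")
                  .drop("rn"))

        target = DeltaTable.forPath(spark, "/delta/customers")
        (target.alias("t")
            .merge(latest.alias("s"), "t.id = s.id")
            .whenMatchedDelete(condition="s.operation = 'D'")
            .whenMatchedUpdateAll(condition="s.operation <> 'D'")
            .whenNotMatchedInsertAll(condition="s.operation <> 'D'")
            .execute())

    (spark.readStream.table("bronze_cdc")   # Kafka CDC events already landed in bronze
        .writeStream
        .foreachBatch(upsert_to_delta)
        .option("checkpointLocation", "/checkpoints/customers")
        .start())
    ```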

    Main differences in capability and behavior

    Feature | MERGE INTO | APPLY CHANGES INTO
    Streaming support | Supported, but not optimized | Natively built for structured streaming
    CDC semantics | Manual | Declarative (insert, update, delete)
    Deduplication | Must implement manually | Built in, using the sequence column and keys
    Concurrency | Possible issues with concurrent writes | Better concurrency handling in streaming
    Schema evolution | Supported (but can be complex) | Supported with less overhead
    Error handling | Manual | Automatic error modes available
    Performance | Slower for high-churn workloads | More efficient for large-scale streaming updates
    Code complexity | Higher | Lower (fewer lines, less logic)

    Can APPLY CHANGES INTO fully replace MERGE INTO in structured streaming pipelines with CDC logic?

    In most cases, yes. If your data has a clear primary key, a reliable timestamp or sequence column, and your CDC source clearly defines the operation type (I/U/D), APPLY CHANGES INTO is a better fit and can fully replace MERGE INTO. However, for non-standard merge logic, complex joins, or custom SCD patterns beyond the built-in Type 1 and Type 2 handling, MERGE INTO may still be needed. MERGE INTO also remains more flexible for batch-based and one-off ad-hoc operations.
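
    As one example of logic that still calls for MERGE INTO, here is a hedged sketch of a soft-delete pattern (flagging rows rather than removing them), which APPLY CHANGES INTO does not express declaratively. The names (updates_df, /delta/customers, is_active, deleted_at) are illustrative assumptions.

    ```python
    # Soft delete + conditional upsert: custom merge logic kept in MERGE INTO.
    from delta.tables import DeltaTable

    target = DeltaTable.forPath(spark, "/delta/customers")

    (target.alias("t")
        .merge(updates_df.alias("s"), "t.id = s.id")   # updates_df: deduplicated CDC batch
        # Soft delete: mark the row inactive instead of physically deleting it.
        .whenMatchedUpdate(
            condition="s.operation = 'D'",
            set={"is_active": "false", "deleted_at": "s.event_ts"})
        # Regular upsert path for inserts and updates.
        .whenMatchedUpdateAll(condition="s.operation <> 'D'")
        .whenNotMatchedInsertAll(condition="s.operation <> 'D'")
        .execute())
    ```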

    Are there cost or performance advantages for large-scale Delta tables when using APPLY CHANGES INTO?

    Yes. Major advantages include:

    • Lower compute cost: Efficiently processes CDC updates in micro-batches without repeatedly scanning large target tables.
    • Reduced I/O: Native optimizations reduce the number of files rewritten, especially important in large Delta tables.
    • Better scalability: Handles high-velocity streams with millions of events more effectively than MERGE INTO.
    • Optimized transaction overhead: Particularly important with Unity Catalog and concurrent writers.

    If the above response helps answer your question, remember to "Accept Answer" so that others in the community facing similar issues can easily find the solution. Your contribution is highly appreciated.

    hth

    Marcin

