**When should we prefer `APPLY CHANGES INTO` over `MERGE INTO` for CDC processing?**

Use `APPLY CHANGES INTO` when:
- You're working with structured streaming CDC data and need declarative semantics for I/U/D (Insert/Update/Delete).
- You want to reduce boilerplate code around deduplication, ordering, and CDC event interpretation.
- You want better performance and scalability on large tables with native change data capture support.
- You want native support for slowly changing dimension (SCD Type 1 and Type 2) patterns.
In general, `APPLY CHANGES INTO` is purpose-built for CDC pipelines, while `MERGE INTO` is a more general-purpose tool.
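As a minimal sketch, the declarative form in Delta Live Tables SQL looks roughly like this; the table and column names (`customers_silver`, `cdc_events`, `customer_id`, `op`, `seq_ts`) are illustrative assumptions, not taken from any real pipeline:

```sql
-- Sketch only: all names are hypothetical.
CREATE OR REFRESH STREAMING TABLE customers_silver;

APPLY CHANGES INTO live.customers_silver
FROM STREAM(live.cdc_events)     -- raw CDC feed with I/U/D rows
KEYS (customer_id)               -- primary key used for matching
APPLY AS DELETE WHEN op = "D"    -- declarative delete handling
SEQUENCE BY seq_ts               -- ordering column for deduplication
COLUMNS * EXCEPT (op, seq_ts)    -- keep CDC metadata out of the target
STORED AS SCD TYPE 1;            -- update-in-place semantics
```

One statement covers the matching, ordering, deduplication, and delete handling that you would otherwise code by hand.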
**What key limitations of `MERGE INTO` does `APPLY CHANGES INTO` address?**
| Limitation of `MERGE INTO` | Addressed by `APPLY CHANGES INTO` |
|---|---|
| Not natively optimized for CDC data (I/U/D semantics must be coded manually) | Natively understands and handles the CDC input schema |
| Manual deduplication and ordering logic required | Automatically deduplicates based on the sequence column and primary key |
| Slower performance on large tables with high-frequency updates | Optimized for streaming ingestion with a better write path |
| Lacks built-in semantics for deletes and SCD patterns | Declaratively supports deletes and SCD Type 1/2 |
| Complexity increases with schema evolution | Supports basic schema evolution out of the box |
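For contrast, here is roughly the boilerplate that `MERGE INTO` requires for the same Type 1 behavior; the names (`customers_silver`, `cdc_batch`, `name`, `email`) are again hypothetical:

```sql
-- Manual CDC handling: dedup to the latest event per key,
-- then interpret I/U/D yourself. All names are illustrative.
MERGE INTO customers_silver AS t
USING (
  SELECT * FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY customer_id
                              ORDER BY seq_ts DESC) AS rn
    FROM cdc_batch                    -- one batch of raw CDC rows
  ) AS ranked
  WHERE rn = 1                        -- keep only the latest event per key
) AS s
ON t.customer_id = s.customer_id
WHEN MATCHED AND s.op = 'D' THEN
  DELETE
WHEN MATCHED THEN
  UPDATE SET t.name = s.name, t.email = s.email
WHEN NOT MATCHED AND s.op != 'D' THEN
  INSERT (customer_id, name, email)
  VALUES (s.customer_id, s.name, s.email);
```

Every piece of this (the window function, the operation-type conditions, the ordering column) is logic you own and maintain.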
**Main differences in capability and behavior**

| Feature | `MERGE INTO` | `APPLY CHANGES INTO` |
|---|---|---|
| Streaming support | Supported, but not optimized | Natively built for Structured Streaming |
| CDC semantics | Manual | Declarative (insert, update, delete) |
| Deduplication | Must be implemented manually | Built in, using sequence column + keys |
| Concurrency | Possible issues with concurrent writes | Better concurrency handling in streaming |
| Schema evolution | Supported (but can be complex) | Supported with less overhead |
| Error handling | Manual | Automatic error modes available |
| Performance | Slower for high-churn workloads | More efficient for large-scale streaming updates |
| Code complexity | Higher | Lower (fewer lines, less logic) |
**Can `APPLY CHANGES INTO` fully replace `MERGE INTO` in Structured Streaming pipelines with CDC logic?**

In most cases, yes. If your data has a clear primary key, a reliable timestamp or sequence column, and your CDC source clearly defines the operation type (I/U/D), `APPLY CHANGES INTO` is a better fit and can fully replace `MERGE INTO`. However, for non-standard merge logic, complex joins, or custom SCD patterns beyond the built-in Type 1 and Type 2, `MERGE INTO` might still be needed. `MERGE INTO` is also more flexible for batch-based and one-off ad-hoc operations, as in the sketch below.
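As one hypothetical example of logic the declarative form cannot express, consider soft deletes plus a guard against out-of-order updates; `customer_updates` and all column names are made-up assumptions:

```sql
-- Custom batch merge: soft deletes and guarded updates.
-- APPLY CHANGES INTO has no declarative equivalent for this.
MERGE INTO customers_silver AS t
USING customer_updates AS s
ON t.customer_id = s.customer_id
WHEN MATCHED AND s.op = 'D' THEN
  -- Soft delete: flag the row instead of physically removing it.
  UPDATE SET t.is_active = false, t.deleted_at = s.seq_ts
WHEN MATCHED AND s.seq_ts > t.last_updated THEN
  -- Apply the update only if the incoming event is newer.
  UPDATE SET t.email = s.email, t.last_updated = s.seq_ts
WHEN NOT MATCHED AND s.op != 'D' THEN
  INSERT (customer_id, email, is_active, last_updated)
  VALUES (s.customer_id, s.email, true, s.seq_ts);
```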
**Are there cost or performance advantages for large-scale Delta tables when using `APPLY CHANGES INTO`?**

Yes. The major advantages include:

- Lower compute cost: it processes CDC updates efficiently in micro-batches without repeatedly scanning the large target table.
- Reduced I/O: native optimizations reduce the number of files rewritten, which is especially important for large Delta tables.
- Better scalability: it handles high-velocity streams with millions of events more effectively than `MERGE INTO`.
- Optimized transaction overhead: particularly important with Unity Catalog and concurrent writers.
If the above response helps answer your question, remember to "Accept Answer" so that others in the community facing similar issues can easily find the solution. Your contribution is highly appreciated.
hth
Marcin