Hash calculation strategy for data type mismatches
In our current project, we are migrating data from an on-premises IBM DB2 system to Azure SQL Hyperscale, using Azure Databricks for transformation and reconciliation. This includes both batch and CDC-based pipelines.
Our requirement is not just to ingest the data, but also to perform strict reconciliation between the source (DB2) and the target (Hyperscale). We have decided to:
- Compute row counts and row-level hash values inside both DB2 and Hyperscale.
- Bring these hashes and counts into Databricks for comparison and reconciliation (a rough sketch of this comparison step follows this list).
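For context, this is roughly how we intend to run the comparison in Databricks once the engine-computed counts and hashes are staged. The table names, the single-column key, and the column names (`pk`, `row_hash`) are placeholders for illustration, not our actual schema:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical staged extracts: one row per source row, carrying the primary
# key and the row-level hash computed inside DB2 / Hyperscale respectively.
db2_hashes = spark.table("recon.db2_customer_hashes")        # columns: pk, row_hash
hs_hashes  = spark.table("recon.hyperscale_customer_hashes") # columns: pk, row_hash

# Row-count check per side.
print(f"DB2 rows: {db2_hashes.count()}, Hyperscale rows: {hs_hashes.count()}")

# Full outer join on the key, then classify every row.
compared = (
    db2_hashes.alias("s")
    .join(hs_hashes.alias("t"), on="pk", how="full_outer")
    .withColumn(
        "status",
        F.when(F.col("s.row_hash").isNull(), F.lit("missing_in_source"))
         .when(F.col("t.row_hash").isNull(), F.lit("missing_in_target"))
         .when(F.col("s.row_hash") != F.col("t.row_hash"), F.lit("hash_mismatch"))
         .otherwise(F.lit("match")),
    )
)

compared.groupBy("status").count().show()
```

This only works, of course, if both engines produce byte-identical hash inputs, which is exactly where the data type differences below come in.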
Questions:
- Is calculating row-level hash values inside both the source (DB2) and the target (Hyperscale) and then comparing them in Databricks a recommended and reliable approach for reconciliation?
- For the data type transformation pairs listed in the table below, what is the recommended step-by-step process to ensure that the hash values match despite the data type differences? Specifically:
  - How is the hash likely to be calculated in DB2 vs. Hyperscale for each type?
  - What adjustments or normalization should we apply in Databricks before comparison (see the sketch after the table)?
  - Should we exclude any of these columns from the hash if lossless normalization is not possible?
| Source Data Type | Target Data Type |
|---|---|
| timestamp | datetime2 |
| timestamp | datetime2 |
| integer | int |
| char(15) | int |
| varchar(16) | char(16) |
| varchar(50) | nvarchar(50) |
| char(25) | nchar(25) |
| varchar(30) | nvarchar(30) |
| varchar(60) | nvarchar(60) |
| decimal(12,2) | numeric(12,2) |
| decimal(10,0) | bigint |
| decimal(24,12) | numeric(24,12) |
| xml ccsid | xml |
| varchar(16) | nvarchar(16) |
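To make the normalization question concrete, here is a rough sketch of the kind of canonicalization we have been experimenting with in Databricks, assuming we stage the raw column values (not only the engine-computed hashes) for the problematic columns. The table names, column names, and canonical formats are assumptions for illustration, not a settled design:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical staged raw extracts from both sides.
db2_raw = spark.table("recon.db2_customer_raw")
hs_raw  = spark.table("recon.hyperscale_customer_raw")

def canonicalize(df):
    """Render each problematic column into one canonical string form,
    applied identically to the DB2 and Hyperscale extracts."""
    return (
        df
        # timestamp -> datetime2: fix the fractional-second precision so that
        # differing default precisions do not change the hash input.
        .withColumn("created_ts_c", F.date_format("created_ts", "yyyy-MM-dd HH:mm:ss.SSSSSS"))
        # char/nchar: strip trailing blank padding.
        .withColumn("status_cd_c", F.rtrim(F.col("status_cd")))
        # char(15) -> int: normalize to a plain integer string (no padding, no leading zeros).
        .withColumn("account_no_c", F.trim(F.col("account_no")).cast("bigint").cast("string"))
        # decimal(12,2) -> numeric(12,2): force a single precision/scale before rendering.
        .withColumn("amount_c", F.col("amount").cast("decimal(12,2)").cast("string"))
        # decimal(10,0) -> bigint: cast both sides to bigint.
        .withColumn("ref_id_c", F.col("ref_id").cast("bigint").cast("string"))
    )

def add_row_hash(df, cols):
    # Explicit separator plus a NULL sentinel so NULL and empty string differ.
    canon = [F.coalesce(F.col(c), F.lit("<NULL>")) for c in cols]
    return df.withColumn("row_hash", F.sha2(F.concat_ws("||", *canon), 256))

canon_cols = ["created_ts_c", "status_cd_c", "account_no_c", "amount_c", "ref_id_c"]
db2_canon = add_row_hash(canonicalize(db2_raw), canon_cols)
hs_canon  = add_row_hash(canonicalize(hs_raw), canon_cols)
```

Our open question is whether the same canonical formatting can be reproduced exactly inside DB2 and Hyperscale SQL (so the hashes can stay engine-computed), or whether columns such as char(15) -> int and xml ccsid -> xml should simply be excluded from the hash and reconciled separately.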
We would appreciate Microsoft’s guidance on whether this reconciliation method is production-grade, and on how best to handle the data type mismatches to ensure hash consistency.