Hash calculation strategy for data type mismatches

Janice Chi 100 Reputation points
2025-06-14T13:08:58.0566667+00:00


In our current project, we are migrating data from an on-premises IBM DB2 system to Azure SQL Hyperscale, using Azure Databricks for transformation and reconciliation. This includes both batch and CDC-based pipelines.

Our project requirement is not just to ingest the data, but also to perform strict reconciliation between source (DB2) and target (Hyperscale). We have decided to:

- Compute row counts and row-level hash values inside both DB2 and Hyperscale.
- Bring these hashes and counts into Databricks for comparison and reconciliation (see the sketch after this list).
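
For context, this is roughly the comparison pattern we have in mind on the Databricks side. It is only a minimal sketch: the Delta table names (`recon.db2_hashes`, `recon.hyperscale_hashes`), the key column `pk`, and the hash column `row_hash` are placeholders, and we assume the per-row hashes arrive from both systems as hex strings.

```python
# Minimal PySpark sketch of the comparison step we plan to run in Databricks.
# Assumptions (placeholders): DB2 and Hyperscale hashes have already been landed
# as Delta tables `recon.db2_hashes` and `recon.hyperscale_hashes`, each with a
# business key column `pk` and a per-row hash column `row_hash` (hex string).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

src = spark.table("recon.db2_hashes").select("pk", F.col("row_hash").alias("src_hash"))
tgt = spark.table("recon.hyperscale_hashes").select("pk", F.col("row_hash").alias("tgt_hash"))

# Row-count reconciliation.
src_count, tgt_count = src.count(), tgt.count()
print(f"source rows = {src_count}, target rows = {tgt_count}")

# Full outer join on the key so every discrepancy can be classified.
cmp = src.join(tgt, on="pk", how="full_outer")

missing_in_target = cmp.where(F.col("tgt_hash").isNull())
missing_in_source = cmp.where(F.col("src_hash").isNull())
hash_mismatch = cmp.where(
    F.col("src_hash").isNotNull()
    & F.col("tgt_hash").isNotNull()
    & (F.col("src_hash") != F.col("tgt_hash"))
)

print("missing in target :", missing_in_target.count())
print("missing in source :", missing_in_source.count())
print("hash mismatches   :", hash_mismatch.count())
```

The full outer join is deliberate: it lets us report missing-in-target, missing-in-source, and hash-mismatch rows separately instead of only comparing totals.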

Questions:

1. Is this a recommended and reliable approach for reconciliation: calculating row-level hash values inside both source (DB2) and target (Hyperscale), and comparing them in Databricks?

2. For the following data type transformation pairs (see the table below), what is the recommended step-by-step process to ensure that hash values match despite the data type differences? Specifically:
   - How is the hash likely calculated in DB2 vs. Hyperscale for each type?
   - What adjustments or normalization should we do in Databricks before comparison? (A normalization sketch follows the table.)
   - Should we exclude any of these columns from the hash if lossless normalization is not possible?
     
| Source Data Type (DB2) | Target Data Type (Hyperscale) |
| --- | --- |
| timestamp | datetime2 |
| timestamp | datetime2 |
| integer | int |
| char(15) | int |
| varchar(16) | char(16) |
| varchar(50) | nvarchar(50) |
| char(25) | nchar(25) |
| varchar(30) | nvarchar(30) |
| varchar(60) | nvarchar(60) |
| decimal(12,2) | numeric(12,2) |
| decimal(10,0) | bigint |
| decimal(24,12) | numeric(24,12) |
| xml (with CCSID) | xml |
| varchar(16) | nvarchar(16) |
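
To make question 2 concrete, below is the kind of normalization we are considering applying in Databricks before hashing, shown as a PySpark sketch. It assumes we pull the raw column values into a staging table (`recon.staged_rows` is a placeholder name) and apply identical expressions to both sides' data; the column names `code_col`, `ts_col`, and `amt_col` are placeholders, and we are not certain these rules are lossless for every pair in the table above.

```python
# Sketch of the column-level normalization we are considering before hashing.
# Column names (code_col, ts_col, amt_col) and the staging table are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.table("recon.staged_rows")  # placeholder staging table

normalized = (
    df
    # CHAR/NCHAR columns are blank-padded; trim trailing spaces so, e.g.,
    # varchar(16) vs char(16) and char(25) vs nchar(25) hash the same.
    .withColumn("code_norm", F.rtrim(F.col("code_col")))
    # Timestamps: render in one canonical text form with a fixed number of
    # fractional-second digits (here 6) so DB2 timestamp and datetime2 agree.
    .withColumn("ts_norm", F.date_format(F.col("ts_col"), "yyyy-MM-dd HH:mm:ss.SSSSSS"))
    # Decimals: cast to the agreed precision/scale so trailing zeros and scale
    # are rendered identically on both sides.
    .withColumn("amt_norm", F.col("amt_col").cast("decimal(12,2)").cast("string"))
)

# Hash the normalized values with an explicit delimiter and a NULL sentinel so
# NULLs, empty strings, and shifted value boundaries cannot collide.
hashed = normalized.withColumn(
    "row_hash",
    F.sha2(
        F.concat_ws(
            "||",
            *[F.coalesce(F.col(c).cast("string"), F.lit("<NULL>"))
              for c in ["code_norm", "ts_norm", "amt_norm"]],
        ),
        256,
    ),
)
```

The explicit delimiter and NULL sentinel are there because plain concatenation can make different rows produce the same hash input; we would apply the equivalent trimming, formatting, and casting rules inside the DB2 and Hyperscale hash expressions as well.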

We would appreciate Microsoft's guidance on whether this reconciliation method is production-grade, and how best to handle the data type mismatches so that the hash values stay consistent.
