Hi @Janice Chi
In large-scale data migrations like yours (80 TB from IBM DB2 to Azure SQL Hyperscale using ADF + Azure Databricks), ensuring data consistency and reconciliation accuracy is critical. While the primary transformation is data type conversion, here are the key data cleaning and standardization steps to apply in Databricks to avoid reconciliation mismatches and maintain data quality:
Data Profiling
Before you start the migration, perform data profiling to understand the current state of your data. This will help identify inconsistencies, missing values, and duplicates that need addressing.
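A minimal PySpark profiling sketch you could run in a Databricks notebook before the migration starts (the path and column names below are illustrative assumptions; the Databricks-provided spark session is assumed):

```python
from pyspark.sql import functions as F

# Assumes the DB2 extract has already been landed in ADLS; path and column names are illustrative.
df = spark.read.parquet("/mnt/raw/db2/customer")

# Null count per column
df.select([F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]).show(truncate=False)

# Row count vs. distinct business-key count to surface duplicates early
total = df.count()
distinct_keys = df.select("customer_id").distinct().count()
print(f"rows={total}, distinct customer_id={distinct_keys}, potential duplicates={total - distinct_keys}")

# Basic count/min/max statistics per column
df.summary("count", "min", "max").show(truncate=False)
```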
Data Type Conversion
Since you're converting data types during the migration, ensure that the mapping from DB2 types to Azure SQL Hyperscale types is correct. Mismatched data types can lead to data loss or errors.
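For example, explicit casts in Databricks can enforce the agreed mapping before writing to Hyperscale. This is only a sketch; the columns and target types are illustrative, and the real conversions should be driven by your approved DB2-to-Hyperscale type-mapping document:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import DecimalType, TimestampType, IntegerType

# df is the DataFrame loaded in the profiling step above; casts below are illustrative only.
typed_df = (
    df
    .withColumn("balance", F.col("balance").cast(DecimalType(18, 2)))     # DB2 DECIMAL(18,2) -> decimal(18,2)
    .withColumn("created_ts", F.col("created_ts").cast(TimestampType()))  # DB2 TIMESTAMP -> datetime2
    .withColumn("status_code", F.col("status_code").cast(IntegerType()))  # DB2 SMALLINT -> int
)
```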
Handling Missing Values
Identify and manage missing or invalid values carefully to ensure data integrity and successful reconciliation post-migration:
- Detect Source-Specific Placeholders: In DB2, missing values are often stored as placeholders like '9999-12-31', '0001-01-01', 'NA', or -1. Replace these with NULL where semantically appropriate.
- Apply Business Rules for Imputation: For critical fields (e.g., customer_type, country_code), apply default values or inferred logic if approved by stakeholders.
- For date fields, use techniques like "earliest valid date", "current date", or rule-driven lookups only when consistent and auditable.
- Flag or Isolate Bad Records: If rows have excessive nulls or incomplete critical fields, move them to a quarantine/error zone in ADLS for later review rather than silently dropping them.
- Maintain Recon Traceability: Always distinguish between source NULL, system-generated defaults, and actual business values to ensure transparency during reconciliation audits.
- Tools: In Databricks, use na.fill(), na.drop(), or when().otherwise() for complex imputations; in ADF Mapping Data Flows, use Conditional Split + Derived Columns to handle missing data on the fly.
Proper missing value handling ensures both schema compliance during writes to Azure SQL Hyperscale and data trust during post-migration validation.
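As a hedged example, the placeholder replacement and quarantine pattern might look like this in PySpark (the placeholder values, column names, and quarantine path are assumptions to adapt to your tables):

```python
from pyspark.sql import functions as F

# Placeholder values and the quarantine path below are illustrative assumptions.
DATE_PLACEHOLDERS = ["9999-12-31", "0001-01-01"]
STRING_PLACEHOLDERS = ["NA", ""]

cleaned = (
    df
    # Replace DB2 placeholder dates with NULL where semantically appropriate
    .withColumn(
        "birth_date",
        F.when(F.col("birth_date").cast("string").isin(DATE_PLACEHOLDERS), None)
         .otherwise(F.col("birth_date"))
    )
    # Replace placeholder strings with NULL, then apply a stakeholder-approved default
    .withColumn(
        "country_code",
        F.when(F.col("country_code").isin(STRING_PLACEHOLDERS), None)
         .otherwise(F.col("country_code"))
    )
    .na.fill({"country_code": "UNK"})  # default value assumed to be approved by the business
)

# Quarantine rows missing critical fields instead of silently dropping them
bad_rows = cleaned.filter(F.col("customer_id").isNull())
good_rows = cleaned.filter(F.col("customer_id").isNotNull())
bad_rows.write.mode("append").parquet("/mnt/quarantine/customer")  # hypothetical ADLS quarantine path
```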
Removing Duplicates
Identify and handle duplicate records in your source DB2 data before ingestion to avoid integrity and performance issues in Azure SQL:
- Primary Key & Business Key Duplicates: Scan for duplicates based on natural keys or business identifiers (e.g., customer_id, order_id) to prevent upsert conflicts or merge failures.
- Hash-Based Deduplication: Use SHA2 hash functions in Azure Databricks (sha2(concat_ws('|', *columns), 256)) to detect full or partial row duplicates in large datasets.
- Window Functions for Latest Records: When dealing with multiple versions of the same entity (e.g., customer records with updates), use Spark row_number() or rank() over partitioned keys to retain only the latest version.
- Defer or Flag Complex Cases: If business rules are required to resolve duplicates (e.g., choosing records based on status, date, or source system), tag them during staging and resolve in a downstream cleansing step.
By proactively deduplicating data at scale, you prevent unnecessary reconciliation noise, reduce data bloat in Azure SQL Hyperscale, and ensure referential integrity across fact and dimension tables.
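A short sketch of both techniques in Databricks (the business key and the updated_ts recency column are assumptions):

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Hash every column to detect exact row duplicates (column list is illustrative)
hashed = df.withColumn("row_hash", F.sha2(F.concat_ws("|", *df.columns), 256))
exact_dupes = hashed.groupBy("row_hash").count().filter("count > 1")

# Keep only the latest version per business key, assuming updated_ts marks recency
w = Window.partitionBy("customer_id").orderBy(F.col("updated_ts").desc())
deduped = (
    df.withColumn("rn", F.row_number().over(w))
      .filter("rn = 1")
      .drop("rn")
)
```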
Standardizing Data
Ensure consistency across key fields to prevent reconciliation mismatches and ensure downstream compatibility:
- Date Formats: Normalize all date/time fields using a standard format (e.g., 'yyyy-MM-dd HH:mm:ss.SSS'). Ensure timezone awareness where applicable.
- String Cleanup: Apply .trim(), .lower()/.upper() to remove trailing/leading spaces and enforce consistent casing—especially important for keys used in joins or lookups.
- Address and Code Fields: Standardize abbreviations (e.g., "St." → "Street") using mapping tables or regex where needed. Useful for addresses, country/state codes, and enums.
- Boolean and Flag Values: Convert inconsistent representations like 'Y'/'N', '1'/'0', 'true'/'false' into a unified data type (BIT or BOOLEAN) across environments.
- Special Characters: Remove or encode special/non-ASCII characters from free-text fields using regex (e.g., regexp_replace(col, '[^\x20-\x7E]', '')).
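Several of these standardizations can be chained in a single PySpark pass. This is a sketch only; the column names and flag values are illustrative, and created_ts is assumed to arrive as a string:

```python
from pyspark.sql import functions as F

standardized = (
    df
    # Normalize timestamps to a single format/precision (assumes created_ts is a string column)
    .withColumn("created_ts", F.to_timestamp("created_ts", "yyyy-MM-dd HH:mm:ss.SSS"))
    # Trim and upper-case join keys for consistent matching
    .withColumn("customer_id", F.upper(F.trim(F.col("customer_id"))))
    # Map mixed flag representations to a single boolean
    .withColumn(
        "is_active",
        F.when(F.col("is_active").isin("Y", "1", "true", "TRUE"), F.lit(True))
         .when(F.col("is_active").isin("N", "0", "false", "FALSE"), F.lit(False))
         .otherwise(F.lit(None).cast("boolean"))
    )
    # Strip non-printable/non-ASCII characters from free-text fields
    .withColumn("notes", F.regexp_replace("notes", r"[^\x20-\x7E]", ""))
)
```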
Data Cleansing Tools
Utilize cloud-native tools like Azure Data Factory and Azure Databricks for scalable and automated data cleansing:
- Azure Data Factory (ADF): Use Mapping Data Flows for transformations such as null handling, type casting, trimming, deduplication, and conditional logic, all without writing code.
- Azure Databricks (PySpark): Ideal for large-scale cleansing tasks like regex-based cleaning, string normalization, hash generation for reconciliation, and complex data transformations.
Validation After Migration
Implement post-migration checks to validate that the data in Azure SQL Hyperscale matches what was in DB2. This can include reconciliation reports (e.g., row counts and checksums) and sampling of individual records.
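A hedged reconciliation sketch in PySpark, assuming the source extract is still available in ADLS and the target is read back over JDBC (the connection details, compared columns, and paths are placeholders; columns should be normalized to identical string representations before hashing, and the SQL Server JDBC driver must be available on the cluster):

```python
from pyspark.sql import functions as F

def recon_summary(df, cols):
    """Row count plus an order-independent checksum over the listed columns."""
    return (
        df.withColumn("row_hash", F.sha2(F.concat_ws("|", *cols), 256))
          .agg(
              F.count("*").alias("row_count"),
              F.sum(F.crc32("row_hash").cast("decimal(38,0)")).alias("checksum"),
          )
    )

# Source side: the DB2 extract landed in ADLS (path is illustrative)
source_df = spark.read.parquet("/mnt/raw/db2/customer")

# Target side: Azure SQL Hyperscale via JDBC (connection details are placeholders)
target_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://<server>.database.windows.net;databaseName=<db>")
    .option("dbtable", "dbo.customer")
    .option("user", "<user>")
    .option("password", "<password>")
    .load()
)

compare_cols = ["customer_id", "country_code", "balance"]  # illustrative comparison columns
recon_summary(source_df, compare_cols).show()
recon_summary(target_df, compare_cols).show()
```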
Documentation
Maintain detailed documentation of your data transformation processes and any cleaning steps taken. This can be crucial for audits and reconciliation.
By following these steps, you can enhance data quality and maintain consistency between your old DB2 environment and the new Azure SQL Hyperscale setup.
I hope this information helps. Please do let us know if you have any further queries.
Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.
Thank you.