Essential Data Cleaning Steps for Successful Reconciliation

Janice Chi 100 Reputation points
2025-06-12T14:17:19.49+00:00

In our project, we are migrating 80TB of production-grade data from DB2 to Azure SQL Hyperscale using ADF and Databricks. While the primary transformation is data type conversion, what essential data cleaning steps should be performed to ensure consistency, prevent reconciliation failures between DB2 and Hyperscale, and maintain data quality across environments?

Azure Databricks

1 answer

  1. J N S S Kasyap 3,300 Reputation points Microsoft External Staff Moderator
    2025-06-12T15:58:33.73+00:00

    Hi @Janice Chi

    In large-scale data migrations like yours (80TB from IBM DB2 to Azure SQL Hyperscale using ADF + Azure Databricks), ensuring data consistency and reconciliation accuracy is critical. While the primary transformation is data type conversion, here are the key data cleaning and standardization steps to perform in Databricks to avoid reconciliation mismatches and maintain data quality:

    Data Profiling

    Before you start the migration, perform data profiling to understand the current state of your data. This will help identify inconsistencies, missing values, and duplicates that need addressing. 
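    A minimal profiling sketch in PySpark, assuming the DB2 data has already been staged into a Databricks table; the table and column names (staging.customers, customer_id) are illustrative placeholders:

```python
from pyspark.sql import functions as F

# Hypothetical staging copy of a DB2 table; names are illustrative.
df = spark.table("staging.customers")

# Per-column null counts to surface missing-value hotspots.
df.select([F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]).show(truncate=False)

# Total vs. distinct key counts to spot candidate duplicates.
print(f"rows={df.count()}, distinct customer_id={df.select('customer_id').distinct().count()}")

# Basic summary statistics for range/outlier checks.
df.describe().show(truncate=False)
```

    The profile output is what drives the placeholder, default, and deduplication rules in the steps that follow.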

    Data Type Conversion

    Since you're converting data types during the migration, ensure that the mapping from DB2 types to Azure SQL Hyperscale types is correct. Mismatched data types can lead to data loss or errors.
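    As a hedged sketch of making that mapping explicit on the Databricks side (rather than relying on implicit casts at the sink), the snippet below casts into new columns first so failed conversions can be caught before the write. The table, columns, and target types are assumptions, not your actual mapping:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import DecimalType, TimestampType

df = spark.table("staging.orders")  # hypothetical staging table

# Cast into new columns so failed casts (which become NULL) can still be
# compared against the original values.
checked = (
    df
    .withColumn("order_amount_c", F.col("order_amount").cast(DecimalType(18, 2)))  # e.g., DB2 DECIMAL(18,2)
    .withColumn("order_ts_c", F.col("order_ts").cast(TimestampType()))             # e.g., DB2 TIMESTAMP -> datetime2
)

# Rows where the source had a value but the cast produced NULL indicate a mapping problem.
bad_casts = checked.filter(
    (F.col("order_amount").isNotNull() & F.col("order_amount_c").isNull())
    | (F.col("order_ts").isNotNull() & F.col("order_ts_c").isNull())
)
print(f"rows failing type conversion: {bad_casts.count()}")
```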

    Handling Missing Values

    Identify and manage missing or invalid values carefully to ensure data integrity and successful reconciliation post-migration: 

    • Detect Source-Specific Placeholders: In DB2, missing values are often stored as placeholders like '9999-12-31', '0001-01-01', 'NA', or -1. Replace these with NULL where semantically appropriate. 
    • Apply Business Rules for Imputation:  For critical fields (e.g., customer_type, country_code), apply default values or inferred logic if approved by stakeholders. 
    • For date fields, use techniques like "earliest valid date", "current date", or rule-driven lookups only when consistent and auditable. 
    • Flag or Isolate Bad Records: If rows have excessive nulls or incomplete critical fields, move them to a quarantine/error zone in ADLS for later review rather than silently dropping them. 
    • Maintain Recon Traceability: Always distinguish between source NULL, system-generated defaults, and actual business values to ensure transparency during reconciliation audits. 
    • Tools: In Databricks, use na.fill(), na.drop(), or when().otherwise() for complex imputations (see the sketch below); in ADF Mapping Data Flows, use Conditional Split + Derived Columns to handle missing data on-the-fly.

    Proper missing value handling ensures both schema compliance during writes to Azure SQL Hyperscale and data trust during post-migration validation. 
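    A minimal sketch of that Databricks-side handling, assuming hypothetical placeholder values, column names, and a quarantine path in ADLS; the real placeholders and defaults should come from profiling and stakeholder-approved rules:

```python
from pyspark.sql import functions as F

df = spark.table("staging.customers")  # hypothetical staging table

# Replace DB2-style placeholder values with true NULLs where semantically appropriate.
cleaned = (
    df
    .withColumn("birth_date",
                F.when(F.col("birth_date").isin("9999-12-31", "0001-01-01"), None)
                 .otherwise(F.col("birth_date")))
    .withColumn("country_code",
                F.when(F.col("country_code").isin("NA", ""), None)
                 .otherwise(F.col("country_code")))
)

# Stakeholder-approved default for a critical field, plus a flag so reconciliation
# can tell system-generated defaults apart from genuine source values.
cleaned = (
    cleaned
    .withColumn("customer_type_defaulted", F.col("customer_type").isNull())
    .na.fill({"customer_type": "UNKNOWN"})
)

# Quarantine rows with too many missing critical fields instead of silently dropping them.
critical = ["customer_id", "country_code", "birth_date"]
null_count = sum(F.col(c).isNull().cast("int") for c in critical)
quarantine = cleaned.filter(null_count >= 2)
good = cleaned.filter(null_count < 2)
quarantine.write.mode("append").format("delta").save("/mnt/adls/quarantine/customers")  # illustrative path
```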

    Removing Duplicates

    Identify and handle duplicate records in your source DB2 data before ingestion to avoid integrity and performance issues in Azure SQL: 
    • Primary Key & Business Key Duplicates: Scan for duplicates based on natural keys or business identifiers (e.g., customer_id, order_id) to prevent upsert conflicts or merge failures.

    • Hash-Based Deduplication: Use SHA2 hash functions in Azure Databricks (sha2(concat_ws('|', *columns), 256)) to detect full or partial row duplicates in large datasets. 
    • Window Functions for Latest Records: When dealing with multiple versions of the same entity (e.g., customer records with updates), use Spark row_number() or rank() over partitioned keys to retain only the latest version. 
    • Defer or Flag Complex Cases: If business rules are required to resolve duplicates (e.g., choosing records based on status, date, or source system), tag them during staging and resolve in a downstream cleansing step. 

    By proactively deduplicating data at scale, you prevent unnecessary reconciliation noise, reduce data bloat in Azure SQL Hyperscale, and ensure referential integrity across fact and dimension tables. 
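    A minimal sketch of the hash-based and latest-record deduplication described above, again with hypothetical table and column names:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = spark.table("staging.customers")  # hypothetical staging table

# Full-row duplicates: hash every column and keep one row per hash.
hashed = df.withColumn("row_hash", F.sha2(F.concat_ws("|", *df.columns), 256))
full_dedup = hashed.dropDuplicates(["row_hash"]).drop("row_hash")

# Multiple versions of the same business key: keep only the latest record per customer_id.
w = Window.partitionBy("customer_id").orderBy(F.col("last_updated_ts").desc())
latest = (
    full_dedup
    .withColumn("rn", F.row_number().over(w))
    .filter(F.col("rn") == 1)
    .drop("rn")
)
```

    Note that concat_ws skips NULL columns, so two rows differing only in a NULL versus an empty value can hash to the same value; coalesce to a sentinel first if that distinction matters for your data.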

    Standardizing Data

    Ensure consistency across key fields to prevent reconciliation mismatches and ensure downstream compatibility: 

    • Date Formats: Normalize all date/time fields using a standard format (e.g., 'yyyy-MM-dd HH:mm:ss.SSS'). Ensure timezone awareness where applicable. 
    • String Cleanup: Apply .trim(), .lower()/.upper() to remove trailing/leading spaces and enforce consistent casing—especially important for keys used in joins or lookups. 
    • Address and Code Fields: Standardize abbreviations (e.g., "St." → "Street") using mapping tables or regex where needed. Useful for addresses, country/state codes, and enums. 
    • Boolean and Flag Values: Convert inconsistent representations like 'Y'/'N', '1'/'0', 'true'/'false' into a unified data type (BIT or BOOLEAN) across environments. 
    • Special Characters: Remove or encode special/non-ASCII characters from free-text fields using regex (e.g., regexp_replace(col, '[^\x20-\x7E]', '')). 
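    A minimal standardization sketch covering those points; the formats, flag values, and regex mappings are illustrative and should be replaced with your agreed standards:

```python
from pyspark.sql import functions as F

df = spark.table("staging.customers")  # hypothetical staging table

standardized = (
    df
    # Normalize timestamps to one canonical format/type (adjust the pattern to the source).
    .withColumn("created_ts", F.to_timestamp("created_ts", "yyyy-MM-dd HH:mm:ss.SSS"))
    # Trim and upper-case join/lookup keys for consistent matching.
    .withColumn("country_code", F.upper(F.trim(F.col("country_code"))))
    # Unify boolean flag representations into a true boolean.
    .withColumn("is_active",
                F.when(F.upper(F.col("is_active")).isin("Y", "1", "TRUE"), F.lit(True))
                 .when(F.upper(F.col("is_active")).isin("N", "0", "FALSE"), F.lit(False))
                 .otherwise(F.lit(None).cast("boolean")))
    # Strip non-printable / non-ASCII characters from free-text fields.
    .withColumn("notes", F.regexp_replace(F.col("notes"), r"[^\x20-\x7E]", ""))
    # Expand common abbreviations via regex or a mapping table.
    .withColumn("address_line1", F.regexp_replace(F.col("address_line1"), r"\bSt\.", "Street"))
)
```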

    Data Cleansing Tools

    Utilize cloud-native tools like Azure Data Factory and Azure Databricks for scalable and automated data cleansing: 

    • Azure Data Factory (ADF): Use Mapping Data Flows for transformations such as null handling, type casting, trimming, deduplication, and conditional logic, all without writing code. 
    • Azure Databricks (PySpark): Ideal for large-scale cleansing tasks like regex-based cleaning, string normalization, hash generation for reconciliation, and complex data transformations. 

    Validation After Migration

    Implement checks post-migration to validate that the data in Azure SQL Hyperscale matches what was in DB2. This can include reconciliation reports (row counts, column aggregates, and row-level hash comparisons) as well as sampling methodologies; a minimal sketch follows.
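    This sketch assumes the Hyperscale copy is reachable from Databricks over the standard Spark JDBC reader; the connection details, table, and compared columns below are placeholders:

```python
from pyspark.sql import functions as F

src = spark.table("staging.orders")  # hypothetical staged DB2 extract
tgt = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://<server>.database.windows.net;databaseName=<db>")  # placeholder
    .option("dbtable", "dbo.orders")
    .option("user", "<user>").option("password", "<secret>")  # use a secret scope in practice
    .load()
)

# 1. Row-count reconciliation.
print(f"source rows={src.count()}, target rows={tgt.count()}")

# 2. Column-level aggregates on key numeric/date fields.
src.agg(F.sum("order_amount"), F.min("order_ts"), F.max("order_ts")).show()
tgt.agg(F.sum("order_amount"), F.min("order_ts"), F.max("order_ts")).show()

# 3. Row-hash comparison to pinpoint rows that differ (or exist on only one side).
cols = ["order_id", "customer_id", "order_amount", "order_ts"]
src_h = src.select("order_id", F.sha2(F.concat_ws("|", *cols), 256).alias("src_hash"))
tgt_h = tgt.select("order_id", F.sha2(F.concat_ws("|", *cols), 256).alias("tgt_hash"))
mismatches = (
    src_h.join(tgt_h, "order_id", "full_outer")
    .filter(F.col("src_hash").isNull()
            | F.col("tgt_hash").isNull()
            | (F.col("src_hash") != F.col("tgt_hash")))
)
print(f"mismatched or missing rows: {mismatches.count()}")
```

    Hash comparison only works if both sides render values identically after type conversion (decimal scale, timestamp precision, trailing spaces), so apply the same standardization to both sides before hashing.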

    Documentation

    Maintain detailed documentation of your data transformation processes and any cleaning steps taken. This can be crucial for audits and reconciliation.

    By following these steps, you can enhance data quality and maintain consistency between your old DB2 environment and the new Azure SQL Hyperscale setup. 

    I hope this information helps. Please do let us know if you have any further queries.

    Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.

    Thank you.

