Essential Data Cleaning Steps for Successful Reconciliation

Janice Chi 100 Reputation points
2025-06-12T14:17:19.49+00:00

In our project, we are migrating 80TB of production-grade data from DB2 to Azure SQL Hyperscale using ADF and Databricks. While the primary transformation is data type conversion, what essential data cleaning steps should be performed to ensure consistency, prevent reconciliation failures between DB2 and Hyperscale, and maintain data quality across environments?

Azure Databricks

1 answer

  1. J N S S Kasyap 3,300 Reputation points Microsoft External Staff Moderator
    2025-06-12T15:58:33.73+00:00

    Hi @Janice Chi

    In large-scale data migrations like yours (80TB from IBM DB2 to Azure SQL Hyperscale using ADF + Azure Databricks), ensuring data consistency and reconciliation accuracy is critical. While the primary transformation is data type conversion, here are the key data cleaning and standardization steps to perform in Databricks to avoid reconciliation mismatches and maintain data quality:

    Data Profiling

    Before you start the migration, perform data profiling to understand the current state of your data. This will help identify inconsistencies, missing values, and duplicates that need addressing. 
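    A minimal profiling sketch in PySpark, assuming the DB2 data has already been staged into a Databricks table; the table and column names (staging.customers, customer_id) are illustrative placeholders:

```python
from pyspark.sql import functions as F

# Hypothetical staging copy of a DB2 table; names are illustrative.
df = spark.table("staging.customers")

# Per-column null counts to surface missing-value hotspots.
df.select([F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]).show(truncate=False)

# Total vs. distinct key counts to spot candidate duplicates.
print(f"rows={df.count()}, distinct customer_id={df.select('customer_id').distinct().count()}")

# Basic summary statistics for range/outlier checks.
df.describe().show(truncate=False)
```

    The profile output is what drives the placeholder, default, and deduplication rules in the steps that follow.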

    Data Type Conversion

    Since you're converting data types during the migration, ensure that the mapping from DB2 types to Azure SQL Hyperscale types is correct. Mismatched data types can lead to data loss or errors.
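    As a hedged sketch of making that mapping explicit on the Databricks side (rather than relying on implicit casts at the sink), the snippet below casts into new columns first so failed conversions can be caught before the write. The table, columns, and target types are assumptions, not your actual mapping:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import DecimalType, TimestampType

df = spark.table("staging.orders")  # hypothetical staging table

# Cast into new columns so failed casts (which become NULL) can still be
# compared against the original values.
checked = (
    df
    .withColumn("order_amount_c", F.col("order_amount").cast(DecimalType(18, 2)))  # e.g., DB2 DECIMAL(18,2)
    .withColumn("order_ts_c", F.col("order_ts").cast(TimestampType()))             # e.g., DB2 TIMESTAMP -> datetime2
)

# Rows where the source had a value but the cast produced NULL indicate a mapping problem.
bad_casts = checked.filter(
    (F.col("order_amount").isNotNull() & F.col("order_amount_c").isNull())
    | (F.col("order_ts").isNotNull() & F.col("order_ts_c").isNull())
)
print(f"rows failing type conversion: {bad_casts.count()}")
```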

    Handling Missing Values

    Identify and manage missing or invalid values carefully to ensure data integrity and successful reconciliation post-migration: 

    • Detect Source-Specific Placeholders: In DB2, missing values are often stored as placeholders like '9999-12-31', '0001-01-01', 'NA', or -1. Replace these with NULL where semantically appropriate. 
    • Apply Business Rules for Imputation:  For critical fields (e.g., customer_type, country_code), apply default values or inferred logic if approved by stakeholders. 
    • For date fields, use techniques like "earliest valid date", "current date", or rule-driven lookups only when consistent and auditable. 
    • Flag or Isolate Bad Records: If rows have excessive nulls or incomplete critical fields, move them to a quarantine/error zone in ADLS for later review rather than silently dropping them. 
    • Maintain Recon Traceability: Always distinguish between source NULL, system-generated defaults, and actual business values to ensure transparency during reconciliation audits. 
    • Tools: In Databricks, use na.fill(), na.drop(), or when().otherwise() for complex imputations (see the sketch below); in ADF Mapping Data Flows, use Conditional Split + Derived Columns to handle missing data on-the-fly.

    Proper missing value handling ensures both schema compliance during writes to Azure SQL Hyperscale and data trust during post-migration validation. 
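    A minimal sketch of that Databricks-side handling, assuming hypothetical placeholder values, column names, and a quarantine path in ADLS; the real placeholders and defaults should come from profiling and stakeholder-approved rules:

```python
from pyspark.sql import functions as F

df = spark.table("staging.customers")  # hypothetical staging table

# Replace DB2-style placeholder values with true NULLs where semantically appropriate.
cleaned = (
    df
    .withColumn("birth_date",
                F.when(F.col("birth_date").isin("9999-12-31", "0001-01-01"), None)
                 .otherwise(F.col("birth_date")))
    .withColumn("country_code",
                F.when(F.col("country_code").isin("NA", ""), None)
                 .otherwise(F.col("country_code")))
)

# Stakeholder-approved default for a critical field, plus a flag so reconciliation
# can tell system-generated defaults apart from genuine source values.
cleaned = (
    cleaned
    .withColumn("customer_type_defaulted", F.col("customer_type").isNull())
    .na.fill({"customer_type": "UNKNOWN"})
)

# Quarantine rows with too many missing critical fields instead of silently dropping them.
critical = ["customer_id", "country_code", "birth_date"]
null_count = sum(F.col(c).isNull().cast("int") for c in critical)
quarantine = cleaned.filter(null_count >= 2)
good = cleaned.filter(null_count < 2)
quarantine.write.mode("append").format("delta").save("/mnt/adls/quarantine/customers")  # illustrative path
```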

    Removing Duplicates

    Identify and handle duplicate records in your source DB2 data before ingestion to avoid integrity and performance issues in Azure SQL: 
    • Primary Key & Business Key Duplicates: Scan for duplicates based on natural keys or business identifiers (e.g., customer_id, order_id) to prevent upsert conflicts or merge failures.

    • Hash-Based Deduplication: Use SHA2 hash functions in Azure Databricks (sha2(concat_ws('|', *columns), 256)) to detect full or partial row duplicates in large datasets. 
    • Window Functions for Latest Records: When dealing with multiple versions of the same entity (e.g., customer records with updates), use Spark row_number() or rank() over partitioned keys to retain only the latest version. 
    • Defer or Flag Complex Cases: If business rules are required to resolve duplicates (e.g., choosing records based on status, date, or source system), tag them during staging and resolve in a downstream cleansing step. 

    By proactively deduplicating data at scale, you prevent unnecessary reconciliation noise, reduce data bloat in Azure SQL Hyperscale, and ensure referential integrity across fact and dimension tables. 
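    A minimal sketch of the hash-based and latest-record deduplication described above, again with hypothetical table and column names:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = spark.table("staging.customers")  # hypothetical staging table

# Full-row duplicates: hash every column and keep one row per hash.
hashed = df.withColumn("row_hash", F.sha2(F.concat_ws("|", *df.columns), 256))
full_dedup = hashed.dropDuplicates(["row_hash"]).drop("row_hash")

# Multiple versions of the same business key: keep only the latest record per customer_id.
w = Window.partitionBy("customer_id").orderBy(F.col("last_updated_ts").desc())
latest = (
    full_dedup
    .withColumn("rn", F.row_number().over(w))
    .filter(F.col("rn") == 1)
    .drop("rn")
)
```

    Note that concat_ws skips NULL columns, so two rows differing only in a NULL versus an empty value can hash to the same value; coalesce to a sentinel first if that distinction matters for your data.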

    Standardizing Data

    Ensure consistency across key fields to prevent reconciliation mismatches and ensure downstream compatibility: 

    • Date Formats: Normalize all date/time fields using a standard format (e.g., 'yyyy-MM-dd HH:mm:ss.SSS'). Ensure timezone awareness where applicable. 
    • String Cleanup: Apply .trim(), .lower()/.upper() to remove trailing/leading spaces and enforce consistent casing—especially important for keys used in joins or lookups. 
    • Address and Code Fields: Standardize abbreviations (e.g., "St." → "Street") using mapping tables or regex where needed. Useful for addresses, country/state codes, and enums. 
    • Boolean and Flag Values: Convert inconsistent representations like 'Y'/'N', '1'/'0', 'true'/'false' into a unified data type (BIT or BOOLEAN) across environments. 
    • Special Characters: Remove or encode special/non-ASCII characters from free-text fields using regex (e.g., regexp_replace(col, '[^\x20-\x7E]', '')). 
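    A minimal standardization sketch covering those points; the formats, flag values, and regex mappings are illustrative and should be replaced with your agreed standards:

```python
from pyspark.sql import functions as F

df = spark.table("staging.customers")  # hypothetical staging table

standardized = (
    df
    # Normalize timestamps to one canonical format/type (adjust the pattern to the source).
    .withColumn("created_ts", F.to_timestamp("created_ts", "yyyy-MM-dd HH:mm:ss.SSS"))
    # Trim and upper-case join/lookup keys for consistent matching.
    .withColumn("country_code", F.upper(F.trim(F.col("country_code"))))
    # Unify boolean flag representations into a true boolean.
    .withColumn("is_active",
                F.when(F.upper(F.col("is_active")).isin("Y", "1", "TRUE"), F.lit(True))
                 .when(F.upper(F.col("is_active")).isin("N", "0", "FALSE"), F.lit(False))
                 .otherwise(F.lit(None).cast("boolean")))
    # Strip non-printable / non-ASCII characters from free-text fields.
    .withColumn("notes", F.regexp_replace(F.col("notes"), r"[^\x20-\x7E]", ""))
    # Expand common abbreviations via regex or a mapping table.
    .withColumn("address_line1", F.regexp_replace(F.col("address_line1"), r"\bSt\.", "Street"))
)
```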

    Data Cleansing Tools

    Utilize cloud-native tools like Azure Data Factory and Azure Databricks for scalable and automated data cleansing: 

    • Azure Data Factory (ADF): Use Mapping Data Flows for transformations such as null handling, type casting, trimming, deduplication, and conditional logic, all without writing code. 
    • Azure Databricks (PySpark): Ideal for large-scale cleansing tasks like regex-based cleaning, string normalization, hash generation for reconciliation, and complex data transformations. 

    Validation After Migration

    Implement checks post-migration to validate that the data in Azure SQL Hyperscale matches what was in DB2. This can include reconciliation reports (row counts, column aggregates, and row-level hash comparisons) as well as sampling methodologies; a minimal sketch follows.
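    This sketch assumes the Hyperscale copy is reachable from Databricks over the standard Spark JDBC reader; the connection details, table, and compared columns below are placeholders:

```python
from pyspark.sql import functions as F

src = spark.table("staging.orders")  # hypothetical staged DB2 extract
tgt = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://<server>.database.windows.net;databaseName=<db>")  # placeholder
    .option("dbtable", "dbo.orders")
    .option("user", "<user>").option("password", "<secret>")  # use a secret scope in practice
    .load()
)

# 1. Row-count reconciliation.
print(f"source rows={src.count()}, target rows={tgt.count()}")

# 2. Column-level aggregates on key numeric/date fields.
src.agg(F.sum("order_amount"), F.min("order_ts"), F.max("order_ts")).show()
tgt.agg(F.sum("order_amount"), F.min("order_ts"), F.max("order_ts")).show()

# 3. Row-hash comparison to pinpoint rows that differ (or exist on only one side).
cols = ["order_id", "customer_id", "order_amount", "order_ts"]
src_h = src.select("order_id", F.sha2(F.concat_ws("|", *cols), 256).alias("src_hash"))
tgt_h = tgt.select("order_id", F.sha2(F.concat_ws("|", *cols), 256).alias("tgt_hash"))
mismatches = (
    src_h.join(tgt_h, "order_id", "full_outer")
    .filter(F.col("src_hash").isNull()
            | F.col("tgt_hash").isNull()
            | (F.col("src_hash") != F.col("tgt_hash")))
)
print(f"mismatched or missing rows: {mismatches.count()}")
```

    Hash comparison only works if both sides render values identically after type conversion (decimal scale, timestamp precision, trailing spaces), so apply the same standardization to both sides before hashing.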

    Documentation

    Maintain detailed documentation of your data transformation processes and any cleaning steps taken. This can be crucial for audits and reconciliation.

    By following these steps, you can enhance data quality and maintain consistency between your old DB2 environment and the new Azure SQL Hyperscale setup. 

    I hope this information helps. Please do let us know if you have any further queries.

    Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.

    Thank you.

