Data Engineering

Building a Self-Healing MongoDB-to-BigQuery ETL Engine in Spring Batch

How I engineered a highly resilient, schema-evolution-aware ETL pipeline using Spring Batch that replicates schema-less MongoDB documents into structured BigQuery tables in Full and Incremental watermark modes.

#Spring Boot #Spring Batch #MongoDB #BigQuery #ETL #Schema Evolution #Self-Healing

Executive Summary

When replicating distributed transactional data from MongoDB (a schema-less NoSQL database) into BigQuery (a strictly structured relational data warehouse), the biggest challenges are structural changes at the source (dynamic schemas) and data type conflicts (polymorphic fields, e.g., a userId column that is sometimes a string and sometimes an object). In legacy systems, any schema change in MongoDB would stall the ETL pipeline and require manual column mapping configurations.

To solve this, I designed and built an intelligent MongoDB-to-BigQuery ETL engine based on Spring Boot & Spring Batch. It is asynchronously controlled by upstream watermarks, supports both Full Load and Incremental modes, and self-heals when data type conflicts are detected (polymorphic self-healing).


Core Architecture & Key Features

  1. Dynamic Schema Observation & Evolution: The ETL engine actively analyzes incoming MongoDB document structures using a dynamic Spring Batch ItemReader. If a new column is detected that does not yet exist in the target BigQuery table, the ETL automatically triggers an asynchronous DDL ALTER TABLE ADD COLUMN command to BigQuery before loading new records. This eliminates manual schema management.
  2. Polymorphic Type Self-Healing: A classic MongoDB challenge is inconsistent data types in the same field. The ETL engine implements data type conflict detection inside a custom Spring Batch ItemProcessor. If a type conflict is found (e.g., String vs Object), the processor automatically normalizes the type (e.g., serializing the object into a JSON string or flattening the fields) to prevent ingestion failures in BigQuery.
  3. Double Ingestion Modes (Full & Incremental):
    • Full Load Mode: A full replication of all MongoDB documents to bootstrap or refresh the target BigQuery tables from scratch.
    • Incremental Mode (Watermark-based): A compute-efficient replication that scans upstream updated timestamps (updated_at watermark) to process only the latest delta changes.
  4. CRM segment automation via MoEngage: This pipeline also serves automated segment syncs for the CRM team. Integrating BigQuery, GCS, and MoEngage, it automates daily marketing list syncs, saving 24 hours of manual labor per segment sync cycle.

Business Impact & Performance Metrics

  • Zero Maintenance Overhead: Eliminated all manual engineering intervention when product teams update MongoDB documents at the source.
  • High Ingestion Reliability: Reduced ingestion failure rates to near 0% through adaptive polymorphic type handling.