Data Engineering · May 11, 2026

Building a Self-Healing MongoDB-to-BigQuery ETL Engine in Spring Batch

How I engineered a highly resilient, schema-evolution-adaptive ETL pipeline to replicate schema-free MongoDB documents to BigQuery in Full and Incremental modes using Spring Batch.

#Spring Boot #Spring Batch #MongoDB #BigQuery #ETL #Schema Evolution #Self-Healing

Executive Summary

Replicating distributed transactional data from MongoDB (a schema-free, document-oriented NoSQL database) to Google Cloud BigQuery (a schema-strict relational data warehouse) presents unique challenges. The primary production bottlenecks are unpredictable schema evolution (dynamic schemas) and field type conflicts (polymorphic fields). For example, a userId field may be stored as an integer in legacy records but arrive as a string or a nested object in newer documents.

In our legacy system, whenever the product team released a feature that changed a document structure or modified field types in MongoDB, the ETL pipeline would fail instantly because BigQuery rejected the mismatched schema. This required manual intervention from a data engineer to adjust column mapping and rebuild tables.

To solve this problem permanently, I designed and built a custom asynchronous ETL engine using Spring Boot and Spring Batch. The engine replicates data efficiently via GCS staging, automatically handles BigQuery schema evolution, and heals itself from polymorphic type conflicts without interrupting running pipelines.

Parallel Multi-Collection Pipeline Architecture

The ETL engine leverages the asynchronous processing and multi-threading capabilities of Spring Batch to sync multiple MongoDB collections in parallel. The data pipeline for each collection is split into three sequential steps:

Read & Buffer Step (mongo_read_{collection}): Reads data from MongoDB in transactional chunks and writes them into a local JSON Lines (JSONL) file sequentially.
Upload Step (gcs_upload_{collection}): Uploads the local JSONL file to a Google Cloud Storage (GCS) staging bucket.
Load Step (bq_load_{collection}): Triggers a Google Cloud BigQuery Load Job API call to ingest the GCS file into production tables, handling schema validation and polymorphic type resolution dynamically.

This parallel processing flow is illustrated in the diagram below:

graph TD
    subgraph Job [Spring Batch: mongoGcsTransferJob]
        Split[SplitState: Parallel Execution] -->|Thread 1| Flow_A[Flow: collection_A]
        Split -->|Thread 2| Flow_B[Flow: collection_B]
        
        subgraph Flow_A [Flow: Collection A]
            Read_A[Step 1: mongo_read_A <br> Read Mongo -> Write local JSONL] -->|Success| Upload_A[Step 2: gcs_upload_A <br> Upload JSONL to GCS]
            Upload_A -->|Success| Load_A[Step 3: bq_load_A <br> Reconcile Schema -> BQ Load Job]
        end

        subgraph Flow_B [Flow: Collection B]
            Read_B[Step 1: mongo_read_B <br> Read Mongo -> Write local JSONL] -->|Success| Upload_B[Step 2: gcs_upload_B <br> Upload JSONL to GCS]
            Upload_B -->|Success| Load_B[Step 3: bq_load_B <br> Reconcile Schema -> BQ Load Job]
        end
    end

    style Split fill:#f9f,stroke:#333,stroke-width:1px
    style Flow_A fill:#f5f5f5,stroke:#aaa,stroke-width:1px
    style Flow_B fill:#f5f5f5,stroke:#aaa,stroke-width:1px
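
As a rough illustration of the Read & Buffer step's output format, the sketch below appends one chunk of documents to a JSONL stream, one JSON object per line. The class JsonlBuffer and its method are hypothetical names, not the engine's actual writer, and the sketch assumes each document has already been serialized to a single-line JSON string (for instance by the MongoDB driver).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.List;

/** Hypothetical helper for illustration; not the engine's real writer. */
final class JsonlBuffer {
    private JsonlBuffer() {}

    /**
     * Appends one chunk to the JSONL output, one JSON document per line.
     * Assumption: every element is an already serialized JSON object.
     */
    static void appendChunk(Writer out, List<String> jsonDocs) {
        try {
            for (String json : jsonDocs) {
                // A raw line break inside a record would split it into two invalid rows.
                if (json.indexOf('\n') >= 0 || json.indexOf('\r') >= 0) {
                    throw new IllegalArgumentException("JSONL records must be single-line");
                }
                out.write(json);
                out.write('\n');
            }
            out.flush(); // push the chunk out of the buffer at the chunk boundary
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Writing against a plain Writer keeps the sketch testable in memory; a real step would more likely wrap a buffered file writer opened once per collection.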

Two-Dimensional Watermark Ingestion (Chronological Slicing)

To execute Incremental Mode efficiently without losing data due to replication lags, the ETL engine dynamically pulls the latest watermark state from BigQuery before querying the source MongoDB collection.

If an incremental query filters on the update timestamp alone (updated_at > last_watermark), documents can be lost whenever several of them share the exact same millisecond timestamp and only a subset was loaded in the previous run.

To prevent this, I implemented a Two-Dimensional Watermark strategy combining the update timestamp (updated_at) and the unique document identifier (_id):

-- Query Dynamic Watermark in BigQuery
SELECT 
  FORMAT_TIMESTAMP('%FT%H:%M:%E6SZ', updated_at, 'UTC') AS max_watermark, 
  CAST(_id AS STRING) AS last_id 
FROM 
  `project_id.dataset_name.table_name` 
WHERE 
  updated_at IS NOT NULL 
ORDER BY 
  updated_at DESC, 
  CAST(_id AS STRING) DESC 
LIMIT 1;

Based on this watermark state, MongoGcsItemReader generates a dynamic MongoDB query using an OR logical operator:

Documents where updated_at is strictly greater than last_watermark_timestamp.
OR documents where updated_at matches last_watermark_timestamp but _id is lexicographically greater than last_watermark_id.

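This compound cursor closes the same-timestamp gap: documents that share the watermark timestamp but were not loaded in the previous run are picked up by the next one, while documents at or before the watermark position are not read again.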
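
The two conditions above can be sketched in plain Java. This is a minimal illustration, not the actual MongoGcsItemReader: the class CompoundWatermark is hypothetical, the filter is built as nested maps that mirror the shape of a MongoDB query document rather than with the driver's types, and _id is treated as a string (with the real driver, an ObjectId _id would need to be compared as an ObjectId, not as a string).

```java
import java.time.Instant;
import java.util.Date;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not the article's MongoGcsItemReader. */
final class CompoundWatermark {
    final Instant updatedAt; // parsed from the max_watermark column
    final String lastId;     // taken from the last_id column

    CompoundWatermark(String maxWatermark, String lastId) {
        // The watermark query emits ISO-8601 UTC text, which Instant.parse accepts.
        this.updatedAt = Instant.parse(maxWatermark);
        this.lastId = lastId;
    }

    /** Builds { $or: [ {updated_at: {$gt: ts}}, {updated_at: ts, _id: {$gt: id}} ] } as plain maps. */
    Map<String, Object> toFilter() {
        Date ts = Date.from(updatedAt); // MongoDB dates carry millisecond precision
        Map<String, Object> newer = Map.of("updated_at", Map.of("$gt", ts));
        Map<String, Object> tie = new LinkedHashMap<>();
        tie.put("updated_at", ts);
        tie.put("_id", Map.of("$gt", lastId));
        return Map.of("$or", List.of(newer, tie));
    }

    /** The same rule as an in-memory predicate, convenient for unit tests. */
    boolean isAfter(Instant docUpdatedAt, String docId) {
        int cmp = docUpdatedAt.compareTo(updatedAt);
        return cmp > 0 || (cmp == 0 && docId.compareTo(lastId) > 0);
    }
}
```

For the cursor to advance correctly, the read presumably also has to return documents ordered by updated_at and then _id, ascending; that ordering is an assumption here, since the article does not show the reader's sort.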

Polymorphic Type Self-Healing & Schema Reconciliation

When a schema or type conflict arises between the existing BigQuery table and the newly compiled MongoDB documents, the engine resolves it according to the configured schemaConflictPolicy:

BqPolymorphicDetector (Sampling Stage): Before processing the JSONL file, the detector samples the first 1000 records from the GCS bucket. If it detects that a field contains multiple active data types (e.g., the field payload is a STRING in some rows and a nested RECORD in others), it flags the path as polymorphic.
Schema Coercion: The detected polymorphic field is coerced to a simple STRING type in BigQuery. The nested JSON value is serialized to a standard string representation to allow the BigQuery Load Job to succeed.
Companion Field Resolution (__str Suffix): If a type conflict occurs when comparing the newly inferred schema against the existing BigQuery table, the engine does not abort the run. Instead, it dynamically adds a companion field with a __str suffix (e.g., if the original field userId is INTEGER, a new field userId__str of type STRING is created via an ALTER TABLE DDL statement). Arriving data with conflicting types is rerouted to this companion field.
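
The sampling stage can be approximated with a small, self-contained sketch. It is not the real BqPolymorphicDetector: the class PolymorphicSampler is a hypothetical name, rows are assumed to be already parsed into maps, and the type names are a coarse stand-in for BigQuery's (in particular, INTEGER versus FLOAT is treated as a conflict here, which is a simplification).

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical sketch of sample-based polymorphism detection. */
final class PolymorphicSampler {
    private PolymorphicSampler() {}

    /** Maps a parsed JSON value to a coarse type name; null carries no type information. */
    static String typeOf(Object v) {
        if (v == null) return null;
        if (v instanceof Map) return "RECORD";
        if (v instanceof List) return "ARRAY";
        if (v instanceof Boolean) return "BOOLEAN";
        if (v instanceof Integer || v instanceof Long) return "INTEGER";
        if (v instanceof Number) return "FLOAT";
        return "STRING";
    }

    /** Returns the dotted field paths that show more than one type within the sample. */
    static Set<String> detect(List<Map<String, Object>> rows, int sampleSize) {
        Map<String, Set<String>> seen = new TreeMap<>();
        rows.stream().limit(sampleSize).forEach(row -> collect("", row, seen));
        Set<String> polymorphic = new TreeSet<>();
        seen.forEach((path, types) -> {
            if (types.size() > 1) polymorphic.add(path);
        });
        return polymorphic;
    }

    private static void collect(String prefix, Map<String, Object> row, Map<String, Set<String>> seen) {
        for (Map.Entry<String, Object> e : row.entrySet()) {
            String path = prefix.isEmpty() ? e.getKey() : prefix + "." + e.getKey();
            String type = typeOf(e.getValue());
            if (type == null) continue;
            seen.computeIfAbsent(path, k -> new TreeSet<>()).add(type);
            if (e.getValue() instanceof Map) {
                // Recurse so nested paths such as payload.k are tracked as well.
                @SuppressWarnings("unchecked")
                Map<String, Object> nested = (Map<String, Object>) e.getValue();
                collect(path, nested, seen);
            }
        }
    }
}
```

Because only a prefix of the file is sampled, a conflict that first appears later in the file would go unnoticed at this stage; that case is what the companion-field path is for.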

Here is the key implementation of BqSchemaReconciler that generates these companion fields upon type conflict:

private void handleConflictByPolicy(
        Map<String, Field> merged,
        Map<String, String> canonicalNameByLowerCase,
        Field existingField,
        Field inferredField,
        String fieldPath,
        String schemaConflictPolicy) {
    
    if ("fail-fast".equalsIgnoreCase(schemaConflictPolicy)) {
        throw new IllegalStateException(String.format(
                "Schema conflict on %s: existingType=%s, inferredType=%s",
                fieldPath, existingField.getType(), inferredField.getType()));
    }

    // Coerce-to-string policy: create a companion STRING field.
    // Locale.ROOT (java.util.Locale) keeps the case-insensitive lookup key
    // stable regardless of the JVM's default locale.
    String companionName = existingField.getName() + "__str";
    String companionKey = companionName.toLowerCase(Locale.ROOT);
    String existingCompanionName = canonicalNameByLowerCase.get(companionKey);

    // Only add the companion once; later conflicts on the same field reuse it
    if (existingCompanionName == null) {
        Field companionField = Field.newBuilder(companionName, LegacySQLTypeName.STRING)
                .setMode(Field.Mode.NULLABLE)
                .build();
        merged.put(companionName, companionField);
        canonicalNameByLowerCase.put(companionKey, companionName);
        log.warn("Schema type conflict on {}: existingType={}, inferredType={}. Added companion field '{}' (STRING)",
                fieldPath, existingField.getType(), inferredField.getType(), companionName);
    }
}
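
The method above only adds the companion column to the schema; the rows still have to be rewritten so that conflicting values land in it. A minimal sketch of that rerouting follows. The class CompanionRouter is hypothetical, existing column types are modeled as Java classes purely for illustration, and String.valueOf stands in for whatever serialization the engine actually applies.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of rerouting conflicting values to a __str companion field. */
final class CompanionRouter {
    private CompanionRouter() {}

    /**
     * @param row           one parsed document
     * @param existingTypes column name -> Java class expected by the existing table
     *                      (an illustrative stand-in for the BigQuery column type)
     */
    static Map<String, Object> route(Map<String, Object> row, Map<String, Class<?>> existingTypes) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : row.entrySet()) {
            Class<?> expected = existingTypes.get(e.getKey());
            Object value = e.getValue();
            if (expected == null || value == null || expected.isInstance(value)) {
                out.put(e.getKey(), value); // no conflict: keep the original column
            } else {
                // Conflict: leave the original column unset and fill the companion instead.
                out.put(e.getKey() + "__str", String.valueOf(value));
            }
        }
        return out;
    }
}
```

Leaving the original column out of the row means it loads as NULL for that record, which is why the companion field and the original field both need to be NULLABLE.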

Spring Batch Parallel Job Orchestration

Below is a snippet of the central BatchConfigurationMongoGcs configuration, orchestrating parallel execution across collections using SplitState and connecting reader/writer components:

@Configuration
public class BatchConfigurationMongoGcs {

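    private final JobRepository jobRepository;
    private final PlatformTransactionManager transactionManager;
    private final MongoGcsTableResolverService resolverService;

    // Constructor injection initializes the final collaborators
    public BatchConfigurationMongoGcs(JobRepository jobRepository,
                                      PlatformTransactionManager transactionManager,
                                      MongoGcsTableResolverService resolverService) {
        this.jobRepository = jobRepository;
        this.transactionManager = transactionManager;
        this.resolverService = resolverService;
    }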

    // Executor for parallel collection processing
    @Bean("mongoGcsTaskExecutor")
    @JobScope
    public TaskExecutor mongoGcsTaskExecutor(@Value("#{jobParameters['parallelismLevel']}") String parallelism) {
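        // Default to 2 worker threads when the job parameter is absent
        int coreSize = parallelism != null ? Integer.parseInt(parallelism) : 2;
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(coreSize);
        // maxPoolSize must never be smaller than corePoolSize, or initialize() fails
        executor.setMaxPoolSize(Math.max(coreSize, 10));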
        executor.setThreadNamePrefix("mongo-gcs-");
        executor.initialize();
        return executor;
    }

    @Bean("mongoGcsTransferJob")
    public Job mongoGcsTransferJob(List<MongoGcsCollectionConfig> configs, TaskExecutor executor) {
        JobBuilder jobBuilder = new JobBuilder("mongoGcsTransferJob", jobRepository)
                .incrementer(new RunIdIncrementer());

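        // Create flows for each enabled collection
        List<Flow> flows = configs.stream()
                .filter(MongoGcsCollectionConfig::getEnabled)
                .map(this::buildCollectionFlow)
                .toList();

        // Fail with a clear message instead of an IndexOutOfBoundsException below
        if (flows.isEmpty()) {
            throw new IllegalStateException("No enabled collections configured for mongoGcsTransferJob");
        }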

        // Split flows across threads for parallel execution
        return jobBuilder
                .start(flows.get(0))
                .split(executor)
                .add(flows.subList(1, flows.size()).toArray(new Flow[0]))
                .end()
                .build();
    }

    private Flow buildCollectionFlow(MongoGcsCollectionConfig config) {
        String name = config.getName();
        Step readStep = new StepBuilder("mongo_read_" + name, jobRepository)
                .<Map<String, Object>, Map<String, Object>>chunk(config.getChunkSize(), transactionManager)
                .reader(resolverService.createReader(config))
                .writer(resolverService.createWriter(config))
                .build();

        Step gcsUploadStep = buildGcsUploadStep(config);
        Step bqLoadStep = buildBqLoadStep(config);

        return new FlowBuilder<SimpleFlow>("flow_" + name)
                .start(readStep)
                .next(gcsUploadStep)
                .next(bqLoadStep)
                .build();
    }
}
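
One sizing detail of the executor is worth illustrating. ThreadPoolTaskExecutor defaults to an unbounded queue, and a thread pool with an unbounded queue never grows beyond its core size, so the parallelismLevel parameter effectively caps how many collection flows run at once. The sketch below reproduces that behaviour with the JDK's ThreadPoolExecutor directly; it is not Spring code, and PoolSizingSketch is a hypothetical name.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Hypothetical demo: with an unbounded queue, the pool stays at its core size. */
final class PoolSizingSketch {
    private PoolSizingSketch() {}

    /** Runs the given number of short tasks and reports the largest pool size observed. */
    static int largestPoolSize(int core, int max, int tasks) {
        // Unbounded queue, mirroring ThreadPoolTaskExecutor's default queue capacity.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                core, max, 60, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> {
                try {
                    Thread.sleep(20); // stand-in for one collection's flow
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("tasks did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return pool.getLargestPoolSize();
    }
}
```

With core 2, max 10, and six queued tasks, the pool tops out at two threads. If growth toward the maximum is wanted, a bounded queue capacity (setQueueCapacity on ThreadPoolTaskExecutor) or a larger core size would be the usual levers.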

Business Impact & Results

Developing this self-healing ETL engine generated measurable operational gains for the company’s data platform:

Zero Ingestion Failures: The polymorphic self-healing mechanism cut BigQuery loading errors from several incidents per month to zero.
Automated Schema Evolution: BigQuery production tables evolve dynamically without downtime, saving data engineers 15+ hours per week of manual DDL updates.
MoEngage Integration Efficiency: Reliable, timely dataset syncs keep user segmentation syncs to the MoEngage CRM running without interruption, reducing marketing campaign latency by 24 hours.
Fast, Scalable Performance: Spring Batch parallelism processes dozens of MongoDB collections simultaneously, loading millions of documents in under 15 minutes.