Data Architecture

Building an Incremental Watermark-Driven Reverse ETL with Spring Batch

How I designed and engineered a custom Java Reverse ETL application to sync processed analytics data from BigQuery back to Cloud SQL PostgreSQL using Spring Batch supporting Full, Incremental, and Merge modes.

#Spring Boot #Spring Batch #BigQuery #PostgreSQL #Cloud SQL #Reverse ETL #Watermark

Executive Summary

In modern data cycles, synchronizing processed insights from a centralized data warehouse (BigQuery) back into downstream operational databases (Cloud SQL PostgreSQL) is critical for driving real-time application behavior (e.g., updating merchant discount tiers or user stickiness scores in backend services). Re-syncing the entire dataset harian (Full Load) is highly inefficient, causing performance latencies and high BigQuery compute costs.

I designed and developed a standalone Reverse ETL daemon in Spring Boot and Spring Batch utilizing a Watermark Column Scan to synchronize delta changes incrementally via Full Load, Incremental, and Merge/Upsert modes.


Ingestion Flow & Technical Implementation

sequenceDiagram
    participant Postgres as Cloud SQL (PostgreSQL)
    participant Batch as Spring Batch Engine (Java)
    participant BQ as GCP BigQuery
    
    Batch->>Postgres: 1. Scan Max Watermark (e.g. SELECT MAX(updated_at))
    Postgres-->>Batch: Return max_watermark value
    Batch->>BQ: 2. Query dynamic delta (WHERE updated_at > max_watermark)
    BQ-->>Batch: Stream matching records (BigQueryItemReader)
    Batch->>Postgres: 3. Chunked upsert/write updates (PostgresItemWriter - Merge Mode)
  1. Spring Batch Chunk Processing: Leveraged the Spring Batch framework to implement chunk-oriented processing. This maintains low, stable JVM memory usage even when streaming millions of analytical rows.
  2. Flexible Ingestion Modes:
    • Full Load Mode: Completely refreshes operational SQL tables using a full sweep of BigQuery data.
    • Incremental Mode (Watermark-based): Scans the target database for the latest update timestamp (SELECT MAX(last_sync_time) FROM merchant_analytics_prod) and dynamically scopes the BigQuery read queries to delta rows only.
    • Merge (Upsert) Mode: Dynamically updates existing rows and inserts new ones in target PostgreSQL tables based on unique keys, minimizing primary key I/O overhead.
  3. Resilient Transaction Management: Leveraged Spring Batch’s transaction boundaries to ensure that if a network failure occurs mid-sync, database updates commit cleanly without causing data gaps or duplicate writes.

Business Impact & Success Metrics

  • Compute Optimization: Drastically reduced daily execution times and compute load for Reverse ETL tasks by executing delta-only syncs.
  • Operational Data Integrity: Operational user analytics and merchant indicators in AstraPay’s backend systems remain accurately synchronized harian with the central GCP data warehouse.