Building an Incremental Watermark-Driven Reverse ETL with Spring Batch
How I designed and engineered a custom Java Reverse ETL application to sync processed analytics data from BigQuery back to Cloud SQL PostgreSQL using Spring Batch supporting Full, Incremental, and Merge modes.
Executive Summary
In modern data cycles, synchronizing processed insights from a centralized data warehouse (BigQuery) back into downstream operational databases (Cloud SQL PostgreSQL) is critical for driving real-time application behavior (e.g., updating merchant discount tiers or user stickiness scores in backend services). Re-syncing the entire dataset harian (Full Load) is highly inefficient, causing performance latencies and high BigQuery compute costs.
I designed and developed a standalone Reverse ETL daemon in Spring Boot and Spring Batch utilizing a Watermark Column Scan to synchronize delta changes incrementally via Full Load, Incremental, and Merge/Upsert modes.
Ingestion Flow & Technical Implementation
sequenceDiagram
participant Postgres as Cloud SQL (PostgreSQL)
participant Batch as Spring Batch Engine (Java)
participant BQ as GCP BigQuery
Batch->>Postgres: 1. Scan Max Watermark (e.g. SELECT MAX(updated_at))
Postgres-->>Batch: Return max_watermark value
Batch->>BQ: 2. Query dynamic delta (WHERE updated_at > max_watermark)
BQ-->>Batch: Stream matching records (BigQueryItemReader)
Batch->>Postgres: 3. Chunked upsert/write updates (PostgresItemWriter - Merge Mode)
- Spring Batch Chunk Processing: Leveraged the Spring Batch framework to implement chunk-oriented processing. This maintains low, stable JVM memory usage even when streaming millions of analytical rows.
- Flexible Ingestion Modes:
- Full Load Mode: Completely refreshes operational SQL tables using a full sweep of BigQuery data.
- Incremental Mode (Watermark-based): Scans the target database for the latest update timestamp (
SELECT MAX(last_sync_time) FROM merchant_analytics_prod) and dynamically scopes the BigQuery read queries to delta rows only. - Merge (Upsert) Mode: Dynamically updates existing rows and inserts new ones in target PostgreSQL tables based on unique keys, minimizing primary key I/O overhead.
- Resilient Transaction Management: Leveraged Spring Batch’s transaction boundaries to ensure that if a network failure occurs mid-sync, database updates commit cleanly without causing data gaps or duplicate writes.
Business Impact & Success Metrics
- Compute Optimization: Drastically reduced daily execution times and compute load for Reverse ETL tasks by executing delta-only syncs.
- Operational Data Integrity: Operational user analytics and merchant indicators in AstraPay’s backend systems remain accurately synchronized harian with the central GCP data warehouse.