Cloud Data Infrastructure

BigQuery Cost Intelligence: Optimizing Staging Pipelines and Dataform Incremental ETL

How I built an enterprise BigQuery cost monitoring system and re-engineered internal BigQuery-to-BigQuery ETL workflows using Dataform in Incremental Mode to slash query scan bills and compute costs.

#BigQuery #Dataform #SQLX #INFORMATION_SCHEMA #Cost Optimization #Looker Studio

Executive Summary

As transaction volumes scaled at AstraPay, BigQuery computing bills grew significantly month-over-month. The primary obstacles were a complete lack of visibility into high-cost queries and inefficient legacy BigQuery-to-BigQuery pipelines performing daily Full Refresh operations (re-scanning and overwriting historical tables).

I designed and implemented a centralized BigQuery Cost Monitoring dashboard and completely re-engineered our internal BigQuery-to-BigQuery workflows using Dataform in Incremental Mode, dramatically reducing data scan volumes and operational costs.


Technical Approach & Optimization Steps

1. Real-Time Cost Audit via INFORMATION_SCHEMA

I built a real-time query cost tracking pipeline by querying GCP BigQuery’s internal audit metadata:

  • Extracted query scan metrics (bytes_billed) directly from INFORMATION_SCHEMA.JOBS_BY_PROJECT.
  • Segmented cost data by user_email (query trigger identity), target dataset, billing project, and table.
  • Created an interactive Looker Studio dashboard to detect query anomalies, inefficient joins, and expensive ad-hoc analysis.

2. BigQuery-to-BigQuery Incremental ETL using Dataform

Rather than dropping and re-populating target tables every night, I migrated the internal SQL transform pipelines to Dataform SQLX Incremental Models:

  • SQLX Optimization: Configured conditional syntax using ${when(incremental(), ...)}.
  • Delta Processing: Programmed pipelines to only fetch records with update timestamps newer than the maximum value in the current target table (WHERE updated_at > (SELECT MAX(updated_at) FROM ${self()})).
  • Ingestion Pipeline Restructuring: Shifted Datastream CDC ingestion from dynamic Merge operations to an Append staging model on BigQuery, utilizing scheduled Dataform jobs to incrementally reconcile data into production.

Business Impact & Success Metrics

  • Data Scan Reduction: Eliminated the need for daily full scans of terabyte-scale historical tables. Drastically minimized query scan volumes, leading to a significant reduction in monthly GCP warehousing bills.
  • Instant Dashboard Rendering: With incremental builds and pre-computed staging, business intelligence reports in Looker Studio render metrics instantly without loading delays.