The dare: rebuild data engineering from first principles in 8 weeks, ~10–15 hrs/week, with today’s tooling — and write up every step.

Start date: 2026-06-01 · Target finish: 2026-07-26 · Cohort reference: DataTalksClub DE Zoomcamp

Progress at a glance

| Wk | Focus | Tools | Notes | Write-up | Status |
|----|-------|-------|-------|----------|--------|
| 1 | Containers & IaC | Docker, Compose, Postgres, Terraform, GCP | notes | | 🔜 |
| 2 | Workflow orchestration | Kestra, data lakes | | | |
| 3 | Data ingestion workshop | dlt, REST APIs, incremental loads | | | |
| 4 | Data warehousing | BigQuery, partitioning, clustering, BQML | | | |
| 5 | Analytics engineering | dbt, DuckDB, BigQuery, tests, docs | | | |
| 6 | Data platform end-to-end | Bruin, data quality | | | |
| 7 | Batch processing | Apache Spark, DataFrames, SQL | | | |
| 8 | Streaming + capstone kickoff | Kafka, Kafka Streams, KSQL, Avro | | | |
| 9+ | Capstone project | everything above | | | |

Legend: ✅ done · 🟡 in progress · 🔜 next · ⬜ not started


The senior’s angle

I’m not learning these concepts cold — I’m re-deriving them and stress-testing what I think I know. For each module the lens is:

  1. Build it the course way — no shortcuts, do the homework.
  2. What’s genuinely new? — Kestra, dlt, DuckDB, and Bruin didn’t exist (or weren’t mainstream) when I learned this. Note what’s changed.
  3. Where would this break at scale? — connect each toy pipeline back to production reality.
  4. Teach it back — the weekly write-up is the test. If I can’t explain it simply, I didn’t relearn it.

Week 1 — Containerization & Infrastructure as Code

Goal: a reproducible local data stack and cloud infra defined as code.

  • Dockerize a Postgres + ingestion script; load NYC taxi data
  • Compose the stack (Postgres + pgAdmin) with docker compose
  • GCP project + service account + IAM (least privilege)
  • Terraform: GCS bucket + BigQuery dataset, plan/apply/destroy
  • Homework submitted
  • Write-up: “What Terraform state actually buys you”

Re-derive: image layers & caching, why Compose networks let containers resolve by service name, Terraform’s plan/apply/state lifecycle.
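
To keep myself honest on the plan/apply/state lifecycle, here is a minimal sketch of the idea in plain Python. It is a toy model, not Terraform's real algorithm or API: the `plan` and `apply` functions and the resource addresses are my own assumptions for illustration. The point it shows is that a plan is just a diff between desired config and recorded state.

```python
# Toy model of plan/apply: NOT Terraform's real algorithm or API.
from typing import Dict, List, Tuple

Resources = Dict[str, dict]  # resource address -> attributes


def plan(desired: Resources, state: Resources) -> List[Tuple[str, str]]:
    """Diff desired config against recorded state; return (action, address) pairs."""
    actions = []
    for address, attrs in desired.items():
        if address not in state:
            actions.append(("create", address))
        elif state[address] != attrs:
            actions.append(("update", address))
    for address in state:
        if address not in desired:
            actions.append(("delete", address))
    return actions


def apply(desired: Resources, state: Resources) -> Resources:
    """Pretend to execute the plan; the returned state mirrors the desired config."""
    new_state = dict(state)
    for action, address in plan(desired, state):
        if action == "delete":
            del new_state[address]
        else:
            new_state[address] = dict(desired[address])
    return new_state


# Hypothetical resource addresses and attributes, for illustration only.
desired = {
    "google_storage_bucket.lake": {"location": "EU"},
    "google_bigquery_dataset.taxi": {"location": "EU"},
}
state: Resources = {}

first_plan = plan(desired, state)    # two creates: nothing is recorded yet
state = apply(desired, state)
second_plan = plan(desired, state)   # empty: state already matches config
destroy_plan = plan({}, state)       # empty config means delete everything
```

Without the recorded state, the second plan could not tell "already created" from "never created", which is the core of what the Week 1 write-up needs to explain.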

Week 2 — Workflow Orchestration (Kestra)

Goal: schedule and orchestrate the ingestion pipeline; land data in a lake.

  • Kestra up via Docker; first declarative (YAML) flow
  • Parameterized + scheduled flow; backfills
  • Load taxi data → GCS (data lake) → BigQuery
  • Homework submitted
  • Write-up: “Kestra vs. the Airflow muscle memory”

Re-derive: idempotency, scheduling vs. event triggers, DAG semantics, backfill correctness.
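
A minimal sketch of why idempotency and backfill correctness go together, in plain Python rather than Kestra YAML. The in-memory `lake`, the `extract` stand-in, and the function names are assumptions of mine. Each run overwrites the partition for its logical date instead of appending, so re-running an overlapping range changes nothing.

```python
from datetime import date, timedelta
from typing import Dict, List

# Hypothetical in-memory "lake": one partition per logical date.
Lake = Dict[str, List[dict]]


def extract(logical_date: date) -> List[dict]:
    """Stand-in for a real source query; output depends only on the logical date."""
    return [{"trip_date": logical_date.isoformat(), "trip_id": i} for i in range(3)]


def run(lake: Lake, logical_date: date) -> None:
    """Overwrite (not append to) the partition for this logical date."""
    lake[logical_date.isoformat()] = extract(logical_date)


def backfill(lake: Lake, start: date, end: date) -> None:
    """Re-run every logical date in [start, end], inclusive."""
    day = start
    while day <= end:
        run(lake, day)
        day += timedelta(days=1)


lake: Lake = {}
backfill(lake, date(2026, 6, 1), date(2026, 6, 3))
snapshot = {key: list(rows) for key, rows in lake.items()}

# Overlapping re-run: with overwrite semantics the lake is unchanged.
backfill(lake, date(2026, 6, 2), date(2026, 6, 3))
```

Swapping the assignment in `run` for an append would double the rows for June 2 and 3 on the second backfill, which is the failure mode I want to be able to spot in a real flow.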

Week 3 — Data Ingestion Workshop (dlt)

Goal: robust, scalable ingestion from APIs.

  • Consume a paginated REST API with dlt
  • Schema inference & normalization into nested tables
  • Incremental / merge loads (only new or changed rows)
  • Homework submitted
  • Write-up: “Incremental loading patterns, ranked”

Re-derive: full vs. incremental vs. CDC, idempotent upserts, schema evolution.
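
One way to sketch the cursor-plus-upsert pattern, in plain Python rather than dlt's own API. The `updated_at` cursor column, the `id` primary key, and all function names are assumptions for illustration. Rows newer than the saved cursor are extracted, then merged by primary key so that replays and late corrections do not create duplicates.

```python
from typing import Dict, Iterable, List

Row = dict


def incremental_extract(source: Iterable[Row], cursor: str) -> List[Row]:
    """Return only rows whose updated_at is strictly newer than the saved cursor."""
    return [row for row in source if row["updated_at"] > cursor]


def merge(target: Dict[int, Row], rows: Iterable[Row]) -> None:
    """Upsert by primary key: insert new ids, overwrite existing ones."""
    for row in rows:
        target[row["id"]] = row


def load(source: List[Row], target: Dict[int, Row], cursor: str) -> str:
    """Run one incremental load and return the advanced cursor."""
    new_rows = incremental_extract(source, cursor)
    merge(target, new_rows)
    return max((row["updated_at"] for row in new_rows), default=cursor)


# ISO-8601 timestamps compare correctly as strings.
source = [
    {"id": 1, "updated_at": "2026-06-01T10:00:00", "fare": 10.0},
    {"id": 2, "updated_at": "2026-06-01T11:00:00", "fare": 7.5},
]
target: Dict[int, Row] = {}
cursor = load(source, target, "")

# A late correction to id 1 and a brand-new row arrive.
source[0] = {"id": 1, "updated_at": "2026-06-02T09:00:00", "fare": 12.0}
source.append({"id": 3, "updated_at": "2026-06-02T09:30:00", "fare": 4.0})
cursor = load(source, target, cursor)

# Re-running with nothing new is a no-op.
cursor_after_noop = load(source, target, cursor)
```

A plain append-only load would now hold two versions of trip 1; the merge keeps one row per key, which is what makes the load safe to retry.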

Week 4 — Data Warehousing (BigQuery)

Goal: model for cost and speed in a columnar warehouse.

  • External vs. native tables
  • Partitioning + clustering — measure bytes scanned before/after
  • Query cost & performance tuning
  • Touch BigQuery ML (CREATE MODEL)
  • Homework submitted
  • Write-up: “Partitioning vs. clustering: when each actually helps”

Re-derive: columnar storage, why pruning beats indexing here, on-demand (bytes-scanned) vs. slot-based (capacity) pricing.
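
A back-of-the-envelope model of why pruning matters in a columnar warehouse. This is a toy calculation in Python with made-up sizes, not a BigQuery API call: the table layout, column names, and byte counts are all assumptions. It only shows how column selection and a partition filter each shrink the bytes a query has to read.

```python
from typing import Dict, List, Optional

# Hypothetical table layout: partition date -> column name -> bytes stored.
Table = Dict[str, Dict[str, int]]


def bytes_scanned(table: Table, columns: List[str], partition: Optional[str] = None) -> int:
    """Sum bytes for the selected columns, over one partition or the whole table."""
    partitions = [partition] if partition is not None else list(table)
    return sum(table[p][c] for p in partitions for c in columns)


# 30 daily partitions with made-up per-column sizes.
table: Table = {
    f"2026-06-{day:02d}": {"pickup_ts": 8_000, "fare": 8_000, "notes": 84_000}
    for day in range(1, 31)
}

full_scan = bytes_scanned(table, ["pickup_ts", "fare", "notes"])      # like SELECT *
column_pruned = bytes_scanned(table, ["fare"])                         # one column, all days
partition_pruned = bytes_scanned(table, ["fare"], partition="2026-06-15")  # one column, one day
```

In this made-up layout, selecting one column cuts the scan from 3,000,000 to 240,000 bytes, and filtering to a single partition cuts it again to 8,000. The real before/after numbers come from the query plan, which is what the homework measures.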

Week 5 — Analytics Engineering (dbt)

Goal: turn raw tables into tested, documented, deployable models.

  • dbt project against DuckDB locally, then BigQuery
  • Staging → marts layering; sources, refs, seeds
  • Tests (generic + singular) and docs site
  • Deployment / scheduled run
  • Homework submitted
  • Write-up: “dbt as the discipline I should’ve always had”

Re-derive: ELT vs. ETL, dimensional modeling, DAG of refs, test-as-contract.
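
The "DAG of refs" idea can be sketched with the standard library alone. This is not how dbt parses projects internally; the regex, the model names, and the SQL strings are hypothetical. It extracts `ref('...')` calls from each model and topologically sorts them to get a valid build order.

```python
import re
from graphlib import TopologicalSorter  # Python 3.9+
from typing import Dict, List, Set

REF_PATTERN = re.compile(r"ref\(\s*['\"](\w+)['\"]\s*\)")

# Hypothetical model names and SQL, for illustration only.
models: Dict[str, str] = {
    "mart_daily_revenue": "select trip_date, sum(fare) from {{ ref('fct_trips') }} group by 1",
    "fct_trips": "select * from {{ ref('stg_trips') }} t join {{ ref('stg_zones') }} z using (zone_id)",
    "stg_trips": "select * from raw.trips",
    "stg_zones": "select * from raw.zones",
}


def dependencies(sql: str) -> Set[str]:
    """Names of the models this SQL refers to via ref()."""
    return set(REF_PATTERN.findall(sql))


def build_order(project: Dict[str, str]) -> List[str]:
    """Return model names ordered so every model comes after its dependencies."""
    graph = {name: dependencies(sql) for name, sql in project.items()}
    return list(TopologicalSorter(graph).static_order())


order = build_order(models)
```

Staging models land before `fct_trips`, which lands before the mart, regardless of the order the files were declared in. A circular ref makes `static_order()` raise `graphlib.CycleError`, which mirrors why a cycle between models cannot be built.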

Week 6 — Data Platform End-to-End (Bruin)

Goal: one tool, full pipeline — ingest, transform, quality, deploy.

  • Build an end-to-end Bruin pipeline to BigQuery
  • Built-in data quality checks
  • Cloud deployment
  • Homework submitted
  • Write-up: “Where an all-in-one platform helps vs. best-of-breed”

Re-derive: data quality dimensions, contracts, the build-vs-buy line.
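
A minimal sketch of three common quality dimensions (completeness, uniqueness, validity) written as plain Python checks. This is not Bruin's check syntax; the check names, columns, and sample rows are assumptions of mine. It shows the contract idea: a batch either passes every check or reports exactly what failed.

```python
from typing import Callable, List, Set

Row = dict
Check = Callable[[List[Row]], List[str]]


def not_null(column: str) -> Check:
    """Completeness: the column must be present and not None in every row."""
    def check(rows: List[Row]) -> List[str]:
        bad = sum(1 for r in rows if r.get(column) is None)
        return [f"{column}: {bad} null value(s)"] if bad else []
    return check


def unique(column: str) -> Check:
    """Uniqueness: no value may appear more than once."""
    def check(rows: List[Row]) -> List[str]:
        values = [r.get(column) for r in rows]
        dupes = len(values) - len(set(values))
        return [f"{column}: {dupes} duplicate value(s)"] if dupes else []
    return check


def accepted_values(column: str, allowed: Set[str]) -> Check:
    """Validity: every value must belong to a known set."""
    def check(rows: List[Row]) -> List[str]:
        bad = sum(1 for r in rows if r.get(column) not in allowed)
        return [f"{column}: {bad} value(s) outside accepted set"] if bad else []
    return check


def validate(rows: List[Row], checks: List[Check]) -> List[str]:
    """Run every check and collect failure messages; empty list means pass."""
    failures: List[str] = []
    for check in checks:
        failures.extend(check(rows))
    return failures


checks = [not_null("fare"), unique("trip_id"), accepted_values("payment_type", {"card", "cash"})]

# Made-up rows: one null fare, one duplicate id, one unknown payment type.
rows = [
    {"trip_id": 1, "payment_type": "card", "fare": 10.0},
    {"trip_id": 2, "payment_type": "cash", "fare": None},
    {"trip_id": 2, "payment_type": "barter", "fare": 5.0},
]
failures = validate(rows, checks)
```

The interesting design question for the write-up is where such checks run: blocking the pipeline before publish, or only alerting after the fact.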

Week 7 — Batch Processing (Apache Spark)

Goal: process data that doesn’t fit on one machine.

  • Spark DataFrames + Spark SQL
  • GroupBy and Join internals (shuffles, partitions)
  • Run a job on the taxi dataset
  • Homework submitted
  • Write-up: “Reading a Spark execution plan without fear”

Re-derive: lazy evaluation, narrow vs. wide transforms, shuffle cost, skew.
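
To make shuffle cost and skew concrete, here is a toy hash-partitioned group-by in plain Python. It is not Spark code and does not reproduce Spark's partitioner: `zlib.crc32` is a stand-in hash, and the airport keys and counts are made up. It shows that a wide transform must route every row for a key to the same partition, so one hot key produces one overloaded partition.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, float]  # (key, value)


def partition_for(key: str, num_partitions: int) -> int:
    """Deterministic hash partitioner (crc32 instead of Python's salted hash())."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def shuffle(records: Iterable[Record], num_partitions: int) -> List[List[Record]]:
    """Route each record to the partition that owns its key."""
    partitions: List[List[Record]] = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


def group_sum(records: Iterable[Record], num_partitions: int) -> Dict[str, float]:
    """Group-by-sum: shuffle first, then aggregate each partition independently."""
    totals: Dict[str, float] = {}
    for part in shuffle(records, num_partitions):
        local: Dict[str, float] = defaultdict(float)
        for key, value in part:
            local[key] += value
        totals.update(local)  # safe: each key lives in exactly one partition
    return totals


# Made-up skewed input: one hot key dominates.
records: List[Record] = [("JFK", 1.0)] * 90 + [("LGA", 1.0)] * 5 + [("EWR", 1.0)] * 5

partition_sizes = [len(p) for p in shuffle(records, 4)]
totals = group_sum(records, 4)
```

Whatever the hash, at least 90 of the 100 rows end up in a single partition here, so adding partitions does not help. That is the intuition behind salting hot keys or pre-aggregating before the shuffle.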

Week 8 — Streaming (Kafka) + Capstone kickoff

Goal: move from batch to unbounded data; scope the capstone.

  • Kafka producers/consumers; topics & partitions
  • Kafka Streams / KSQL
  • Avro + schema registry
  • Homework submitted
  • Capstone proposal drafted
  • Write-up: “Exactly-once is a lie I now understand”

Re-derive: log-based messaging, partitions & ordering, delivery semantics, windowing.
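
A small simulation of delivery semantics, in plain Python with no Kafka client involved. The log, the `consume` function, and the crash hook are assumptions for illustration. The consumer commits its offset only after processing, so a crash between the two causes redelivery (at-least-once); deduplicating on offset in the sink is what turns that into an effectively-once result.

```python
from typing import Dict, List, Optional, Set, Tuple

Message = Tuple[int, str, float]  # (offset, account, amount)

# Made-up log; offsets match list positions.
LOG: List[Message] = [(0, "a", 10.0), (1, "a", 5.0), (2, "b", 7.0)]


def consume(
    log: List[Message],
    start: int,
    balances: Dict[str, float],
    seen: Set[int],
    dedupe: bool = True,
    crash_after: Optional[int] = None,
) -> int:
    """Process messages from `start`; return the next offset that was committed."""
    committed = start
    for offset, account, amount in log[start:]:
        if not (dedupe and offset in seen):
            balances[account] = balances.get(account, 0.0) + amount
            seen.add(offset)
        if offset == crash_after:
            return committed  # crashed after processing, before committing
        committed = offset + 1
    return committed


def replay(dedupe: bool) -> Tuple[Dict[str, float], int]:
    """Crash once after processing offset 1, then restart from the committed offset."""
    balances: Dict[str, float] = {}
    seen: Set[int] = set()
    committed = consume(LOG, 0, balances, seen, dedupe, crash_after=1)
    committed = consume(LOG, committed, balances, seen, dedupe)
    return balances, committed


idempotent_balances, next_offset = replay(dedupe=True)
naive_balances, _ = replay(dedupe=False)
```

With deduplication, account `a` ends at 15.0; without it, the redelivered message is applied twice and `a` ends at 20.0. The delivery itself was at-least-once in both runs, and only the sink's idempotency differs.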

Week 9+ — Capstone Project

Goal: one end-to-end pipeline that uses the whole stack and gets peer-reviewed.

  • Pick a dataset + a real question
  • Batch (or streaming) ingestion → lake → warehouse
  • IaC + orchestration + dbt models + a dashboard
  • README, diagram, reproducible setup
  • Submit for peer review
  • Write-up: “What 8 weeks of relearning changed about how I build”