Roadmap & Progress

The dare: rebuild data engineering from first principles in 8 weeks, ~10–15 hrs/week, with today’s tooling — and write up every step.
Start date: 2026-06-01 · Target finish (weeks 1–8): 2026-07-26 · Cohort reference: DataTalksClub DE Zoomcamp

Progress at a glance

Wk	Focus	Tools	Notes	Write-up	Status
1	Containers & IaC	Docker, Compose, Postgres, Terraform, GCP	notes	—	🔜
2	Workflow orchestration	Kestra, data lakes	—	—	⬜
3	Data ingestion workshop	dlt, REST APIs, incremental loads	—	—	⬜
4	Data warehousing	BigQuery, partitioning, clustering, BQML	—	—	⬜
5	Analytics engineering	dbt, DuckDB, BigQuery, tests, docs	—	—	⬜
6	Data platform end-to-end	Bruin, data quality	—	—	⬜
7	Batch processing	Apache Spark, DataFrames, SQL	—	—	⬜
8	Streaming + capstone kickoff	Kafka, Kafka Streams, KSQL, Avro	—	—	⬜
9+	Capstone project	everything above	—	—	⬜

Legend: ✅ done · 🟡 in progress · 🔜 next · ⬜ not started

The senior’s angle

I’m not learning these concepts cold — I’m re-deriving them and stress-testing what I think I know. For each module the lens is:

Build it the course way — no shortcuts, do the homework.
What’s genuinely new? — Kestra, dlt, DuckDB, and Bruin didn’t exist (or weren’t mainstream) when I learned this. Note what’s changed.
Where would this break at scale? — connect each toy pipeline back to production reality.
Teach it back — the weekly write-up is the test. If I can’t explain it simply, I didn’t relearn it.

Week 1 — Containerization & Infrastructure as Code

Goal: a reproducible local data stack and cloud infra defined as code.

Dockerize a Postgres + ingestion script; load NYC taxi data
Compose the stack (Postgres + pgAdmin) with docker compose
GCP project + service account + IAM (least privilege)
Terraform: GCS bucket + BigQuery dataset, plan/apply/destroy
Homework submitted
Write-up: “What Terraform state actually buys you”

Re-derive: image layers & caching, why Compose networks let containers resolve by service name, Terraform’s plan/apply/state lifecycle.
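
A toy model helps me reason about the plan/apply/state lifecycle. This is a sketch in plain Python, not Terraform's actual algorithm: state is a dict of what was last applied, the config is a dict of what I want, and a plan is the diff between the two. The resource addresses and attributes are made up for illustration.

```python
from typing import Dict, List, Tuple

Resources = Dict[str, dict]  # resource address -> attributes


def plan(desired: Resources, state: Resources) -> List[Tuple[str, str]]:
    """Diff desired config against recorded state; returns (action, address) pairs."""
    actions = []
    for addr, attrs in desired.items():
        if addr not in state:
            actions.append(("create", addr))
        elif state[addr] != attrs:
            actions.append(("update", addr))
    for addr in state:
        if addr not in desired:
            actions.append(("delete", addr))
    return actions


def apply(desired: Resources, state: Resources) -> Resources:
    """Pretend to execute the plan and return the new state."""
    new_state = dict(state)
    for action, addr in plan(desired, state):
        if action == "delete":
            del new_state[addr]
        else:
            new_state[addr] = dict(desired[addr])
    return new_state


# Hypothetical resources, loosely mirroring the Week 1 bucket + dataset.
config = {
    "bucket.lake": {"location": "EU"},
    "dataset.taxi": {"location": "EU"},
}
state = apply(config, {})          # first apply creates both
second_plan = plan(config, state)  # nothing changed, so nothing to do
destroy_plan = plan({}, state)     # empty config means delete everything
```

The point of the toy: without recorded state there is nothing to diff against, so a tool could not tell "already created" from "needs creating", and could not know what to delete.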

Week 2 — Workflow Orchestration (Kestra)

Goal: schedule and orchestrate the ingestion pipeline; land data in a lake.

Kestra up via Docker; first declarative (YAML) flow
Parameterized + scheduled flow; backfills
Load taxi data → GCS (data lake) → BigQuery
Homework submitted
Write-up: “Kestra vs. the Airflow muscle memory”

Re-derive: idempotency, scheduling vs. event triggers, DAG semantics, backfill correctness.
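
Backfill correctness mostly comes down to two properties, which the sketch below tries to show in plain Python (this is not Kestra's API): each run is a function of its logical date only, and each run overwrites its own partition instead of appending. The `lake` dict and file names are hypothetical stand-ins for a bucket.

```python
from datetime import date
from typing import Dict, List

# Hypothetical "lake": one entry per month partition, overwritten on each run.
lake: Dict[str, List[str]] = {}


def month_starts(start: date, end: date) -> List[date]:
    """First day of every month from start to end, inclusive."""
    months = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        months.append(date(year, month, 1))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return months


def run_flow(logical_date: date) -> None:
    """One flow run. Output depends only on the logical date, and the write
    replaces the whole partition, so reruns cannot duplicate rows."""
    key = logical_date.strftime("%Y-%m")
    lake[key] = [f"trips_{key}.parquet"]


def backfill(start: date, end: date) -> None:
    """Run the flow once per month in the window."""
    for logical_date in month_starts(start, end):
        run_flow(logical_date)


backfill(date(2023, 11, 1), date(2024, 2, 1))
first = dict(lake)
backfill(date(2023, 11, 1), date(2024, 2, 1))  # rerun the same window
```

If `run_flow` appended instead of overwriting, or read "today" instead of the logical date, the second backfill would change the result.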

Week 3 — Data Ingestion Workshop (dlt)

Goal: robust, scalable ingestion from APIs.

Consume a paginated REST API with dlt
Schema inference & normalization of nested data into child tables
Incremental / merge loads (only new or changed rows)
Homework submitted
Write-up: “Incremental loading patterns, ranked”

Re-derive: full vs. incremental vs. CDC, idempotent upserts, schema evolution.
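
A minimal sketch of cursor-based incremental loading with an idempotent upsert, using the standard-library `sqlite3` module instead of dlt so the mechanics stay visible. The source rows, table name, and `updated_at` cursor column are assumptions for illustration.

```python
import sqlite3

# Hypothetical source rows: (id, updated_at, fare). Not a real API response.
SOURCE = [
    (1, "2024-01-01", 10.0),
    (2, "2024-01-02", 12.5),
    (2, "2024-01-03", 13.0),  # row 2 was corrected later
    (3, "2024-01-03", 8.0),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (id INTEGER PRIMARY KEY, updated_at TEXT, fare REAL)")


def load_incremental(cursor_value: str) -> str:
    """Upsert rows newer than the cursor; return the new cursor value."""
    batch = [row for row in SOURCE if row[1] > cursor_value]
    conn.executemany(
        """
        INSERT INTO trips (id, updated_at, fare) VALUES (?, ?, ?)
        ON CONFLICT(id) DO UPDATE SET
            updated_at = excluded.updated_at,
            fare = excluded.fare
        """,
        batch,
    )
    conn.commit()
    return max([cursor_value] + [row[1] for row in batch])


cursor_value = load_incremental("")            # initial load, everything is new
cursor_value = load_incremental(cursor_value)  # rerun: empty batch, no change
rows = conn.execute("SELECT id, updated_at, fare FROM trips ORDER BY id").fetchall()
```

Because the write is keyed on `id`, replaying the whole source leaves the table unchanged. One weakness worth noting: a strict `>` on the cursor can miss late rows that share the last-seen timestamp.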

Week 4 — Data Warehousing (BigQuery)

Goal: model for cost and speed in a columnar warehouse.

External vs. native tables
Partitioning + clustering — measure bytes scanned before/after
Query cost & performance tuning
Touch BigQuery ML (CREATE MODEL)
Homework submitted
Write-up: “Partitioning vs. clustering: when each actually helps”

Re-derive: columnar storage, why pruning beats indexing here, on-demand (bytes scanned) vs. slot-based pricing.
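
To see why pruning matters, here is a toy model in Python (not BigQuery itself): the same rows stored as one flat list and as date-keyed partitions, with "rows read" standing in for bytes scanned. The data is synthetic.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, int, float]  # (pickup_date, vendor_id, fare)

# Hypothetical trips, three per day for ten days.
ROWS: List[Row] = [
    (f"2024-01-{day:02d}", vendor, 10.0 + vendor)
    for day in range(1, 11)
    for vendor in (1, 2, 3)
]

# Partitioned layout: rows are physically grouped by pickup_date.
PARTITIONS: Dict[str, List[Row]] = {}
for row in ROWS:
    PARTITIONS.setdefault(row[0], []).append(row)


def scan_unpartitioned(target_date: str) -> Tuple[float, int]:
    """Full scan: every row is read, then filtered."""
    total = sum(fare for d, _, fare in ROWS if d == target_date)
    return total, len(ROWS)


def scan_partitioned(target_date: str) -> Tuple[float, int]:
    """Pruned scan: only the matching partition is read."""
    part = PARTITIONS.get(target_date, [])
    return sum(fare for _, _, fare in part), len(part)


full_total, full_rows_read = scan_unpartitioned("2024-01-05")
pruned_total, pruned_rows_read = scan_partitioned("2024-01-05")
```

Both scans return the same answer; the partitioned one reads 3 rows instead of 30. The real measurement for the week is the bytes-scanned figure BigQuery reports before and after.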

Week 5 — Analytics Engineering (dbt)

Goal: turn raw tables into tested, documented, deployable models.

dbt project against DuckDB locally, then BigQuery
Staging → marts layering; sources, refs, seeds
Tests (generic + singular) and docs site
Deployment / scheduled run
Homework submitted
Write-up: “dbt as the discipline I should’ve always had”

Re-derive: ELT vs. ETL, dimensional modeling, DAG of refs, test-as-contract.
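
The "DAG of refs" idea can be sketched without dbt: pull the `ref()` calls out of each model's SQL and topologically sort the result. This is my own illustration using Python's `graphlib` (3.9+), not how dbt is implemented, and the model names and SQL are hypothetical.

```python
import re
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical models; names and SQL are illustrative, not from the course.
MODELS = {
    "stg_trips": "select * from raw.trips",
    "stg_zones": "select * from raw.zones",
    "fct_trips": "select * from {{ ref('stg_trips') }} t "
                 "join {{ ref('stg_zones') }} z on t.zone_id = z.id",
    "mart_revenue": "select zone, sum(fare) from {{ ref('fct_trips') }} group by 1",
}

REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*['\"]([^'\"]+)['\"]\s*\)\s*\}\}")


def dependencies(models):
    """Map each model to the set of models it refs."""
    return {name: set(REF_PATTERN.findall(sql)) for name, sql in models.items()}


def build_order(models):
    """A valid run order: every model comes after the models it refs."""
    return list(TopologicalSorter(dependencies(models)).static_order())


order = build_order(MODELS)
```

A circular ref would make `static_order()` raise `graphlib.CycleError`, which is the same class of mistake a ref-based project has to reject.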

Week 6 — Data Platform End-to-End (Bruin)

Goal: one tool, full pipeline — ingest, transform, quality, deploy.

Build an end-to-end Bruin pipeline to BigQuery
Built-in data quality checks
Cloud deployment
Homework submitted
Write-up: “Where an all-in-one platform helps vs. best-of-breed”

Re-derive: data quality dimensions, contracts, the build-vs-buy line.
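
As a rough sketch of what "quality checks as a contract" means, the snippet below defines three named checks in plain Python and runs them against a small table. It is not Bruin's syntax; the check names, columns, and rows are assumptions chosen to trip two of the checks.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]

# Hypothetical rows with two deliberate problems.
ROWS: List[Row] = [
    {"trip_id": 1, "fare": 12.5},
    {"trip_id": 2, "fare": None},   # incomplete
    {"trip_id": 2, "fare": 7.0},    # duplicate key
]


def not_null(column: str) -> Callable[[List[Row]], bool]:
    """Completeness: no missing values in the column."""
    return lambda rows: all(r[column] is not None for r in rows)


def unique(column: str) -> Callable[[List[Row]], bool]:
    """Uniqueness: no repeated values in the column."""
    return lambda rows: len({r[column] for r in rows}) == len(rows)


def non_negative(column: str) -> Callable[[List[Row]], bool]:
    """Validity: present values must be >= 0 (nulls are not_null's job)."""
    return lambda rows: all(r[column] is None or r[column] >= 0 for r in rows)


# The "contract": named checks the table must pass before it is published.
CONTRACT = {
    "fare_not_null": not_null("fare"),
    "trip_id_unique": unique("trip_id"),
    "fare_non_negative": non_negative("fare"),
}


def failed_checks(rows: List[Row]) -> List[str]:
    """Names of the contract checks the rows violate."""
    return [name for name, check in CONTRACT.items() if not check(rows)]


failures = failed_checks(ROWS)
```

Each check maps to one quality dimension, so a failure says which promise was broken and not just that "the data is bad".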

Week 7 — Batch Processing (Apache Spark)

Goal: process data that doesn’t fit on one machine.

Spark DataFrames + Spark SQL
GroupBy and Join internals (shuffles, partitions)
Run a job on the taxi dataset
Homework submitted
Write-up: “Reading a Spark execution plan without fear”

Re-derive: lazy evaluation, narrow vs. wide transforms, shuffle cost, skew.
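
Lazy evaluation and the narrow/wide split can be mimicked with generators and a hand-rolled shuffle. This is a single-process sketch under my own simplifications, not Spark's engine: the partitions, the byte-sum hash, and the function names are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple

Record = Tuple[str, float]  # (zone, fare)

# Hypothetical input, already split into two partitions.
PARTITIONS: List[List[Record]] = [
    [("A", 10.0), ("B", 5.0), ("A", 2.0)],
    [("B", 1.0), ("C", 4.0), ("A", 3.0)],
]

evaluated = []  # records which rows were actually computed


def add_surcharge(partition: List[Record]) -> Iterator[Record]:
    """Narrow transform: each output row needs only its own input row.
    As a generator it is lazy, so nothing runs until something consumes it."""
    for zone, fare in partition:
        evaluated.append(zone)
        yield zone, fare + 1.0


def shuffle(parts: List[Iterator[Record]], n_out: int) -> List[List[Record]]:
    """Wide transform, step 1: route every row to the partition that owns its key."""
    out: List[List[Record]] = [[] for _ in range(n_out)]
    for part in parts:
        for zone, fare in part:
            out[sum(zone.encode()) % n_out].append((zone, fare))
    return out


def sum_by_zone(partition: List[Record]) -> Dict[str, float]:
    """Wide transform, step 2: after the shuffle each key lives in one partition."""
    totals: Dict[str, float] = defaultdict(float)
    for zone, fare in partition:
        totals[zone] += fare
    return dict(totals)


lazy_parts = [add_surcharge(p) for p in PARTITIONS]  # builds the plan only
evaluated_before_action = len(evaluated)
results = [sum_by_zone(p) for p in shuffle(lazy_parts, n_out=2)]  # the "action"
```

The shuffle is the only step where rows cross partition boundaries, which is why it dominates cost on a real cluster. Skew shows up here too: zone "A" alone puts three of the six rows in one output partition.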

Week 8 — Streaming (Kafka) + Capstone kickoff

Goal: move from batch to unbounded data; scope the capstone.

Kafka producers/consumers; topics & partitions
Kafka Streams / KSQL
Avro + schema registry
Homework submitted
Capstone proposal drafted
Write-up: “Exactly-once is a lie I now understand”

Re-derive: log-based messaging, partitions & ordering, delivery semantics, windowing.
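
Two of these ideas fit in a short Python sketch: key-based partitioning (which is what preserves per-key order) and tumbling windows. This is an in-memory illustration, not a Kafka client; CRC32 stands in for Kafka's real partitioner, and the events are made up.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple

Event = Tuple[str, int, int]  # (key, event_time_seconds, sequence_number)

# Hypothetical events; per key they are produced in sequence order.
EVENTS: List[Event] = [
    ("taxi-1", 0, 1), ("taxi-2", 5, 1), ("taxi-1", 30, 2),
    ("taxi-2", 61, 2), ("taxi-1", 65, 3), ("taxi-3", 119, 1),
]


def partition_for(key: str, n_partitions: int) -> int:
    """Stable key hash. CRC32 is a stand-in here, not Kafka's partitioner."""
    return zlib.crc32(key.encode()) % n_partitions


def produce(events: List[Event], n_partitions: int) -> Dict[int, List[Event]]:
    """Append each event to the log of the partition its key maps to."""
    logs: Dict[int, List[Event]] = defaultdict(list)
    for event in events:
        logs[partition_for(event[0], n_partitions)].append(event)
    return dict(logs)


def tumbling_counts(events: List[Event], size_s: int) -> Dict[Tuple[str, int], int]:
    """Count events per key in fixed, non-overlapping windows of size_s seconds."""
    counts: Dict[Tuple[str, int], int] = defaultdict(int)
    for key, ts, _ in events:
        window_start = (ts // size_s) * size_s
        counts[(key, window_start)] += 1
    return dict(counts)


logs = produce(EVENTS, n_partitions=3)
counts = tumbling_counts(EVENTS, size_s=60)
```

Ordering is only guaranteed inside a partition, so everything for one key must hash to the same partition. Across partitions there is no global order to rely on.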

Week 9+ — Capstone Project

Goal: one end-to-end pipeline that uses the whole stack and gets peer-reviewed.

Pick a dataset + a real question
Batch (or streaming) ingestion → lake → warehouse
IaC + orchestration + dbt models + a dashboard
README, diagram, reproducible setup
Submit for peer review
Write-up: “What 8 weeks of relearning changed about how I build”
