The dare: rebuild data engineering from first principles in 8 weeks, ~10–15 hrs/week, with today’s tooling — and write up every step.
Start date: 2026-06-01 · Target finish: 2026-07-26 · Cohort reference: DataTalksClub DE Zoomcamp
Progress at a glance
| Wk | Focus | Tools | Notes | Write-up | Status |
|---|---|---|---|---|---|
| 1 | Containers & IaC | Docker, Compose, Postgres, Terraform, GCP | notes | — | 🔜 |
| 2 | Workflow orchestration | Kestra, data lakes | — | — | ⬜ |
| 3 | Data ingestion workshop | dlt, REST APIs, incremental loads | — | — | ⬜ |
| 4 | Data warehousing | BigQuery, partitioning, clustering, BQML | — | — | ⬜ |
| 5 | Analytics engineering | dbt, DuckDB, BigQuery, tests, docs | — | — | ⬜ |
| 6 | Data platform end-to-end | Bruin, data quality | — | — | ⬜ |
| 7 | Batch processing | Apache Spark, DataFrames, SQL | — | — | ⬜ |
| 8 | Streaming + capstone kickoff | Kafka, Kafka Streams, KSQL, Avro | — | — | ⬜ |
| 9+ | Capstone project | everything above | — | — | ⬜ |
Legend: ✅ done · 🟡 in progress · 🔜 next · ⬜ not started
The senior’s angle
I’m not learning these concepts cold — I’m re-deriving them and stress-testing what I think I know. For each module the lens is:
- Build it the course way — no shortcuts, do the homework.
- What’s genuinely new? — Kestra, dlt, DuckDB, and Bruin didn’t exist (or weren’t mainstream) when I learned this. Note what’s changed.
- Where would this break at scale? — connect each toy pipeline back to production reality.
- Teach it back — the weekly write-up is the test. If I can’t explain it simply, I didn’t relearn it.
Week 1 — Containerization & Infrastructure as Code
Goal: a reproducible local data stack and cloud infra defined as code.
- Dockerize a Postgres + ingestion script; load NYC taxi data
- Compose the stack (Postgres + pgAdmin) with `docker compose`
- GCP project + service account + IAM (least privilege)
- Terraform: GCS bucket + BigQuery dataset, `plan`/`apply`/`destroy`
- Homework submitted
- Write-up: “What Terraform state actually buys you”
Re-derive: image layers & caching, why Compose networks let containers resolve by service name, Terraform’s plan/apply/state lifecycle.
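A toy model of the plan/apply/state lifecycle, to make the re-derivation concrete. This is a sketch in plain Python, not Terraform's actual engine: the resource names and attributes are hypothetical, and `apply` only updates an in-memory dict where a real provider would call cloud APIs.

```python
from dataclasses import dataclass, field


@dataclass
class Plan:
    create: dict = field(default_factory=dict)
    update: dict = field(default_factory=dict)
    delete: list = field(default_factory=list)


def plan(desired: dict, state: dict) -> Plan:
    """Diff the desired config against recorded state without changing anything."""
    p = Plan()
    for name, attrs in desired.items():
        if name not in state:
            p.create[name] = attrs
        elif state[name] != attrs:
            p.update[name] = attrs
    p.delete = sorted(name for name in state if name not in desired)
    return p


def apply(p: Plan, state: dict) -> dict:
    """Return the new state after executing the plan (cloud calls omitted)."""
    new_state = {k: v for k, v in state.items() if k not in p.delete}
    new_state.update(p.create)
    new_state.update(p.update)
    return new_state


# Hypothetical resource names and attributes, for illustration only.
desired = {
    "gcs_bucket.lake": {"location": "EU", "storage_class": "STANDARD"},
    "bq_dataset.taxi": {"location": "EU"},
}
state = {}
first = plan(desired, state)    # everything needs creating
state = apply(first, state)
second = plan(desired, state)   # state matches config, so the plan is empty
teardown = plan({}, state)      # "destroy" is a plan against an empty config
```

The point of the sketch is that state is what makes the second plan empty: without a record of what already exists, every run would look like a fresh create.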
Week 2 — Workflow Orchestration (Kestra)
Goal: schedule and orchestrate the ingestion pipeline; land data in a lake.
- Kestra up via Docker; first declarative (YAML) flow
- Parameterized + scheduled flow; backfills
- Load taxi data → GCS (data lake) → BigQuery
- Homework submitted
- Write-up: “Kestra vs. the Airflow muscle memory”
Re-derive: idempotency, scheduling vs. event triggers, DAG semantics, backfill correctness.
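Idempotency and backfill correctness can be sketched without an orchestrator. The example below assumes a per-day task that replaces its own partition inside one transaction; SQLite stands in for the warehouse, and `fake_extract` and the `trips` table are hypothetical.

```python
import sqlite3
from datetime import date, timedelta


def load_day(conn, day, fares):
    """Replace one day's partition atomically, so reruns never duplicate rows."""
    with conn:  # one transaction: the delete and insert commit together or not at all
        conn.execute("DELETE FROM trips WHERE pickup_date = ?", (day.isoformat(),))
        conn.executemany(
            "INSERT INTO trips (pickup_date, fare) VALUES (?, ?)",
            [(day.isoformat(), fare) for fare in fares],
        )


def backfill(conn, start, end, extract):
    """Run the same per-day task for every day in [start, end]."""
    day = start
    while day <= end:
        load_day(conn, day, extract(day))
        day += timedelta(days=1)


def fake_extract(day):
    # Hypothetical stand-in for reading one day of source data.
    return [10.0 + day.day, 20.0 + day.day]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (pickup_date TEXT, fare REAL)")
backfill(conn, date(2024, 1, 1), date(2024, 1, 3), fake_extract)
backfill(conn, date(2024, 1, 2), date(2024, 1, 3), fake_extract)  # overlapping rerun
row_count = conn.execute("SELECT COUNT(*) FROM trips").fetchone()[0]
```

Because each run is keyed by its logical date and overwrites only that date, the overlapping rerun leaves the table with the same six rows as a single clean run.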
Week 3 — Data Ingestion Workshop (dlt)
Goal: robust, scalable ingestion from APIs.
- Consume a paginated REST API with `dlt`
- Schema inference & normalization into nested tables
- Incremental / merge loads (only new or changed rows)
- Homework submitted
- Write-up: “Incremental loading patterns, ranked”
Re-derive: full vs. incremental vs. CDC, idempotent upserts, schema evolution.
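A minimal sketch of cursor-based incremental loading with an idempotent upsert. It deliberately avoids the dlt API and uses only the standard library: `SOURCE` and `fetch_page` are hypothetical stand-ins for a paginated REST endpoint, and SQLite stands in for the destination.

```python
import sqlite3

# Hypothetical source records; a real API would sit behind fetch_page.
SOURCE = [
    {"id": 1, "name": "a", "updated_at": "2024-01-01"},
    {"id": 2, "name": "b", "updated_at": "2024-01-02"},
    {"id": 3, "name": "c", "updated_at": "2024-01-03"},
]


def fetch_page(since, page, page_size=2):
    """Stand-in for a paginated GET; returns [] once the pages run out."""
    fresh = sorted(
        (r for r in SOURCE if r["updated_at"] > since),
        key=lambda r: r["updated_at"],
    )
    return fresh[page * page_size:(page + 1) * page_size]


def paginate(since):
    page = 0
    while True:
        rows = fetch_page(since, page)
        if not rows:
            return
        yield from rows
        page += 1


def incremental_load(conn):
    """Pull rows newer than the stored cursor and upsert them by primary key."""
    since = conn.execute(
        "SELECT COALESCE(MAX(updated_at), '') FROM items"
    ).fetchone()[0]
    loaded = 0
    with conn:
        for row in paginate(since):
            conn.execute(
                "INSERT INTO items (id, name, updated_at) "
                "VALUES (:id, :name, :updated_at) "
                "ON CONFLICT(id) DO UPDATE SET "
                "name = excluded.name, updated_at = excluded.updated_at",
                row,
            )
            loaded += 1
    return loaded


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT)")
first = incremental_load(conn)    # initial load picks up everything
second = incremental_load(conn)   # nothing new since the cursor
SOURCE[0] = {"id": 1, "name": "a2", "updated_at": "2024-01-04"}
third = incremental_load(conn)    # only the changed row is fetched and merged
```

One known weakness of this pattern: a strict `>` comparison on the cursor can miss rows that share the cursor's exact timestamp, which is why the upsert matters. With an idempotent merge, re-reading a small overlap window is safe.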
Week 4 — Data Warehousing (BigQuery)
Goal: model for cost and speed in a columnar warehouse.
- External vs. native tables
- Partitioning + clustering — measure bytes scanned before/after
- Query cost & performance tuning
- Touch BigQuery ML (`CREATE MODEL`)
- Homework submitted
- Write-up: “Partitioning vs. clustering: when each actually helps”
Re-derive: columnar storage, why pruning beats indexing here, on-demand (bytes scanned) vs. slot-based pricing.
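A toy model of why partition pruning cuts bytes scanned. This is not how BigQuery stores data; it assumes a fixed row size (`ROW_BYTES`) and per-day partitions purely to make the arithmetic visible.

```python
from collections import defaultdict

ROW_BYTES = 100  # assumed fixed row size, purely for the arithmetic


class Table:
    """Toy table: rows grouped into per-day partitions, or one big partition."""

    def __init__(self, partitioned):
        self.partitioned = partitioned
        self.partitions = defaultdict(list)

    def insert(self, row):
        key = row["pickup_date"] if self.partitioned else "__all__"
        self.partitions[key].append(row)

    def query(self, pickup_date):
        """Return (matching rows, bytes scanned) for a single-day filter."""
        if self.partitioned:
            scanned = self.partitions.get(pickup_date, [])  # other days are pruned
        else:
            scanned = [r for rows in self.partitions.values() for r in rows]
        hits = [r for r in scanned if r["pickup_date"] == pickup_date]
        return hits, len(scanned) * ROW_BYTES


# 30 days x 1,000 synthetic rows per day.
rows = [
    {"pickup_date": f"2024-01-{d:02d}", "fare": float(i)}
    for d in range(1, 31)
    for i in range(1000)
]
flat, parted = Table(partitioned=False), Table(partitioned=True)
for r in rows:
    flat.insert(r)
    parted.insert(r)

flat_hits, flat_bytes = flat.query("2024-01-15")
parted_hits, parted_bytes = parted.query("2024-01-15")
```

Both queries return the same 1,000 rows, but the unpartitioned table scans all 30 days to find them. In this setup the ratio is exactly the number of partitions; real savings depend on how selective the filter is.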
Week 5 — Analytics Engineering (dbt)
Goal: turn raw tables into tested, documented, deployable models.
- dbt project against DuckDB locally, then BigQuery
- Staging → marts layering; sources, refs, seeds
- Tests (generic + singular) and docs site
- Deployment / scheduled run
- Homework submitted
- Write-up: “dbt as the discipline I should’ve always had”
Re-derive: ELT vs. ETL, dimensional modeling, DAG of refs, test-as-contract.
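The "DAG of refs" idea can be sketched in a few lines: extract each model's `ref()` calls, then topologically sort. This is an illustration of the concept, not dbt's parser; the model names and SQL are hypothetical, and the regex only handles the simple single-argument `ref('name')` form.

```python
import re
from graphlib import TopologicalSorter

# Matches {{ ref('model_name') }} with either quote style.
REF = re.compile(r"\{\{\s*ref\(\s*['\"](\w+)['\"]\s*\)\s*\}\}")

# Hypothetical model names and SQL, for illustration only.
MODELS = {
    "mart_revenue": "select zone, sum(fare) from {{ ref('fct_trips') }} group by 1",
    "fct_trips": (
        "select * from {{ ref('stg_trips') }} t "
        "join {{ ref('stg_zones') }} z on t.zone_id = z.id"
    ),
    "stg_trips": "select * from raw.trips",
    "stg_zones": "select * from raw.zones",
}


def build_order(models):
    """Map each model to the models it refs, then return a valid build order."""
    graph = {name: set(REF.findall(sql)) for name, sql in models.items()}
    return list(TopologicalSorter(graph).static_order())


order = build_order(MODELS)
print(order)  # staging models first, mart_revenue last
```

`TopologicalSorter` raises `graphlib.CycleError` if two models ref each other, which is the same class of failure a circular dependency causes in a real project.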
Week 6 — Data Platform End-to-End (Bruin)
Goal: one tool, full pipeline — ingest, transform, quality, deploy.
- Build an end-to-end Bruin pipeline to BigQuery
- Built-in data quality checks
- Cloud deployment
- Homework submitted
- Write-up: “Where an all-in-one platform helps vs. best-of-breed”
Re-derive: data quality dimensions, contracts, the build-vs-buy line.
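A sketch of what "data quality checks as a contract" boils down to: named assertions over rows, with failures reported per check. This is plain Python, not Bruin's check syntax; the column names, sample rows, and the 0 to 500 fare range are assumptions.

```python
def check_not_null(rows, column):
    """Completeness: indexes of rows where the column is missing or None."""
    return [i for i, r in enumerate(rows) if r.get(column) is None]


def check_unique(rows, column):
    """Uniqueness: indexes of rows that repeat an earlier value."""
    seen, dupes = set(), []
    for i, r in enumerate(rows):
        value = r.get(column)
        if value in seen:
            dupes.append(i)
        seen.add(value)
    return dupes


def check_range(rows, column, low, high):
    """Validity: indexes of non-null values outside [low, high]."""
    return [
        i for i, r in enumerate(rows)
        if r.get(column) is not None and not (low <= r[column] <= high)
    ]


def run_checks(rows, checks):
    """Run every check; return {check_name: failing row indexes} for failures only."""
    report = {}
    for name, fn, args in checks:
        failures = fn(rows, *args)
        if failures:
            report[name] = failures
    return report


# Hypothetical sample rows and thresholds.
rows = [
    {"trip_id": 1, "fare": 12.5},
    {"trip_id": 2, "fare": None},
    {"trip_id": 2, "fare": -3.0},
]
checks = [
    ("trip_id_unique", check_unique, ("trip_id",)),
    ("fare_not_null", check_not_null, ("fare",)),
    ("fare_in_range", check_range, ("fare", 0, 500)),
]
report = run_checks(rows, checks)
```

An empty report means the contract holds. Whether a non-empty report should block the pipeline or only warn is the design decision a platform makes for you, and one worth noting when comparing all-in-one against best-of-breed.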
Week 7 — Batch Processing (Apache Spark)
Goal: process data that doesn’t fit on one machine.
- Spark DataFrames + Spark SQL
- GroupBy and Join internals (shuffles, partitions)
- Run a job on the taxi dataset
- Homework submitted
- Write-up: “Reading a Spark execution plan without fear”
Re-derive: lazy evaluation, narrow vs. wide transforms, shuffle cost, skew.
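To make narrow vs. wide transforms concrete, here is a single-process simulation of a grouped count: a local pre-aggregation per input partition (narrow), then a shuffle that routes each key to the partition that owns it (wide). It is a sketch of the idea, not Spark's implementation; the pickup codes and partition count are made up.

```python
import zlib
from collections import Counter, defaultdict


def partition_of(key, num_partitions):
    # crc32 is stable across runs, unlike Python's salted hash() for str.
    return zlib.crc32(key.encode()) % num_partitions


def group_count(input_partitions, num_partitions):
    """Count rows per key: map-side combine, shuffle by key hash, then reduce."""
    # Narrow step: each input partition pre-aggregates locally, no data movement.
    partials = [Counter(rows) for rows in input_partitions]

    # Wide step (shuffle): each partial count goes to the partition owning its key.
    shuffled = defaultdict(list)
    for partial in partials:
        for key, count in partial.items():
            shuffled[partition_of(key, num_partitions)].append((key, count))

    # Reduce: each output partition sums the partials it received.
    result = {}
    for p in range(num_partitions):
        totals = Counter()
        for key, count in shuffled[p]:
            totals[key] += count
        result[p] = dict(totals)
    return result


# Hypothetical pickup locations spread across three input partitions.
input_partitions = [["JFK", "JFK", "LGA"], ["JFK", "EWR"], ["LGA", "JFK"]]
out = group_count(input_partitions, num_partitions=2)
merged = {k: v for part in out.values() for k, v in part.items()}
```

Every key ends up in exactly one output partition, which is what makes the final sums correct and also what makes skew painful: for operations that cannot be pre-combined, such as joins, a hot key sends all of its rows to a single partition.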
Week 8 — Streaming (Kafka) + Capstone kickoff
Goal: move from batch to unbounded data; scope the capstone.
- Kafka producers/consumers; topics & partitions
- Kafka Streams / KSQL
- Avro + schema registry
- Homework submitted
- Capstone proposal drafted
- Write-up: “Exactly-once is a lie I now understand”
Re-derive: log-based messaging, partitions & ordering, delivery semantics, windowing.
Week 9+ — Capstone Project
Goal: one end-to-end pipeline that uses the whole stack and gets peer-reviewed.
- Pick a dataset + a real question
- Batch (or streaming) ingestion → lake → warehouse
- IaC + orchestration + dbt models + a dashboard
- README, diagram, reproducible setup
- Submit for peer review
- Write-up: “What 8 weeks of relearning changed about how I build”