
Heuristic:DataTalksClub Data engineering zoomcamp DuckDB OOM Memory Management

From Leeroopedia




Knowledge Sources
Domains: Analytics_Engineering, Optimization
Last Updated: 2026-02-09 07:00 GMT

  • 04-analytics-engineering/setup/duckdb_troubleshooting.md
  • 04-analytics-engineering/setup/local_setup.md

Overview

Memory optimization strategies for avoiding DuckDB Out of Memory errors when running dbt builds on large NYC taxi datasets, covering memory_limit tuning, thread reduction, and selective model building.

Description

DuckDB is an in-process database that runs entirely within the host machine's RAM. When dbt builds models that process tens of millions of taxi trip records, operations like `QUALIFY` with window functions, `UNION ALL` on large tables, surrogate key generation, and `JOIN` on fact tables can exhaust available memory. This heuristic captures the tuning strategies, recovery techniques, and best practices documented in the repository for managing DuckDB memory pressure.

Usage

Use this heuristic when you are experiencing Out of Memory errors during `dbt build` with DuckDB, or when setting up a new local analytics engineering environment to proactively prevent OOM issues. Apply these settings especially on machines with less than 16GB RAM.

The Insight (Rule of Thumb)

  • Action 1: Set `memory_limit` in `profiles.yml` to roughly 50% of total system RAM (not the default 80%).
    • Value: 4GB RAM machine -> `'1GB'`; 8GB machine -> `'4GB'`; 16GB+ machine -> default is fine.
    • Trade-off: Lower limit means slower builds but prevents OOM kills by leaving room for OS and other applications.
  • Action 2: Set `threads: 1` to build one model at a time.
    • Value: `threads: 1` in `profiles.yml`.
    • Trade-off: Slower parallel execution but significantly lower peak memory usage.
  • Action 3: Set `preserve_insertion_order: false` to skip maintaining row order.
    • Value: Boolean flag in DuckDB settings.
    • Trade-off: Row order may differ between runs, but saves memory during table writes.
  • Action 4: Use `dbt retry` after a failure to continue from the last checkpoint.
    • Value: Run `dbt retry` instead of `dbt build` after OOM.
    • Trade-off: None. Skips already-succeeded models, reducing total memory pressure.
  • Action 5: Build models selectively with `dbt build --select <model_name> --target prod`.
    • Value: One model at a time in sequence.
    • Trade-off: Requires manual invocation, but DuckDB only holds one model's working set in memory at a time.
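The 50% rule of thumb from Action 1 can be computed directly from the machine's reported RAM. A minimal sketch, assuming a Linux host where `/proc/meminfo` reports `MemTotal` in kB (the clamp to 1GB follows the 4GB-machine guidance above):

```shell
# Derive a memory_limit value (~50% of physical RAM) for profiles.yml.
total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
half_gb=$(( total_kb / 1024 / 1024 / 2 ))
# Never go below 1GB, matching the low-RAM recommendation above.
if [ "$half_gb" -lt 1 ]; then half_gb=1; fi
echo "memory_limit: '${half_gb}GB'"
```

Paste the printed value into the `settings:` block of `profiles.yml`.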

Reasoning

DuckDB runs in-process and its default configuration claims 80% of system RAM. On a machine running a browser, IDE, and other applications, the OS may kill the DuckDB process when combined memory exceeds physical RAM. The specific operations that cause peak memory in this project are:

| Operation | Why Expensive | Where |
|-----------|---------------|-------|
| `QUALIFY` with window functions | Sorts and partitions the entire dataset in memory | `int_trips.sql` |
| `UNION ALL` on large tables | Combines two large datasets into one | `int_trips_unioned.sql` |
| Surrogate key generation | Computes hashes across the full dataset | `int_trips.sql` |
| `JOIN` on large fact tables | Expands the memory footprint during enrichment | `fct_trips.sql` |
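The `QUALIFY` pattern referenced above is the kind of query that forces a full in-memory sort and partition. A hedged sketch of the deduplication shape (the table and column names are illustrative, not taken from `int_trips.sql`):

```sql
-- Illustrative only: keep one row per trip key.
-- The window function must partition and order the whole dataset,
-- which is why this operation drives peak memory usage.
SELECT *
FROM trips
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY vendor_id, pickup_datetime, dropoff_datetime
    ORDER BY pickup_datetime
) = 1;
```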

RAM guidelines from the troubleshooting documentation:

  • 4 GB RAM: Will very likely hit OOM. Use GitHub Codespaces or Cloud Setup instead.
  • 8 GB RAM: May hit OOM on some models. Adjust memory settings.
  • 16+ GB RAM: Should be fine with default settings.

Additional best practices:

  • Close other applications before running `dbt build` to free RAM.
  • Use an SSD: DuckDB spills to disk when memory is exhausted; SSDs make spill operations much faster.
  • Avoid Docker: Docker containers impose memory limits that may be lower than system RAM. Run directly on host if possible.
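Because DuckDB spills to disk when the memory limit is reached, the spill location can be pointed at fast storage explicitly. A minimal sketch extending the `settings:` block shown below, using DuckDB's `temp_directory` setting (the path is illustrative):

```yaml
settings:
  memory_limit: '2GB'
  preserve_insertion_order: false
  # Direct spill files to a directory on a fast SSD (illustrative path).
  temp_directory: '/tmp/duckdb_spill'
```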

Code Evidence

DuckDB profile configuration from `local_setup.md:48-79`:

taxi_rides_ny:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: taxi_rides_ny.duckdb
      schema: dev
      threads: 1
      extensions:
        - parquet
      settings:
        memory_limit: '2GB'
        preserve_insertion_order: false

Troubleshooting comment from `local_setup.md:76-79`:

# Troubleshooting:
# - If you have less than 4GB RAM, try setting memory_limit to '1GB'
# - If you have 16GB+ RAM, you can increase to '4GB' for faster builds
# - Expected build time: 5-10 minutes on most systems

Selective model building from `duckdb_troubleshooting.md:74-80`:

dbt build --select stg_yellow_tripdata --target prod
dbt build --select stg_green_tripdata --target prod
dbt build --select int_trips_unioned --target prod
dbt build --select int_trips --target prod
dbt build --select fct_trips --target prod
