
Principle:DataExpert io Data engineer handbook DataFrame Write To Table

From Leeroopedia


Overview

DataFrame Write To Table is the principle of persisting Spark DataFrame contents to catalog-managed tables (such as Iceberg or Hive tables). This is the final step in most ETL pipelines, where transformation results are materialized for downstream consumption.

Theory

Spark's DataFrameWriter controls table writes along two axes, each with distinct semantics: the write mode, which determines how existing data is handled, and the write method (insertInto versus saveAsTable), which determines how columns are matched and whether the table is created.

Write Modes

The mode() method controls how existing data in the target table is handled:

  • overwrite - replaces all existing data in the target table (or partition) with the contents of the DataFrame. This is appropriate for full-refresh patterns or idempotent reruns.
  • append - adds new rows to the target table without modifying existing data. This is appropriate for incremental loads.
  • error / errorifexists - (default) raises an exception if the target table already exists, preventing accidental overwrites.
  • ignore - silently skips the write if the target table already exists, similar to CREATE TABLE IF NOT EXISTS.

insertInto vs saveAsTable

Spark offers two methods for writing to tables, with important differences:

  • insertInto(tableName) - writes data to an existing table by column position (not by name). The DataFrame columns must match the table schema in order and type. This is the preferred method when the table already exists and the schema is stable.
  • saveAsTable(tableName) - creates the table if it does not exist, or replaces/appends based on the mode. It matches columns by name. This is useful for initial table creation but may have unintended side effects on existing tables.

When to Apply

DataFrame write operations are used when:

  • Persisting transformation results to Iceberg or Hive catalog-managed tables
  • Completing the final step of an ETL pipeline
  • Writing intermediate results for downstream jobs to consume
  • Materializing views or aggregations for query performance
