
Environment:Evidentlyai Evidently Spark Engine Environment

From Leeroopedia
Knowledge Sources
Domains Infrastructure, Big_Data
Last Updated 2026-02-14 10:00 GMT

Overview

An optional environment that adds PySpark support, enabling Evidently data drift calculations to run on distributed Spark DataFrames.

Description

This environment enables the `SparkEngine` calculation backend, allowing Evidently to run data drift metrics directly on PySpark DataFrames without converting them to pandas. This is critical for large-scale datasets that cannot fit in memory on a single machine. The Spark engine supports a subset of Evidently metrics, primarily data drift detection using statistical tests adapted for Spark.

Usage

Use this environment when running data drift detection on large distributed datasets that are stored as Spark DataFrames. Required when data volume exceeds what pandas can handle in memory.
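A minimal sketch of invoking the Spark engine, assuming Evidently's legacy `Report` API and a live SparkSession. The import paths and the `engine=` parameter follow the legacy interface and may differ between Evidently releases, so verify them against your installed version; imports are deferred so the function can be defined without `evidently[spark]` present.

```python
def run_spark_drift_report(reference_df, current_df):
    # Deferred imports: evidently[spark] and a JVM must be available at call time.
    # Module paths assume the legacy namespace and may vary across versions.
    from evidently.legacy.report import Report
    from evidently.legacy.metric_preset import DataDriftPreset
    from evidently.legacy.spark.engine import SparkEngine

    report = Report(metrics=[DataDriftPreset()])
    # Passing the Spark engine keeps computation on the cluster; both inputs
    # must be pyspark.sql.DataFrame objects (see Code Evidence below)
    report.run(
        reference_data=reference_df,
        current_data=current_df,
        engine=SparkEngine,
    )
    return report
```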

System Requirements

| Category | Requirement | Notes |
| --- | --- | --- |
| OS | Linux, macOS | Windows requires WSL for Spark |
| Python | >= 3.10 | Same as the core environment |
| Java | JDK 8 or 11 | Required by the Apache Spark runtime |
| Hardware | Cluster or single node | Spark can run locally for testing |

Dependencies

System Packages

  • Java Development Kit (JDK 8 or 11)

Python Packages

  • `pyspark` >= 3.4.0, < 4

Credentials

No additional credentials required beyond the core environment. Spark cluster authentication is handled by the Spark configuration.

Quick Install

# Install Evidently with Spark support (quotes keep the extra from being
# interpreted as a shell glob, e.g. in zsh)
pip install "evidently[spark]"

Code Evidence

Spark DataFrame type validation from `src/evidently/legacy/spark/engine.py:104-108`:

def convert_input_data(self, data: GenericInputData) -> SparkInputData:
    if not isinstance(data.current_data, SparkDataFrame) or (
        data.reference_data is not None and not isinstance(data.reference_data, SparkDataFrame)
    ):
        raise ValueError("SparkEngine works only with pyspark.sql.DataFrame input data")

Generated features not supported from `src/evidently/legacy/spark/engine.py:131-133`:

def calculate_additional_features(self, data, features, options):
    if len(features) > 0:
        raise NotImplementedError("SparkEngine does not support generated features yet")
    return {}

PySpark type compatibility fallbacks from `src/evidently/legacy/spark/utils.py:28-42`:

# Base types (imported near the top of utils.py) serve as the fallbacks
from pyspark.sql.types import StringType, TimestampType

try:
    from pyspark.sql.types import CharType
except ImportError:
    CharType = StringType

try:
    from pyspark.sql.types import VarcharType
except ImportError:
    VarcharType = StringType

try:
    from pyspark.sql.types import TimestampNTZType
except ImportError:
    TimestampNTZType = TimestampType
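The pattern above is a standard compatibility shim: bind the newer name when the installed version provides it, otherwise alias an older equivalent. The same idea, shown with a standard-library name so it runs without PySpark (`math.cbrt` was added in Python 3.11):

```python
# Same fallback pattern with a stdlib name: prefer math.cbrt when the
# interpreter provides it, otherwise define an equivalent substitute.
try:
    from math import cbrt  # available on Python >= 3.11
except ImportError:
    def cbrt(x: float) -> float:
        # Fallback for older interpreters; adequate for non-negative inputs
        return x ** (1.0 / 3.0)

print(cbrt(27.0))  # -> 3.0 (up to floating-point rounding)
```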

Common Errors

| Error Message | Cause | Solution |
| --- | --- | --- |
| `ValueError: SparkEngine works only with pyspark.sql.DataFrame input data` | A pandas DataFrame was passed to the SparkEngine | Use PySpark DataFrames, or switch to the default PythonEngine |
| `NotImplementedError: SparkEngine does not support generated features yet` | Descriptors/generated features were used with the Spark engine | Use the PythonEngine for features, or run feature generation separately |
| `NotImplementedError: '{test}' is not implemented for SparkEngine` | The chosen statistical test has no Spark implementation | Use a Spark-supported stat test (data drift tests have Spark implementations) |
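One way to handle the `NotImplementedError` cases is a small fallback wrapper (a hypothetical helper, not part of Evidently): try the Spark-backed computation first, and rerun via pandas only when the engine reports the metric as unimplemented.

```python
# Hypothetical fallback wrapper: run the Spark path, and only when the
# engine raises NotImplementedError, rerun through the pandas path.
def run_with_fallback(spark_fn, pandas_fn):
    try:
        return spark_fn()
    except NotImplementedError:
        return pandas_fn()

# Usage with stub callables standing in for real report runs:
def spark_run():
    raise NotImplementedError("'t_test' is not implemented for SparkEngine")

result = run_with_fallback(spark_run, lambda: "computed with PythonEngine")
print(result)  # -> computed with PythonEngine
```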

Compatibility Notes

  • Limited metric support: Only data drift metrics have Spark implementations. Classification, regression, and data quality metrics require the PythonEngine (pandas).
  • No generated features: The SparkEngine does not support generated features or descriptors. These must be computed separately using pandas.
  • PySpark version compatibility: The code handles missing types (`CharType`, `VarcharType`, `TimestampNTZType`) in older PySpark versions by falling back to base types.
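Because only drift metrics run on Spark, the remaining metrics need pandas input. A common bridge (a hypothetical helper, assuming a live SparkSession) is to down-sample on the cluster before collecting to the driver:

```python
# Hypothetical bridge for PythonEngine-only metrics: truncate on the
# cluster with limit(), then collect the small result to the driver.
def sample_to_pandas(spark_df, n: int = 100_000):
    # limit() is evaluated distributedly; toPandas() materializes locally,
    # so keep n small enough to fit in driver memory
    return spark_df.limit(n).toPandas()
```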
