
Environment:Evidentlyai Evidently Spark Engine Environment

From Leeroopedia
Knowledge Sources
Domains Infrastructure, Big_Data
Last Updated 2026-02-14 10:00 GMT

Overview

An optional environment that adds PySpark support, enabling Evidently data drift calculations to run on distributed Spark DataFrames.

Description

This environment enables the `SparkEngine` calculation backend, allowing Evidently to run data drift metrics directly on PySpark DataFrames without converting them to pandas. This is critical for large-scale datasets that cannot fit in memory on a single machine. The Spark engine supports a subset of Evidently metrics, primarily data drift detection using statistical tests adapted for Spark.

Usage

Use this environment when running data drift detection on large distributed datasets that are stored as Spark DataFrames. Required when data volume exceeds what pandas can handle in memory.
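A minimal sketch of invoking the Spark engine, assuming Evidently's legacy `Report` API and a live SparkSession. The import paths and the `engine=` parameter follow the legacy interface and may differ between Evidently releases, so verify them against your installed version; imports are deferred so the function can be defined without `evidently[spark]` present.

```python
def run_spark_drift_report(reference_df, current_df):
    # Deferred imports: evidently[spark] and a JVM must be available at call time.
    # Module paths assume the legacy namespace and may vary across versions.
    from evidently.legacy.report import Report
    from evidently.legacy.metric_preset import DataDriftPreset
    from evidently.legacy.spark.engine import SparkEngine

    report = Report(metrics=[DataDriftPreset()])
    # Passing the Spark engine keeps computation on the cluster; both inputs
    # must be pyspark.sql.DataFrame objects (see Code Evidence below)
    report.run(
        reference_data=reference_df,
        current_data=current_df,
        engine=SparkEngine,
    )
    return report
```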

System Requirements

| Category | Requirement | Notes |
| --- | --- | --- |
| OS | Linux, macOS | Windows requires WSL for Spark |
| Python | >= 3.10 | Same as the core environment |
| Java | JDK 8 or 11 | Required by the Apache Spark runtime |
| Hardware | Cluster or single node | Spark can run locally for testing |

Dependencies

System Packages

  • Java Development Kit (JDK 8 or 11)

Python Packages

  • `pyspark` >= 3.4.0, < 4

Credentials

No additional credentials required beyond the core environment. Spark cluster authentication is handled by the Spark configuration.

Quick Install

# Install Evidently with Spark support (quotes keep the extra from being
# interpreted as a shell glob, e.g. in zsh)
pip install "evidently[spark]"

Code Evidence

Spark DataFrame type validation from `src/evidently/legacy/spark/engine.py:104-108`:

def convert_input_data(self, data: GenericInputData) -> SparkInputData:
    if not isinstance(data.current_data, SparkDataFrame) or (
        data.reference_data is not None and not isinstance(data.reference_data, SparkDataFrame)
    ):
        raise ValueError("SparkEngine works only with pyspark.sql.DataFrame input data")

Generated features not supported from `src/evidently/legacy/spark/engine.py:131-133`:

def calculate_additional_features(self, data, features, options):
    if len(features) > 0:
        raise NotImplementedError("SparkEngine does not support generated features yet")
    return {}

PySpark type compatibility fallbacks from `src/evidently/legacy/spark/utils.py:28-42`:

# Base types (imported near the top of utils.py) serve as the fallbacks
from pyspark.sql.types import StringType, TimestampType

try:
    from pyspark.sql.types import CharType
except ImportError:
    CharType = StringType

try:
    from pyspark.sql.types import VarcharType
except ImportError:
    VarcharType = StringType

try:
    from pyspark.sql.types import TimestampNTZType
except ImportError:
    TimestampNTZType = TimestampType
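The pattern above is a standard compatibility shim: bind the newer name when the installed version provides it, otherwise alias an older equivalent. The same idea, shown with a standard-library name so it runs without PySpark (`math.cbrt` was added in Python 3.11):

```python
# Same fallback pattern with a stdlib name: prefer math.cbrt when the
# interpreter provides it, otherwise define an equivalent substitute.
try:
    from math import cbrt  # available on Python >= 3.11
except ImportError:
    def cbrt(x: float) -> float:
        # Fallback for older interpreters; adequate for non-negative inputs
        return x ** (1.0 / 3.0)

print(cbrt(27.0))  # -> 3.0 (up to floating-point rounding)
```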

Common Errors

| Error Message | Cause | Solution |
| --- | --- | --- |
| `ValueError: SparkEngine works only with pyspark.sql.DataFrame input data` | A pandas DataFrame was passed to the SparkEngine | Use PySpark DataFrames, or switch to the default PythonEngine |
| `NotImplementedError: SparkEngine does not support generated features yet` | Descriptors/generated features were used with the Spark engine | Use the PythonEngine for features, or run feature generation separately |
| `NotImplementedError: '{test}' is not implemented for SparkEngine` | The chosen statistical test has no Spark implementation | Use a Spark-supported stat test (data drift tests have Spark implementations) |
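One way to handle the `NotImplementedError` cases is a small fallback wrapper (a hypothetical helper, not part of Evidently): try the Spark-backed computation first, and rerun via pandas only when the engine reports the metric as unimplemented.

```python
# Hypothetical fallback wrapper: run the Spark path, and only when the
# engine raises NotImplementedError, rerun through the pandas path.
def run_with_fallback(spark_fn, pandas_fn):
    try:
        return spark_fn()
    except NotImplementedError:
        return pandas_fn()

# Usage with stub callables standing in for real report runs:
def spark_run():
    raise NotImplementedError("'t_test' is not implemented for SparkEngine")

result = run_with_fallback(spark_run, lambda: "computed with PythonEngine")
print(result)  # -> computed with PythonEngine
```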

Compatibility Notes

  • Limited metric support: Only data drift metrics have Spark implementations. Classification, regression, and data quality metrics require the PythonEngine (pandas).
  • No generated features: The SparkEngine does not support generated features or descriptors. These must be computed separately using pandas.
  • PySpark version compatibility: The code handles missing types (`CharType`, `VarcharType`, `TimestampNTZType`) in older PySpark versions by falling back to base types.
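Because only drift metrics run on Spark, the remaining metrics need pandas input. A common bridge (a hypothetical helper, assuming a live SparkSession) is to down-sample on the cluster before collecting to the driver:

```python
# Hypothetical bridge for PythonEngine-only metrics: truncate on the
# cluster with limit(), then collect the small result to the driver.
def sample_to_pandas(spark_df, n: int = 100_000):
    # limit() is evaluated distributedly; toPandas() materializes locally,
    # so keep n small enough to fit in driver memory
    return spark_df.limit(n).toPandas()
```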
