Implementation:Ray project Ray Release Data Tests Config

From Leeroopedia
Knowledge Sources
Domains: Release, Testing, Data, Benchmarks
Last Updated: 2026-02-13 16:00 GMT

Overview

This file defines the release test configurations for Ray Data. It specifies benchmarks for reading, writing, aggregation, groupby, join, shuffle, sort, batch inference, training, iteration, TPC-H queries, cross-AZ fault tolerance, and autoscaling workloads.

Description

The release/release_data_tests.yaml file uses a YAML-based test definition format. A DEFAULTS block sets the working directory, frequency (nightly), team (data), and base cluster configuration (GPU image type with fixed-size CPU compute). Individual tests override these defaults and use matrix expansion to generate variants, for example fixed_size vs. autoscaling clusters, different data formats, and shuffle strategies such as sort_shuffle_pull_based and hash_shuffle. Each test specifies a cluster compute configuration, a timeout, and the Python script to run. Some tests also include chaos testing variations that terminate EC2 instances during execution.

Test groups include:

  • Reading: parquet, images, TFRecords, URIs
  • Writing: parquet
  • Aggregation: count
  • Groupby: aggregate and map_groups
  • Joins: inner, left_outer, right_outer, full_outer
  • Sorting and shuffling
  • Batch inference: image classification, image/text embeddings, multi-stage pipelines
  • Distributed training
  • Iteration: batches, TF, Torch
  • TPC-H queries: Q1, Q6
  • Cross-AZ RPC fault tolerance
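
The matrix expansion described above can be illustrated with a short Python sketch. It is not the Ray release framework's implementation. It assumes that every combination of the values under matrix.setup yields one concrete test and that {{name}} placeholders are replaced by plain string substitution. The helper names render and expand_matrix are hypothetical.

```python
import itertools
import re


def render(value, variables):
    """Recursively substitute {{name}} placeholders in strings, dicts, and lists."""
    if isinstance(value, str):
        return re.sub(r"\{\{(\w+)\}\}", lambda m: str(variables[m.group(1)]), value)
    if isinstance(value, dict):
        return {k: render(v, variables) for k, v in value.items()}
    if isinstance(value, list):
        return [render(v, variables) for v in value]
    return value


def expand_matrix(test):
    """Yield one concrete test per combination of matrix.setup values (assumed semantics)."""
    test = dict(test)  # shallow copy so the caller's template keeps its matrix key
    matrix = test.pop("matrix", None)
    if not matrix:
        yield test
        return
    setup = matrix["setup"]
    names = list(setup)
    for combo in itertools.product(*(setup[n] for n in names)):
        yield render(test, dict(zip(names, combo)))


# Template mirroring the read_parquet entry, written as a Python dict.
template = {
    "name": "read_parquet_{{scaling}}",
    "cluster": {"cluster_compute": "{{scaling}}_cpu_compute.yaml"},
    "matrix": {"setup": {"scaling": ["fixed_size", "autoscaling"]}},
}

expanded = list(expand_matrix(template))
for t in expanded:
    print(t["name"], "->", t["cluster"]["cluster_compute"])
```

Under this Cartesian-product assumption, a matrix with three two-valued variables (scaling, shuffle_strategy, columns) would produce 2 × 2 × 2 = 8 concrete tests.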

Usage

Release engineers and data team members modify this file when adding new benchmark tests, adjusting test timeouts, updating scale factors, adding support for new data formats, changing cluster configurations, or creating new chaos test variations. Tests run at a nightly, weekly, or manual frequency, depending on their resource requirements and stability.
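
As a rough illustration of how a frequency setting could determine which tests a scheduled run picks up, the sketch below filters test definitions by frequency and falls back to the nightly default when a test does not set one. The function select_by_frequency and the two test names prefixed with hypothetical_ are made up for this example; how the real runner selects tests is not shown in this page.

```python
def select_by_frequency(tests, frequency, default="nightly"):
    """Return names of tests whose frequency (or the default) matches the request."""
    return [t["name"] for t in tests if t.get("frequency", default) == frequency]


tests = [
    {"name": "read_parquet_fixed_size"},  # no override, so the default applies
    {"name": "hypothetical_weekly_benchmark", "frequency": "weekly"},
    {"name": "hypothetical_manual_benchmark", "frequency": "manual"},
]

nightly = select_by_frequency(tests, "nightly")
weekly = select_by_frequency(tests, "weekly")
print(nightly, weekly)
```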

Code Reference

Source Location

  • Repository: Ray
  • File: release/release_data_tests.yaml
  • Lines: 1-843

Signature

- name: DEFAULTS
  group: data-base
  working_dir: nightly_tests/dataset
  frequency: nightly
  team: data
  cluster:
    byod:
      runtime_env:
        - RAY_DATA_DEBUG_RESOURCE_MANAGER=1
      type: gpu
    cluster_compute: fixed_size_cpu_compute.yaml

###############
# Reading tests
###############
- name: "read_parquet_{{scaling}}"
  python: "3.10"
  cluster:
    cluster_compute: "{{scaling}}_cpu_compute.yaml"
  matrix:
    setup:
      scaling: [fixed_size, autoscaling]
  run:
    timeout: 3600
    script: >
      python read_and_consume_benchmark.py ...
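
How a test's overrides might combine with the DEFAULTS block can be illustrated with a small Python sketch. It assumes a recursive, key-by-key merge in which nested mappings such as cluster are merged rather than replaced wholesale. That merge rule is an assumption made for illustration, and deep_merge is a hypothetical helper, not a function from the release framework.

```python
import copy


def deep_merge(defaults, overrides):
    """Return a copy of defaults updated by overrides; nested dicts merge key by key."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# The DEFAULTS block from the excerpt, written as a Python dict.
DEFAULTS = {
    "group": "data-base",
    "working_dir": "nightly_tests/dataset",
    "frequency": "nightly",
    "team": "data",
    "cluster": {
        "byod": {
            "runtime_env": ["RAY_DATA_DEBUG_RESOURCE_MANAGER=1"],
            "type": "gpu",
        },
        "cluster_compute": "fixed_size_cpu_compute.yaml",
    },
}

# One already-expanded variant of the read_parquet test.
test = {
    "name": "read_parquet_autoscaling",
    "python": "3.10",
    "cluster": {"cluster_compute": "autoscaling_cpu_compute.yaml"},
    "run": {"timeout": 3600},
}

resolved = deep_merge(DEFAULTS, test)
print(resolved["cluster"]["cluster_compute"])  # the test's override
print(resolved["cluster"]["byod"]["type"])     # inherited from DEFAULTS
```

With this rule, the test only needs to state what differs from the defaults: here the compute config changes while the BYOD image type, team, and frequency are inherited.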

Import

This is a configuration file rather than an importable module. The Ray release test framework consumes it, and the release test runner infrastructure references it to define and schedule nightly, weekly, and manual benchmark tests for the data team.

I/O Contract

Inputs

Name | Type | Required | Description
S3 benchmark data | S3 paths | yes | Test data stored in the s3://ray-benchmark-data and s3://ray-benchmark-data-internal-us-west-2 buckets
Cluster compute configs | YAML files | yes | Cluster sizing definitions (e.g., fixed_size_cpu_compute.yaml, autoscaling_gpu_compute.yaml)
Benchmark scripts | Python files | yes | Test scripts in nightly_tests/dataset/ (e.g., read_and_consume_benchmark.py, sort_benchmark.py)
BYOD scripts | shell scripts | conditional | Post-build scripts such as byod_install_mosaicml.sh for specialized dependencies
Python dependency lockfiles | lockfiles | conditional | Pinned dependencies such as image_classification_py3.10.lock

Outputs

Name | Type | Description
Benchmark results | metrics | Performance metrics (throughput, latency) for each test
Test pass/fail status | boolean | Whether each benchmark completed within its timeout
Release readiness signal | aggregate | Overall data team release test status for go/no-go decisions
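
The relationship between per-test status and the aggregate signal can be sketched in Python. Everything here is an assumption made for illustration: the BenchmarkResult record, the rule that a test passes when it exits cleanly within its timeout, and the rule that the release is a "go" only if every test passes. The runtimes are made-up numbers; only the timeouts (3600 and 10800 seconds) come from the configuration.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkResult:
    """Hypothetical per-test record; not a type from the release framework."""
    name: str
    exit_code: int
    runtime_s: float
    timeout_s: float

    @property
    def passed(self) -> bool:
        # Assumed rule: clean exit and finished within the configured timeout.
        return self.exit_code == 0 and self.runtime_s <= self.timeout_s


def release_ready(results):
    """Return (ready, failed_names) under an assumed all-tests-must-pass policy."""
    failed = [r.name for r in results if not r.passed]
    return (not failed, failed)


results = [
    BenchmarkResult("read_parquet_fixed_size", 0, 1250.0, 3600),  # made-up runtime
    BenchmarkResult("random_shuffle_chaos", 0, 11000.0, 10800),   # made-up runtime, over timeout
]

ready, failed = release_ready(results)
print("go" if ready else "no-go", failed)
```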

Usage Examples

The file defines tests using matrix expansion and variations:

# Reading benchmark with fixed_size and autoscaling variants
- name: "read_parquet_{{scaling}}"
  python: "3.10"
  cluster:
    cluster_compute: "{{scaling}}_cpu_compute.yaml"
  matrix:
    setup:
      scaling: [fixed_size, autoscaling]
  run:
    timeout: 3600
    script: >
      python read_and_consume_benchmark.py
      s3://ray-benchmark-data-internal-us-west-2/imagenet/parquet
      --format parquet --iter-bundles

# Groupby benchmark with multiple shuffle strategies and column sets
- name: "aggregate_groups_{{scaling}}_{{shuffle_strategy}}_{{columns}}"
  matrix:
    setup:
      scaling: [fixed_size, autoscaling]
      shuffle_strategy: [sort_shuffle_pull_based, hash_shuffle]
      columns:
        - "column08 column13 column14"   # 84 groups
        - "column02 column14"            # 7M groups
  run:
    timeout: 3600
    script: >
      python groupby_benchmark.py --sf 100 --aggregate
      --group-by {{columns}} --shuffle-strategy {{shuffle_strategy}}

# Chaos test with EC2 instance termination during shuffle
- name: random_shuffle_chaos
  working_dir: nightly_tests
  cluster:
    cluster_compute: dataset/autoscaling_all_to_all_compute.yaml
  run:
    timeout: 10800
    prepare: >
      python setup_chaos.py --chaos TerminateEC2Instance
      --kill-interval 600 --max-to-kill 2
    script: >
      python dataset/sort_benchmark.py
      --num-partitions=1000 --partition-size=1e9 --shuffle

# Distributed training with chaos variation
- name: distributed_training
  cluster:
    cluster_compute: dataset/multi_node_train_16_workers.yaml
  run:
    script: >
      python dataset/multi_node_train_benchmark.py
      --num-workers 16 --file-type parquet --use-gpu
  variations:
    - __suffix__: regular
    - __suffix__: chaos
      run:
        prepare: >
          python setup_chaos.py --kill-interval 200 --max-to-kill 1
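
The variations block in the distributed_training example can be read as a second expansion step: each entry overlays its keys on the base test and contributes a suffix to the test name. The Python sketch below illustrates that reading. The recursive merge and the "." separator between name and suffix are assumptions, and expand_variations and deep_merge are hypothetical helpers rather than framework functions.

```python
import copy


def deep_merge(base, overrides):
    """Return a copy of base updated by overrides; nested dicts merge key by key."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def expand_variations(test, sep="."):
    """Produce one concrete test per variation; sep is an assumed naming convention."""
    base = {k: v for k, v in test.items() if k != "variations"}
    variations = test.get("variations")
    if not variations:
        return [copy.deepcopy(base)]
    expanded = []
    for variation in variations:
        overrides = {k: v for k, v in variation.items() if k != "__suffix__"}
        concrete = deep_merge(base, overrides)
        concrete["name"] = f"{base['name']}{sep}{variation['__suffix__']}"
        expanded.append(concrete)
    return expanded


# The distributed_training entry from the example, written as a Python dict.
distributed_training = {
    "name": "distributed_training",
    "cluster": {"cluster_compute": "dataset/multi_node_train_16_workers.yaml"},
    "run": {
        "script": "python dataset/multi_node_train_benchmark.py "
                  "--num-workers 16 --file-type parquet --use-gpu",
    },
    "variations": [
        {"__suffix__": "regular"},
        {
            "__suffix__": "chaos",
            "run": {"prepare": "python setup_chaos.py --kill-interval 200 --max-to-kill 1"},
        },
    ],
}

concrete_tests = expand_variations(distributed_training)
for t in concrete_tests:
    print(t["name"], "prepare" in t["run"])
```

In this reading, the regular variation runs the training script unchanged, while the chaos variation adds a prepare step and keeps the same script and cluster.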
