Heuristic:Recommenders team Recommenders Test Timing Budgets
| Knowledge Sources | |
|---|---|
| Domains | Testing, CI_CD, Infrastructure |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
CI test groups have strict timing budgets: 45 minutes for nightly runs and 15 minutes for PR gates, enforced through manual grouping and documented execution times.
Description
The Recommenders CI pipeline organizes tests into groups that run in parallel on AzureML. Each group has a documented total execution time, and the time of each individual test is recorded as an inline comment. Tests are grouped to balance execution time across parallel workers while respecting hard time limits. Several tests are disabled and marked with `# FIXME` because of TensorFlow compatibility problems (#2018), long execution times (#1731), or flakiness (#1770).
Usage
Apply this heuristic when adding new tests or modifying existing test groups in the CI configuration. Every new test must be assigned to a group that will not exceed the timing budget. Tests that take longer than a few minutes should be placed in nightly groups, not PR gate groups.
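As a rough illustration of the assignment step, the sketch below picks the least-loaded group that still has room for a newly measured test. The group names, the totals, and the `pick_group` helper are hypothetical and are not part of the Recommenders codebase, where grouping is done by hand. Only the two budget values come from this page.

```python
from typing import Dict, Optional

PR_GATE_BUDGET_S = 900    # 15 minutes, the documented PR gate limit
NIGHTLY_BUDGET_S = 2700   # 45 minutes, the documented nightly limit


def pick_group(
    group_totals: Dict[str, float], new_test_s: float, budget_s: float
) -> Optional[str]:
    """Return the least-loaded group that can absorb the new test, or None."""
    candidates = {
        name: total
        for name, total in group_totals.items()
        if total + new_test_s <= budget_s
    }
    if not candidates:
        return None  # no group has enough headroom; a new group is needed
    return min(candidates, key=candidates.get)


# Hypothetical group names and current totals, in seconds.
pr_groups = {"group_cpu_001": 850.0, "group_cpu_002": 610.0}

print(pick_group(pr_groups, 120.0, PR_GATE_BUDGET_S))  # group_cpu_002
print(pick_group(pr_groups, 400.0, PR_GATE_BUDGET_S))  # None
```

A `None` result suggests either creating a new group or moving the test to a nightly group, where the budget is three times larger.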
The Insight (Rule of Thumb)
- Action: When adding a test, measure its execution time and assign it to a group that stays within budget.
- Value: Nightly maximum per group: 2700 seconds (45 minutes). PR gate maximum per group: 900 seconds (15 minutes).
- Trade-off: Strict budgets prevent CI from becoming a bottleneck but require careful test group management.
- Reference machines:
  - GPU tests: Azure STANDARD_NC6S_V2 (6 vCPUs, 112 GB RAM, 1 NVIDIA Tesla P100)
  - CPU/Spark tests: Azure Standard_A8m_v2 (8 vCPUs, 64 GiB RAM)
- Every GPU test group must start with `test_gpu_vm` as the first test to verify GPU availability before running expensive tests.
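The rules in this list can be checked mechanically. The following is a minimal sketch, assuming each group is available as a list of `(test id, measured seconds)` pairs. That data layout, the `check_group` helper, and the test paths in the example are assumptions made for illustration. Only the budgets and the `test_gpu_vm` rule come from this page.

```python
from typing import Dict, List, Tuple

# Documented budgets, in seconds.
BUDGETS_S: Dict[str, int] = {"nightly": 2700, "pr_gate": 900}

# A group is a list of (test id, measured seconds) pairs (assumed layout).
Group = List[Tuple[str, float]]


def check_group(name: str, tests: Group, kind: str, is_gpu: bool) -> List[str]:
    """Return human-readable problems for one group (empty list if none)."""
    problems = []
    total = sum(seconds for _, seconds in tests)
    if total > BUDGETS_S[kind]:
        problems.append(
            f"{name}: {total:.0f}s exceeds the {kind} budget of {BUDGETS_S[kind]}s"
        )
    # GPU groups must verify GPU availability before anything expensive runs.
    if is_gpu and (not tests or tests[0][0].split("::")[-1] != "test_gpu_vm"):
        problems.append(f"{name}: first test must be test_gpu_vm")
    return problems


# Hypothetical group with made-up paths and timings.
gpu_group: Group = [
    ("tests/example_gpu.py::test_gpu_vm", 1.0),
    ("tests/example_gpu.py::test_model_a", 800.0),
    ("tests/example_gpu.py::test_model_b", 150.0),
]

print(check_group("group_gpu_001", gpu_group, "pr_gate", is_gpu=True))
print(check_group("group_gpu_001", gpu_group, "nightly", is_gpu=True))
```

In this example the group totals 951 seconds, so it fails the PR gate check but passes the nightly check.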
Reasoning
The timing constraints are documented at the top of `tests/ci/azureml_tests/test_groups.py:4-12`:
```python
# NOTE:
# The times on GPU environment have been calculated on an Azure STANDARD_NC6S_V2
# with 6 vCPUs, 112 GB memory, 1 NVIDIA Tesla P100 GPU.
# The times on CPU and Spark environments have been calculated on an Azure
# Standard_A8m_v2 with 8 vCPUs and 64 GiB memory.
# IMPORTANT NOTE:
# FOR NIGHTLY, NO GROUP SHOULD SURPASS 45MIN = 2700s !!!
# FOR PR GATE, NO GROUP SHOULD SURPASS 15MIN = 900s !!!
```
Known disabled tests (as of the current codebase):
- xDeepFM and SUM model tests: TF > 2.10.1 incompatibility (issue #2018)
- GPU notebook tests for xdeepfm, naml, nrms, lstur, npa, and wide_deep: issue #1883
- SASRec tests: take too long to run
- NAML quickstart functional: execution time too long (issue #1731)
- Spark tuning test: flaky (issue #1770)
Flaky test handling: PySpark functional tests use `@pytest.mark.flaky(reruns=5, reruns_delay=2)`, which reruns a failing test up to 5 times with a 2-second delay between attempts, to absorb intermittent failures.