Workflow: PrefectHQ Prefect Per-Worker Task Concurrency
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Concurrency_Control, ML_Ops |
| Last Updated | 2026-02-09 22:00 GMT |
Overview
End-to-end process for using Prefect Global Concurrency Limits scoped per worker to control how many tasks can simultaneously consume a shared local resource (such as a GPU), while allowing non-resource-bound tasks to run freely in parallel.
Description
This workflow addresses the problem of resource contention when a worker runs multiple flow runs concurrently. Rather than limiting entire flow runs to run sequentially (which wastes throughput), it applies fine-grained concurrency limits only to the specific tasks that consume scarce resources. Global Concurrency Limits are coordinated by the Prefect server and work across the separate subprocesses that each flow run executes in. By including a worker identifier in the limit name, each machine maintains independent limits.
Key outputs:
- Processed results from a multi-step pipeline where the resource-intensive step is rate-limited
- Maximum throughput for non-resource-bound steps while protecting scarce resources
Scope:
- From work pool and worker configuration through task-level concurrency control
- Applicable to GPU memory, software licenses, local services, or any shared resource
Usage
Execute this workflow pattern when you have a multi-step pipeline where only certain tasks need concurrency limits (e.g., GPU inference, licensed software, memory-intensive processing) and you want to maximize throughput for all other tasks. It is suitable for ML inference pipelines, image processing, and any scenario where a scarce local resource must be shared across concurrent flow runs on the same machine.
Execution Steps
Step 1: Create Global Concurrency Limits
Create a Global Concurrency Limit (GCL) for each worker machine using the Prefect CLI. The limit name includes the worker identity (e.g., gpu:gpu-1), and the limit value controls how many tasks can hold the resource simultaneously.
Key considerations:
- Each worker machine gets its own independently-managed limit
- The limit name convention ({resource}:{worker_id}) ensures isolation
- Limit values should match the machine's resource capacity
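Assuming two worker machines with different GPU capacities, the limits might be created with the `prefect gcl` CLI like this (names and limit values are illustrative):

```shell
# One Global Concurrency Limit per worker machine, named {resource}:{worker_id}.
# --limit is the number of slots, sized to each machine's resource capacity.
prefect gcl create gpu:gpu-1 --limit 2   # gpu-1 fits 2 concurrent inferences
prefect gcl create gpu:gpu-2 --limit 4   # gpu-2 has more GPU memory

# List the configured limits to verify.
prefect gcl ls
```

These commands run against the Prefect server (or Cloud workspace) the workers will connect to, so they can be issued from any machine with the right API configuration.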
Step 2: Configure Work Pool and Deployment
Create a work pool (e.g., of the process type) and deploy the flow to it. The work pool's concurrency limit caps how many flow runs can execute at once, while the GCL gates only the resource-bound step within each run.
Key considerations:
- The work pool limit (e.g., 10 concurrent flow runs) is separate from the task-level GCL
- Deploy the flow using prefect deploy to make it available for scheduling
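A possible CLI sequence for this step; the pool name, entrypoint path, and deployment name are illustrative assumptions:

```shell
# Create a process-type work pool that local workers will poll.
prefect work-pool create gpu-pool --type process

# Optionally cap concurrent flow runs dispatched from this pool
# (separate from, and coarser than, the task-level GCL).
prefect work-pool set-concurrency-limit gpu-pool 10

# Deploy the flow so it can be scheduled onto the pool.
prefect deploy ./pipeline.py:pipeline --name gpu-pipeline --pool gpu-pool
```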
Step 3: Start Workers with Identity
Start workers with a unique WORKER_ID environment variable that matches the GCL name. This identity links the runtime worker process to its corresponding concurrency limit.
Key considerations:
- The WORKER_ID environment variable is read at runtime by the task
- Each worker must have a GCL created with a matching name
- Workers can handle many concurrent flow runs via the --limit flag
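On each machine, the worker can be started with its identity exported; WORKER_ID is this workflow's convention (read by the task code at runtime), not a Prefect built-in flag:

```shell
# The WORKER_ID suffix must match a GCL created in Step 1 (gpu:gpu-1, gpu:gpu-2).
# --limit caps concurrent flow runs on this worker; the GCL still gates the
# resource-bound task within those runs.
WORKER_ID=gpu-1 prefect worker start --pool gpu-pool --limit 10   # on machine 1
WORKER_ID=gpu-2 prefect worker start --pool gpu-pool --limit 10   # on machine 2
```

Because each flow run is launched as a subprocess of the worker, it inherits WORKER_ID from the worker's environment.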
Step 4: Execute Non-Limited Tasks Freely
Tasks that do not consume the scarce resource (e.g., downloading data, saving results) run without any concurrency gate. They execute in parallel across all concurrent flow runs on the worker.
Key considerations:
- Network-bound and I/O-bound tasks should not acquire concurrency limits
- These tasks overlap freely to maximize throughput
Step 5: Acquire Concurrency Limit for Resource-Bound Task
The resource-intensive task (e.g., ML model inference) acquires the per-worker Global Concurrency Limit using the concurrency context manager before executing. The server coordinates slot acquisition across all subprocesses on the same worker.
Key considerations:
- The concurrency context manager blocks until a slot is available
- The slot is released when the context manager exits (success or failure)
- The occupy parameter controls how many slots each invocation consumes
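A minimal sketch of the gated task, assuming a GCL named gpu:{WORKER_ID} already exists (Step 1) and WORKER_ID is set in the worker's environment (Step 3); the task name and inference body are stand-ins:

```python
import os

from prefect import task
from prefect.concurrency.sync import concurrency

# WORKER_ID and the "gpu:" prefix follow this workflow's naming convention;
# they are assumptions, not Prefect built-ins.
WORKER_ID = os.environ.get("WORKER_ID", "gpu-1")

@task
def run_inference(batch: list[float]) -> list[float]:
    # Blocks until a slot on this worker's limit (e.g. "gpu:gpu-1") is free;
    # the slot is released when the block exits, on success or failure.
    with concurrency(f"gpu:{WORKER_ID}", occupy=1):
        return [x * 2.0 for x in batch]  # stand-in for GPU model inference
```

Raising occupy above 1 lets a single invocation reserve a larger share of the resource, e.g. a task that loads two models at once.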
Step 6: Return Results
Once the resource-bound task completes and releases its concurrency slot, downstream tasks (e.g., saving results) run immediately. The next queued flow run's resource-bound task can then acquire the freed slot.
Key considerations:
- Results flow through the pipeline without additional coordination
- The pattern maximizes overall throughput while protecting the scarce resource
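The full pattern from Steps 4 through 6 can be sketched as one flow. All names are illustrative, and a GCL named gpu:{WORKER_ID} is assumed to exist; only the middle step acquires it:

```python
import os

from prefect import flow, task
from prefect.concurrency.sync import concurrency

WORKER_ID = os.environ.get("WORKER_ID", "gpu-1")  # assumed convention

@task
def download(item_id: int) -> list[float]:
    return [float(item_id)] * 4             # stand-in: network fetch, no gate

@task
def infer(batch: list[float]) -> list[float]:
    with concurrency(f"gpu:{WORKER_ID}", occupy=1):  # per-worker GCL
        return [x * 2.0 for x in batch]     # stand-in: GPU inference

@task
def save(result: list[float]) -> None:
    print(result)                           # stand-in: persist results, no gate

@flow
def pipeline(item_id: int) -> None:
    data = download(item_id)
    result = infer(data)                    # only this step is rate-limited
    save(result)                            # runs as soon as the slot frees
```

When many runs of this flow execute concurrently on one worker, download and save calls overlap freely, while infer calls queue on the worker's slot count.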