Overview
EC2_ASG_CloudFormation is an AWS CloudFormation template that deploys a production-ready, multi-instance TorchServe cluster behind an Application Load Balancer (ALB), managed by an Auto Scaling Group (ASG). The template provisions a complete VPC with subnets, an ALB with three target groups (inference on port 8080, management on port 8081, metrics on port 8082), an ASG of 3 to 5 instances, an EFS-based shared model store, CloudWatch monitoring, and CPU-based scaling policies that add instances above 90% utilization and remove them below 70%.
Description
This CloudFormation template automates the deployment of a horizontally scalable TorchServe inference cluster on AWS. It is designed for production workloads where high availability and elastic scaling are required. The architecture uses an Application Load Balancer to distribute traffic across multiple EC2 instances, each running TorchServe, with a shared EFS filesystem for model artifacts.
Key Resources
- VPC and Networking: Creates a dedicated VPC with public subnets across availability zones, internet gateway, and route tables
- Application Load Balancer (ALB): Routes traffic to three distinct target groups:
  - Inference Target Group (port 8080): Handles prediction requests
  - Management Target Group (port 8081): Handles model registration and scaling
  - Metrics Target Group (port 8082): Exposes Prometheus-compatible metrics
- Auto Scaling Group (ASG): Maintains 3-5 EC2 instances running TorchServe, scaling based on CPU utilization
- EFS Shared Model Store: Provides a shared filesystem so all instances access the same model artifacts without duplication
- CloudWatch Monitoring: Collects metrics for scaling decisions and operational visibility
Scaling Policies
| Policy | Trigger | Action |
|---|---|---|
| Scale Up | CPU utilization > 90% | Add instances (up to max 5) |
| Scale Down | CPU utilization < 70% | Remove instances (down to min 3) |
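The thresholds and bounds in the table can be read as a simple decision rule. The following is a minimal sketch of that rule, not the template's implementation: the function name `next_capacity` is hypothetical, the step of one instance is taken from the `ScalingAdjustment` values in the template excerpt, and the real behavior is driven by CloudWatch alarms and ASG policies, whose evaluation periods and cooldowns are not modeled here.

```shell
# Hypothetical helper: models the scale-up / scale-down rule from the table.
# $1: current instance count, $2: average CPU utilization (integer percent).
next_capacity() {
  current=$1
  cpu=$2
  min=3
  max=5
  if [ "$cpu" -gt 90 ] && [ "$current" -lt "$max" ]; then
    echo $((current + 1))   # above 90%: add one instance, capped at max
  elif [ "$cpu" -lt 70 ] && [ "$current" -gt "$min" ]; then
    echo $((current - 1))   # below 70%: remove one instance, floored at min
  else
    echo "$current"         # between thresholds, or already at a bound
  fi
}
```

For example, `next_capacity 3 95` prints `4`, while `next_capacity 5 95` stays at `5` because the group is already at its maximum size.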
Parameters
| Parameter | Description | Required |
|---|---|---|
| KeyName | EC2 key pair name for SSH access | Yes |
| InstanceType | EC2 instance type (e.g., g4dn.xlarge for GPU inference) | No (defaults to g4dn.xlarge) |
| ModelPath | S3 path or local path for model artifacts to load into EFS | Yes |
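As an alternative to listing each parameter on the command line, the AWS CLI accepts a JSON parameter file. The sketch below writes one; the file name `torchserve-params.json` and all values are placeholders chosen for illustration.

```shell
# Write the stack parameters to a JSON file (file name and values are placeholders).
cat > torchserve-params.json <<'EOF'
[
  {"ParameterKey": "KeyName",      "ParameterValue": "my-key-pair"},
  {"ParameterKey": "InstanceType", "ParameterValue": "g4dn.xlarge"},
  {"ParameterKey": "ModelPath",    "ParameterValue": "s3://my-bucket/models/"}
]
EOF

# The file can then be passed to create-stack with:
#   --parameters file://torchserve-params.json
```

Keeping parameters in a file makes it easier to version per-environment values alongside the template.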
Code Reference
Source Location
| File | Lines | Repository |
|---|---|---|
| examples/cloudformation/ec2-asg.yaml | L1-648 | pytorch/serve |
Usage
Deploy the template using the AWS CLI:
# Deploy the CloudFormation stack
aws cloudformation create-stack \
  --stack-name torchserve-asg \
  --template-body file://examples/cloudformation/ec2-asg.yaml \
  --parameters \
    ParameterKey=KeyName,ParameterValue=my-key-pair \
    ParameterKey=InstanceType,ParameterValue=g4dn.xlarge \
    ParameterKey=ModelPath,ParameterValue=s3://my-bucket/models/ \
  --capabilities CAPABILITY_IAM
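`create-stack` returns as soon as the request is accepted, so it can help to validate the template beforehand and to wait for completion afterwards. The wrappers below are a sketch: the function names are hypothetical, while `validate-template` and `wait stack-create-complete` are standard AWS CLI subcommands.

```shell
# Hypothetical wrapper: check the template for syntax errors before deploying.
# $1: path to the template file.
validate_template() {
  aws cloudformation validate-template --template-body "file://$1"
}

# Hypothetical wrapper: block until the stack reaches CREATE_COMPLETE.
# Exits non-zero if creation fails or rolls back.
# $1: stack name.
wait_for_stack() {
  aws cloudformation wait stack-create-complete --stack-name "$1"
}
```

A typical sequence would be `validate_template examples/cloudformation/ec2-asg.yaml`, then the `create-stack` command above, then `wait_for_stack torchserve-asg`.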
Template Structure (Excerpt)
AWSTemplateFormatVersion: '2010-09-09'
Description: Multi-instance TorchServe with ALB and ASG
Parameters:
  KeyName:
    Type: AWS::EC2::KeyPair::KeyName
    Description: EC2 key pair for SSH access
  InstanceType:
    Type: String
    Default: g4dn.xlarge
    Description: EC2 instance type
  ModelPath:
    Type: String
    Description: Path to model artifacts
Resources:
  # VPC and Networking
  VPC:
    Type: AWS::EC2::VPC
    Properties:
      CidrBlock: 10.0.0.0/16
  # Application Load Balancer
  ALB:
    Type: AWS::ElasticLoadBalancingV2::LoadBalancer
    Properties:
      Scheme: internet-facing
      Type: application
  # Target Groups for each TorchServe port
  InferenceTargetGroup:
    Type: AWS::ElasticLoadBalancingV2::TargetGroup
    Properties:
      Port: 8080
      Protocol: HTTP
  ManagementTargetGroup:
    Type: AWS::ElasticLoadBalancingV2::TargetGroup
    Properties:
      Port: 8081
      Protocol: HTTP
  MetricsTargetGroup:
    Type: AWS::ElasticLoadBalancingV2::TargetGroup
    Properties:
      Port: 8082
      Protocol: HTTP
  # Auto Scaling Group
  TorchServeASG:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      MinSize: '3'
      MaxSize: '5'
      DesiredCapacity: '3'
  # EFS for shared model store
  ModelStoreEFS:
    Type: AWS::EFS::FileSystem
  # Scaling Policies
  ScaleUpPolicy:
    Type: AWS::AutoScaling::ScalingPolicy
    Properties:
      AdjustmentType: ChangeInCapacity
      ScalingAdjustment: 1
  CPUAlarmHigh:
    Type: AWS::CloudWatch::Alarm
    Properties:
      MetricName: CPUUtilization
      Threshold: 90
      ComparisonOperator: GreaterThanThreshold
  ScaleDownPolicy:
    Type: AWS::AutoScaling::ScalingPolicy
    Properties:
      AdjustmentType: ChangeInCapacity
      ScalingAdjustment: -1
  CPUAlarmLow:
    Type: AWS::CloudWatch::Alarm
    Properties:
      MetricName: CPUUtilization
      Threshold: 70
      ComparisonOperator: LessThanThreshold
I/O Contract
| Input | Type | Description |
|---|---|---|
| CloudFormation Parameters | YAML key-value pairs | KeyName, InstanceType, ModelPath |

| Output | Type | Description |
|---|---|---|
| ALB DNS Name | String | Public DNS endpoint for inference requests (port 8080) |
| Management Endpoint | String | ALB DNS endpoint for management API (port 8081) |
| Metrics Endpoint | String | ALB DNS endpoint for Prometheus metrics (port 8082) |
| ASG Name | String | Name of the Auto Scaling Group for operational reference |
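Stack outputs can be read back with `describe-stacks`. The helpers below are a sketch with hypothetical function names. The only output key this page shows is `ALBDNSName` (in Example 1), so the keys for the other outputs are not assumed here; listing all outputs first is the safer way to discover them.

```shell
# Hypothetical helper: print every output key and value of a stack as a table.
# $1: stack name.
list_stack_outputs() {
  aws cloudformation describe-stacks \
    --stack-name "$1" \
    --query 'Stacks[0].Outputs[].[OutputKey,OutputValue]' \
    --output table
}

# Hypothetical helper: print a single output value.
# $1: stack name, $2: output key as defined in the template's Outputs section.
stack_output() {
  aws cloudformation describe-stacks \
    --stack-name "$1" \
    --query "Stacks[0].Outputs[?OutputKey=='$2'].OutputValue" \
    --output text
}
```

For instance, `ALB_DNS=$(stack_output torchserve-asg ALBDNSName)` would capture the load balancer address for the examples that follow.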
Usage Examples
Example 1: Send inference request through ALB
# After stack creation, get the ALB DNS name
ALB_DNS=$(aws cloudformation describe-stacks \
  --stack-name torchserve-asg \
  --query 'Stacks[0].Outputs[?OutputKey==`ALBDNSName`].OutputValue' \
  --output text)
# Send inference request
curl -X POST http://${ALB_DNS}:8080/predictions/resnet-18 \
  -T image.jpg
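Instances may need some time after stack creation before TorchServe is ready to answer. The sketch below polls TorchServe's `/ping` health endpoint on the inference port until it reports `Healthy`. The function name, the retry count, and the 5-second delay are arbitrary illustrative choices, and the sketch assumes the ALB forwards `/ping` on port 8080 like any other inference request.

```shell
# Hypothetical helper: poll the TorchServe /ping endpoint until it reports Healthy.
# $1: host (for example the ALB DNS name), $2: max attempts (default 10).
wait_until_healthy() {
  host=$1
  attempts=${2:-10}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # /ping answers with a small JSON document containing a status field.
    if curl -s "http://${host}:8080/ping" | grep -q 'Healthy'; then
      return 0
    fi
    i=$((i + 1))
    sleep 5
  done
  return 1   # gave up after the configured number of attempts
}
```

Calling `wait_until_healthy "$ALB_DNS"` before the first prediction request avoids mistaking a cold start for a failed deployment.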
Example 2: Register model across cluster via management endpoint
# Register the model: the ALB forwards the request to one instance; the shared EFS model store makes the .mar file available to all instances
curl -X POST "http://${ALB_DNS}:8081/models?url=resnet-18.mar&initial_workers=1&synchronous=true"
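After registering, the management API can be queried to confirm the result. The helpers below are a sketch with hypothetical function names around the TorchServe list-models and describe-model calls. Since the ALB forwards each management request to a single instance, a response describes only the instance that answered, and other instances may not report the model as registered.

```shell
# Hypothetical helper: list the models registered on the instance that answers.
# $1: host (for example the ALB DNS name).
list_models() {
  curl -s "http://$1:8081/models"
}

# Hypothetical helper: show one model's status, including its workers.
# $1: host, $2: model name.
describe_model() {
  curl -s "http://$1:8081/models/$2"
}
```

Running `describe_model "$ALB_DNS" resnet-18` several times gives a rough indication of whether different instances agree on the model's state.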
Example 3: Monitor cluster metrics
# Scrape metrics from the metrics endpoint
curl http://${ALB_DNS}:8082/metrics
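The metrics endpoint returns Prometheus text format, which is easy to post-process on the command line. The sketch below sums all samples of one metric across its label sets. The function name is hypothetical, the metric name in the usage note is only an example and may differ between TorchServe versions, and the parser assumes samples carry no trailing timestamp. Because the ALB forwards each request to a single instance, one scrape reflects that instance only, not the whole cluster.

```shell
# Hypothetical helper: sum every sample of one metric from Prometheus text
# format read on stdin. Assumes the value is the last field on each line.
# $1: metric name.
metric_sum() {
  awk -v name="$1" '
    /^#/ { next }                 # skip "# HELP" and "# TYPE" comment lines
    {
      key = $1
      sub(/[{].*/, "", key)       # drop the label set, keeping the metric name
      if (key == name) total += $NF
    }
    END { print total + 0 }       # prints 0 when the metric is absent
  '
}
```

Usage, with an example metric name: `curl -s "http://${ALB_DNS}:8082/metrics" | metric_sum ts_inference_requests_total`.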