Implementation:Pytorch Serve EC2 ASG CloudFormation

Overview

EC2_ASG_CloudFormation is an AWS CloudFormation template that deploys a production-ready, multi-instance TorchServe cluster behind an Application Load Balancer (ALB) and an Auto Scaling Group (ASG). The template provisions a complete VPC with subnets, an ALB with three target groups (inference on port 8080, management on port 8081, metrics on port 8082), an ASG of 3 to 5 instances, an EFS-based shared model store, CloudWatch monitoring, and CPU-based scaling policies that add an instance when CPU utilization exceeds 90% and remove one when it falls below 70%.

Field	Value
Implementation Name	EC2_ASG_CloudFormation
Type	Infrastructure as Code
Workflow	Cloud_Deployment
Domains	Cloud_Infrastructure, Model_Serving
Knowledge Sources	Pytorch_Serve
Last Updated	2026-02-13 18:52 GMT

Description

This CloudFormation template automates the deployment of a horizontally scalable TorchServe inference cluster on AWS. It is designed for production workloads where high availability and elastic scaling are required. The architecture uses an Application Load Balancer to distribute traffic across multiple EC2 instances, each running TorchServe, with a shared EFS filesystem for model artifacts.

Key Resources

VPC and Networking: Creates a dedicated VPC with public subnets across availability zones, internet gateway, and route tables
Application Load Balancer (ALB): Routes traffic to three distinct target groups:
- Inference Target Group (port 8080): Handles prediction requests
- Management Target Group (port 8081): Handles model registration and scaling
- Metrics Target Group (port 8082): Exposes Prometheus-compatible metrics
Auto Scaling Group (ASG): Maintains 3-5 EC2 instances running TorchServe, scaling based on CPU utilization
EFS Shared Model Store: Provides a shared filesystem so all instances access the same model artifacts without duplication
CloudWatch Monitoring: Collects metrics for scaling decisions and operational visibility

Scaling Policies

Policy	Trigger	Action
Scale Up	CPU utilization > 90%	Add instances (up to max 5)
Scale Down	CPU utilization < 70%	Remove instances (down to min 3)
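The two thresholds leave a band between 70% and 90% in which capacity does not change. The following is a minimal sketch of that decision logic, assuming integer CPU percentages and a one-instance step as in the template excerpt. The function name `scaling_decision` is hypothetical, and the real behavior also depends on CloudWatch alarm evaluation periods and cooldowns, which are not shown on this page.

```shell
# Hypothetical helper: mirrors the thresholds and size limits from the table above.
# Arguments: average CPU utilization (integer percent), current instance count.
scaling_decision() {
  local cpu="$1" capacity="$2"
  local min_size=3 max_size=5
  if [ "$cpu" -gt 90 ] && [ "$capacity" -lt "$max_size" ]; then
    echo "scale_up"      # add one instance
  elif [ "$cpu" -lt 70 ] && [ "$capacity" -gt "$min_size" ]; then
    echo "scale_down"    # remove one instance
  else
    echo "hold"          # inside the 70-90% band, or already at a size limit
  fi
}
```

For example, `scaling_decision 95 5` prints `hold` because the group is already at its maximum size.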

Parameters

Parameter	Description	Required
`KeyName`	EC2 key pair name for SSH access	Yes
`InstanceType`	EC2 instance type (e.g., `g4dn.xlarge` for GPU inference)	No (defaults to `g4dn.xlarge`)
`ModelPath`	S3 path or local path for model artifacts to load into EFS	Yes

Code Reference

Source Location

File	Lines	Repository
`examples/cloudformation/ec2-asg.yaml`	L1-648	pytorch/serve

Usage

Deploy the template using the AWS CLI:

# Deploy the CloudFormation stack
aws cloudformation create-stack \
  --stack-name torchserve-asg \
  --template-body file://examples/cloudformation/ec2-asg.yaml \
  --parameters \
    ParameterKey=KeyName,ParameterValue=my-key-pair \
    ParameterKey=InstanceType,ParameterValue=g4dn.xlarge \
    ParameterKey=ModelPath,ParameterValue=s3://my-bucket/models/ \
  --capabilities CAPABILITY_IAM
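`create-stack` returns as soon as the request is accepted, not when the resources exist. A small wrapper such as the one below can block until creation finishes. This is a sketch that assumes the AWS CLI is configured for the target account and region, and the function name `wait_for_stack` is hypothetical.

```shell
# Hypothetical helper: block until stack creation finishes, then print the final status.
wait_for_stack() {
  local stack_name="$1"
  # The waiter polls until CREATE_COMPLETE and exits non-zero on failure or timeout.
  aws cloudformation wait stack-create-complete --stack-name "$stack_name" || {
    echo "stack ${stack_name} did not reach CREATE_COMPLETE" >&2
    return 1
  }
  aws cloudformation describe-stacks \
    --stack-name "$stack_name" \
    --query 'Stacks[0].StackStatus' \
    --output text
}
```

Calling `wait_for_stack torchserve-asg` after the deploy command keeps later steps from running against a half-created stack.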

Template Structure (Excerpt)

AWSTemplateFormatVersion: '2010-09-09'
Description: Multi-instance TorchServe with ALB and ASG

Parameters:
  KeyName:
    Type: AWS::EC2::KeyPair::KeyName
    Description: EC2 key pair for SSH access
  InstanceType:
    Type: String
    Default: g4dn.xlarge
    Description: EC2 instance type
  ModelPath:
    Type: String
    Description: Path to model artifacts

Resources:
  # VPC and Networking
  VPC:
    Type: AWS::EC2::VPC
    Properties:
      CidrBlock: 10.0.0.0/16

  # Application Load Balancer
  ALB:
    Type: AWS::ElasticLoadBalancingV2::LoadBalancer
    Properties:
      Scheme: internet-facing
      Type: application

  # Target Groups for each TorchServe port
  InferenceTargetGroup:
    Type: AWS::ElasticLoadBalancingV2::TargetGroup
    Properties:
      Port: 8080
      Protocol: HTTP

  ManagementTargetGroup:
    Type: AWS::ElasticLoadBalancingV2::TargetGroup
    Properties:
      Port: 8081
      Protocol: HTTP

  MetricsTargetGroup:
    Type: AWS::ElasticLoadBalancingV2::TargetGroup
    Properties:
      Port: 8082
      Protocol: HTTP

  # Auto Scaling Group
  TorchServeASG:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      MinSize: '3'
      MaxSize: '5'
      DesiredCapacity: '3'

  # EFS for shared model store
  ModelStoreEFS:
    Type: AWS::EFS::FileSystem

  # Scaling Policies
  ScaleUpPolicy:
    Type: AWS::AutoScaling::ScalingPolicy
    Properties:
      AdjustmentType: ChangeInCapacity
      ScalingAdjustment: 1

  CPUAlarmHigh:
    Type: AWS::CloudWatch::Alarm
    Properties:
      MetricName: CPUUtilization
      Threshold: 90
      ComparisonOperator: GreaterThanThreshold

  ScaleDownPolicy:
    Type: AWS::AutoScaling::ScalingPolicy
    Properties:
      AdjustmentType: ChangeInCapacity
      ScalingAdjustment: -1

  CPUAlarmLow:
    Type: AWS::CloudWatch::Alarm
    Properties:
      MetricName: CPUUtilization
      Threshold: 70
      ComparisonOperator: LessThanThreshold

I/O Contract

Input	Type	Description
CloudFormation Parameters	YAML key-value pairs	`KeyName`, `InstanceType`, `ModelPath`

Output	Type	Description
ALB DNS Name	String	Public DNS endpoint for inference requests (port 8080)
Management Endpoint	String	ALB DNS endpoint for management API (port 8081)
Metrics Endpoint	String	ALB DNS endpoint for Prometheus metrics (port 8082)
ASG Name	String	Name of the Auto Scaling Group for operational reference
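The outputs listed above can be read back with `describe-stacks`. The sketch below assumes the AWS CLI is configured. The helper names are hypothetical, and the only output key confirmed on this page is `ALBDNSName`, so listing all outputs first is the safer way to find the others.

```shell
# Hypothetical helper: print every output of a stack as "key<TAB>value" lines.
list_stack_outputs() {
  local stack_name="$1"
  aws cloudformation describe-stacks \
    --stack-name "$stack_name" \
    --query 'Stacks[0].Outputs[].[OutputKey,OutputValue]' \
    --output text
}

# Hypothetical helper: print one stack output value by key.
stack_output() {
  local stack_name="$1" output_key="$2"
  aws cloudformation describe-stacks \
    --stack-name "$stack_name" \
    --query "Stacks[0].Outputs[?OutputKey=='${output_key}'].OutputValue" \
    --output text
}
```

For instance, `stack_output torchserve-asg ALBDNSName` prints the load balancer's DNS name.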

Usage Examples

Example 1: Send inference request through ALB

# After stack creation, get the ALB DNS name
ALB_DNS=$(aws cloudformation describe-stacks \
  --stack-name torchserve-asg \
  --query 'Stacks[0].Outputs[?OutputKey==`ALBDNSName`].OutputValue' \
  --output text)

# Send inference request
curl -X POST http://${ALB_DNS}:8080/predictions/resnet-18 \
  -T image.jpg
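Freshly launched instances may need some time before TorchServe answers, so it can help to poll the inference API's `/ping` health check before sending predictions. The loop below is a sketch; the function name, attempt count, and delay are hypothetical choices, and it assumes the inference listener is reachable at the given base URL.

```shell
# Hypothetical helper: poll the /ping endpoint until it reports Healthy.
# Arguments: base URL, number of attempts (default 10), delay in seconds (default 5).
wait_until_healthy() {
  local base_url="$1" attempts="${2:-10}" delay="${3:-5}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    # -f: fail on HTTP errors, -s: silent, --max-time: do not hang on an unreachable target
    if curl -fs --max-time 5 "${base_url}/ping" | grep -q 'Healthy'; then
      echo "healthy after ${i} attempt(s)"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "no healthy response from ${base_url}/ping" >&2
  return 1
}
```

A typical call would be `wait_until_healthy "http://${ALB_DNS}:8080"` right after the stack is created.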

Example 2: Register model across cluster via management endpoint

# Register model: the ALB forwards the request to one instance; EFS makes the .mar file available to all instances
curl -X POST "http://${ALB_DNS}:8081/models?url=resnet-18.mar&initial_workers=1&synchronous=true"

Example 3: Monitor cluster metrics

# Scrape metrics from the metrics endpoint
curl http://${ALB_DNS}:8082/metrics
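The metrics endpoint returns Prometheus text format, so a single counter can be pulled out with standard tools. The helper below is a sketch that sums all samples of one metric read from stdin; the function name `sum_metric` is hypothetical, and it assumes samples carry no trailing timestamp. Because the ALB routes each scrape to a single instance, the result reflects that instance only, not the whole cluster.

```shell
# Hypothetical helper: sum the samples of one metric from Prometheus text-format input on stdin.
sum_metric() {
  local metric_name="$1"
  awk -v name="$metric_name" '
    /^#/ { next }                        # skip HELP and TYPE comment lines
    {
      split($1, parts, "{")              # strip the label set, if any
      if (parts[1] == name) total += $NF # the sample value is the last field
    }
    END { printf "%d\n", total }
  '
}
```

Usage might look like `curl -s "http://${ALB_DNS}:8082/metrics" | sum_metric ts_inference_requests_total`, where the metric name is an example and should be replaced with one that appears in the actual scrape output.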

Related Pages

Principle:Pytorch_Serve_Cloud_Deployment - Cloud deployment principle this template implements
Implementation:Pytorch_Serve_EC2_CloudFormation - Simpler single-instance CloudFormation template
Implementation:Pytorch_Serve_Management_API - Management API exposed through port 8081 target group
Implementation:Pytorch_Serve_Metrics_API - Metrics API exposed through port 8082 target group
