Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:Pytorch Serve EC2 ASG CloudFormation

From Leeroopedia

Overview

EC2_ASG_CloudFormation is an AWS CloudFormation template that deploys a production-ready, multi-instance TorchServe cluster behind an Application Load Balancer with Auto Scaling Group (ASG). The template provisions a complete VPC, subnets, ALB with three target groups (inference on port 8080, management on port 8081, metrics on port 8082), an ASG with 3-5 instances, EFS-based shared model store, CloudWatch monitoring, and CPU-based scaling policies that scale up above 90% utilization and scale down below 70%.

Field Value
Implementation Name EC2_ASG_CloudFormation
Type Infrastructure as Code
Workflow Cloud_Deployment
Domains Cloud_Infrastructure, Model_Serving
Knowledge Sources Pytorch_Serve
Last Updated 2026-02-13 18:52 GMT

Description

This CloudFormation template automates the deployment of a horizontally scalable TorchServe inference cluster on AWS. It is designed for production workloads where high availability and elastic scaling are required. The architecture uses an Application Load Balancer to distribute traffic across multiple EC2 instances, each running TorchServe, with a shared EFS filesystem for model artifacts.

Key Resources

  • VPC and Networking: Creates a dedicated VPC with public subnets across availability zones, internet gateway, and route tables
  • Application Load Balancer (ALB): Routes traffic to three distinct target groups:
    • Inference Target Group (port 8080): Handles prediction requests
    • Management Target Group (port 8081): Handles model registration and scaling
    • Metrics Target Group (port 8082): Exposes Prometheus-compatible metrics
  • Auto Scaling Group (ASG): Maintains 3-5 EC2 instances running TorchServe, scaling based on CPU utilization
  • EFS Shared Model Store: Provides a shared filesystem so all instances access the same model artifacts without duplication
  • CloudWatch Monitoring: Collects metrics for scaling decisions and operational visibility

Scaling Policies

Policy Trigger Action
Scale Up CPU utilization > 90% Add instances (up to max 5)
Scale Down CPU utilization < 70% Remove instances (down to min 3)

Parameters

Parameter Description Required
KeyName EC2 key pair name for SSH access Yes
InstanceType EC2 instance type (e.g., g4dn.xlarge for GPU inference) Yes
ModelPath S3 path or local path for model artifacts to load into EFS Yes

Code Reference

Source Location

File Lines Repository
examples/cloudformation/ec2-asg.yaml L1-648 pytorch/serve

Usage

Deploy the template using the AWS CLI:

# Deploy the CloudFormation stack
aws cloudformation create-stack \
  --stack-name torchserve-asg \
  --template-body file://examples/cloudformation/ec2-asg.yaml \
  --parameters \
    ParameterKey=KeyName,ParameterValue=my-key-pair \
    ParameterKey=InstanceType,ParameterValue=g4dn.xlarge \
    ParameterKey=ModelPath,ParameterValue=s3://my-bucket/models/ \
  --capabilities CAPABILITY_IAM

Template Structure (Excerpt)

AWSTemplateFormatVersion: '2010-09-09'
Description: Multi-instance TorchServe with ALB and ASG

Parameters:
  KeyName:
    Type: AWS::EC2::KeyPair::KeyName
    Description: EC2 key pair for SSH access
  InstanceType:
    Type: String
    Default: g4dn.xlarge
    Description: EC2 instance type
  ModelPath:
    Type: String
    Description: Path to model artifacts

Resources:
  # VPC and Networking
  VPC:
    Type: AWS::EC2::VPC
    Properties:
      CidrBlock: 10.0.0.0/16

  # Application Load Balancer
  ALB:
    Type: AWS::ElasticLoadBalancingV2::LoadBalancer
    Properties:
      Scheme: internet-facing
      Type: application

  # Target Groups for each TorchServe port
  InferenceTargetGroup:
    Type: AWS::ElasticLoadBalancingV2::TargetGroup
    Properties:
      Port: 8080
      Protocol: HTTP

  ManagementTargetGroup:
    Type: AWS::ElasticLoadBalancingV2::TargetGroup
    Properties:
      Port: 8081
      Protocol: HTTP

  MetricsTargetGroup:
    Type: AWS::ElasticLoadBalancingV2::TargetGroup
    Properties:
      Port: 8082
      Protocol: HTTP

  # Auto Scaling Group
  TorchServeASG:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      MinSize: '3'
      MaxSize: '5'
      DesiredCapacity: '3'

  # EFS for shared model store
  ModelStoreEFS:
    Type: AWS::EFS::FileSystem

  # Scaling Policies
  ScaleUpPolicy:
    Type: AWS::AutoScaling::ScalingPolicy
    Properties:
      AdjustmentType: ChangeInCapacity
      ScalingAdjustment: 1

  CPUAlarmHigh:
    Type: AWS::CloudWatch::Alarm
    Properties:
      MetricName: CPUUtilization
      Threshold: 90
      ComparisonOperator: GreaterThanThreshold

  ScaleDownPolicy:
    Type: AWS::AutoScaling::ScalingPolicy
    Properties:
      AdjustmentType: ChangeInCapacity
      ScalingAdjustment: -1

  CPUAlarmLow:
    Type: AWS::CloudWatch::Alarm
    Properties:
      MetricName: CPUUtilization
      Threshold: 70
      ComparisonOperator: LessThanThreshold

I/O Contract

Input Type Description
CloudFormation Parameters YAML key-value pairs KeyName, InstanceType, ModelPath
Output Type Description
ALB DNS Name String Public DNS endpoint for inference requests (port 8080)
Management Endpoint String ALB DNS endpoint for management API (port 8081)
Metrics Endpoint String ALB DNS endpoint for Prometheus metrics (port 8082)
ASG Name String Name of the Auto Scaling Group for operational reference

Usage Examples

Example 1: Send inference request through ALB

# After stack creation, get the ALB DNS name
ALB_DNS=$(aws cloudformation describe-stacks \
  --stack-name torchserve-asg \
  --query 'Stacks[0].Outputs[?OutputKey==`ALBDNSName`].OutputValue' \
  --output text)

# Send inference request
curl -X POST http://${ALB_DNS}:8080/predictions/resnet-18 \
  -T image.jpg

Example 2: Register model across cluster via management endpoint

# Register model - ALB forwards to one instance, EFS shares model to all
curl -X POST "http://${ALB_DNS}:8081/models?url=resnet-18.mar&initial_workers=1&synchronous=true"

Example 3: Monitor cluster metrics

# Scrape metrics from the metrics endpoint
curl http://${ALB_DNS}:8082/metrics

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment