Autoscaling Monitoring Sample

This sample demonstrates three advanced Cadence worker features:

Worker Poller Autoscaling - Dynamic adjustment of worker poller goroutines based on workload
Integrated Prometheus Metrics - Real-time metrics collection using Tally with a Prometheus reporter
Autoscaling Metrics - Comprehensive autoscaling behavior metrics exposed via HTTP endpoint

Features

Worker Poller Autoscaling

The worker is created with worker.NewV2 and AutoScalerOptions, which enable poller autoscaling:

AutoScalerOptions.Enabled: true - Enables the autoscaling feature
PollerMinCount: 2 - Minimum number of poller goroutines
PollerMaxCount: 8 - Maximum number of poller goroutines
PollerInitCount: 4 - Initial number of poller goroutines

The worker automatically adjusts the number of poller goroutines between the min and max values based on the current workload.
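The three poller settings above only make sense together when the minimum does not exceed the maximum and the initial count lies between them. The following is a minimal sketch of such a sanity check. PollerBounds and Validate are hypothetical names used for illustration, not types from the Cadence client, and the ordering rule is an assumption rather than documented client behavior.

```go
package main

import (
	"errors"
	"fmt"
)

// PollerBounds is a hypothetical holder for the three poller settings
// described above; it is not a type from the Cadence client.
type PollerBounds struct {
	Min, Max, Init int
}

// Validate checks the assumed ordering 1 <= Min <= Init <= Max.
func (b PollerBounds) Validate() error {
	if b.Min < 1 {
		return errors.New("pollerMinCount must be at least 1")
	}
	if b.Max < b.Min {
		return fmt.Errorf("pollerMaxCount (%d) must be >= pollerMinCount (%d)", b.Max, b.Min)
	}
	if b.Init < b.Min || b.Init > b.Max {
		return fmt.Errorf("pollerInitCount (%d) must be within [%d, %d]", b.Init, b.Min, b.Max)
	}
	return nil
}

func main() {
	// The values used by this sample.
	sample := PollerBounds{Min: 2, Max: 8, Init: 4}
	fmt.Println("sample values valid:", sample.Validate() == nil)

	// An initial count above the maximum is rejected.
	fmt.Println(PollerBounds{Min: 2, Max: 8, Init: 9}.Validate())
}
```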

Prometheus Metrics

The sample uses Tally with a Prometheus reporter to expose comprehensive metrics:

Real-time autoscaling metrics - Poller count changes, quota adjustments, wait times
Worker performance metrics - Task processing rates, poller utilization, queue depths
Standard Cadence metrics - All metrics automatically emitted by the Cadence Go client
Sanitized metric names - Prometheus-compatible metric names and labels

Monitoring Dashboards

When running the Cadence server locally with Grafana, you can access the client dashboards at:

Client Dashboards: http://localhost:3000/d/dehkspwgabvuoc/cadence-client

Note: Select a domain from the dropdown in Grafana. The dashboards stay empty until a domain is selected.

Prerequisites

Cadence Server: Running locally with Docker Compose.
Prometheus: Configured to scrape metrics from the sample.
Grafana: With the Cadence dashboards, which are included in the default setup of the latest Cadence server version.

Quick Start

1. Start the Worker

./bin/autoscaling-monitoring -m worker

The worker automatically exposes metrics at: http://127.0.0.1:8004/metrics

2. Generate Load

./bin/autoscaling-monitoring -m trigger

Configuration

The sample uses a custom configuration system that extends the base Cadence configuration. You can specify a configuration file using the -config flag:

./bin/autoscaling-monitoring -m worker -config /path/to/config.yaml
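For readers who want to see how the two flags used in this README could be handled, here is a minimal sketch built on the standard flag package. The function name parseArgs, the default mode, and the error message are assumptions for illustration; the sample's actual flag handling may differ.

```go
package main

import (
	"flag"
	"fmt"
)

// parseArgs parses the two flags mentioned in this README: -m and -config.
// Defaulting -m to "worker" is an assumption made for this sketch.
func parseArgs(args []string) (mode, configPath string, err error) {
	fs := flag.NewFlagSet("autoscaling-monitoring", flag.ContinueOnError)
	fs.StringVar(&mode, "m", "worker", "mode: worker or trigger")
	fs.StringVar(&configPath, "config", "", "path to a YAML configuration file (optional)")
	if err = fs.Parse(args); err != nil {
		return "", "", err
	}
	if mode != "worker" && mode != "trigger" {
		return "", "", fmt.Errorf("unknown mode %q, expected worker or trigger", mode)
	}
	return mode, configPath, nil
}

func main() {
	mode, cfg, err := parseArgs([]string{"-m", "worker", "-config", "/path/to/config.yaml"})
	fmt.Println(mode, cfg, err)
}
```

An empty configPath would then mean "use the default configuration".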

Configuration File Structure

# Cadence connection settings
domain: "default"
service: "cadence-frontend"
host: "localhost:7833"

# Prometheus configuration
prometheus:
  listenAddress: "127.0.0.1:8004"

# Autoscaling configuration
autoscaling:
  # Worker autoscaling settings
  pollerMinCount: 2
  pollerMaxCount: 8
  pollerInitCount: 4
  
  # Load generation settings
  loadGeneration:
    # Workflow-level settings
    workflows: 10             # Number of workflows to start
    workflowDelay: 1000       # Delay between starting workflows (milliseconds)
    
    # Activity-level settings (per workflow)
    activitiesPerWorkflow: 30 # Number of activities per workflow
    batchDelay: 2000          # Delay between activity batches within workflow (milliseconds)
    
    # Activity processing time range (milliseconds)
    minProcessingTime: 1000
    maxProcessingTime: 6000

Configuration Usage

The configuration values are used throughout the sample:

Worker Configuration (worker_config.go):
- pollerMinCount, pollerMaxCount, pollerInitCount → AutoScalerOptions
Workflow Configuration (workflow.go):
- activitiesPerWorkflow → Number of activities to execute per workflow
- batchDelay → Delay between activity batches within workflow
Activity Configuration (activities.go):
- minProcessingTime, maxProcessingTime → Activity processing time range
Prometheus Configuration (integrated):
- listenAddress → Address and port of the metrics endpoint (default: 127.0.0.1:8004)

Default Configuration

If no configuration file is provided or if the file cannot be read, the sample uses these defaults:

domain: "default"
service: "cadence-frontend"
host: "localhost:7833"
prometheus:
  listenAddress: "127.0.0.1:8004"
autoscaling:
  pollerMinCount: 2
  pollerMaxCount: 8
  pollerInitCount: 4
  loadGeneration:
    workflows: 10
    workflowDelay: 1000
    activitiesPerWorkflow: 30
    batchDelay: 2000
    minProcessingTime: 1000
    maxProcessingTime: 6000
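The fallback behavior described above can be sketched as follows. Config, defaultConfig, and loadConfig are hypothetical names, the struct is flattened and omits the loadGeneration block for brevity, and the decode callback stands in for whatever YAML decoder the sample uses. This is an illustration of the fallback logic only, not the sample's actual loader.

```go
package main

import (
	"fmt"
	"os"
)

// Config is a flattened, hypothetical view of the settings shown above.
type Config struct {
	Domain          string
	Service         string
	Host            string
	ListenAddress   string
	PollerMinCount  int
	PollerMaxCount  int
	PollerInitCount int
}

// defaultConfig returns the defaults listed in this README.
func defaultConfig() Config {
	return Config{
		Domain:          "default",
		Service:         "cadence-frontend",
		Host:            "localhost:7833",
		ListenAddress:   "127.0.0.1:8004",
		PollerMinCount:  2,
		PollerMaxCount:  8,
		PollerInitCount: 4,
	}
}

// loadConfig falls back to the defaults when no path is given or the file
// cannot be read. decode is a stand-in for a YAML decoder (an assumption).
func loadConfig(path string, decode func([]byte, *Config) error) (Config, error) {
	cfg := defaultConfig()
	if path == "" {
		return cfg, nil
	}
	data, err := os.ReadFile(path)
	if err != nil {
		return cfg, nil // unreadable file: keep the defaults
	}
	if err := decode(data, &cfg); err != nil {
		return defaultConfig(), err
	}
	return cfg, nil
}

func main() {
	noop := func([]byte, *Config) error { return nil }
	cfg, err := loadConfig("", noop)
	fmt.Println(cfg.Host, cfg.PollerInitCount, err)
}
```

Starting from the defaults and decoding on top of them means that keys missing from a partial file keep their default values.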

Load Pattern Examples

The sample supports various load patterns for testing autoscaling behavior:

1. Gradual Ramp-up (Default)

loadGeneration:
  workflows: 10
  workflowDelay: 1000
  activitiesPerWorkflow: 30

Result: 10 workflows starting 1 second apart, each with 30 activities (300 total activities)

2. Burst Load

loadGeneration:
  workflows: 25
  workflowDelay: 0
  activitiesPerWorkflow: 60

Result: 25 workflows all starting immediately (1500 total activities)

3. Sustained Load

loadGeneration:
  workflows: 50
  workflowDelay: 2000
  activitiesPerWorkflow: 100

Result: 50 long-running workflows starting 2 seconds apart (5000 total activities)

4. Light Load

loadGeneration:
  workflows: 1
  workflowDelay: 0
  activitiesPerWorkflow: 20

Result: Single workflow with 20 activities for minimal load testing

Monitoring

Metrics Endpoints

Prometheus Metrics: http://127.0.0.1:8004/metrics
- Exposed automatically, in worker mode only
- Real-time autoscaling and worker performance metrics
- Prometheus-compatible format with sanitized names
- Note: Metrics server is not started in trigger mode

Grafana Dashboard

Access the Cadence client dashboard at: http://localhost:3000/d/dehkspwgabvuoc/cadence-client

Key Metrics to Monitor

Worker Performance Metrics:
- cadence_worker_decision_poll_success_count - Successful decision task polls
- cadence_worker_activity_poll_success_count - Successful activity task polls
- cadence_worker_decision_poll_count - Total decision task poll attempts
- cadence_worker_activity_poll_count - Total activity task poll attempts
Autoscaling Behavior Metrics:
- cadence_worker_poller_count - Number of active poller goroutines (key autoscaling indicator)
- cadence_concurrency_auto_scaler_poller_quota - Current poller quota for autoscaling
- cadence_concurrency_auto_scaler_poller_wait_time - Time pollers wait for tasks
- cadence_concurrency_auto_scaler_scale_up_count - Number of scale-up events
- cadence_concurrency_auto_scaler_scale_down_count - Number of scale-down events
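One way to read the poll counters listed above is as a success ratio: successful polls divided by poll attempts. The sketch below derives that ratio from Prometheus text output. The helper names are hypothetical, the numbers in main are made up for the demonstration, and the parser assumes samples carry no trailing timestamp.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// sumMetric adds up the values of all samples named name, across label sets.
// It assumes each sample line ends with the value (no timestamp).
func sumMetric(text, name string) float64 {
	var total float64
	for _, line := range strings.Split(text, "\n") {
		line = strings.TrimSpace(line)
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		fields := strings.Fields(line)
		if len(fields) < 2 {
			continue
		}
		metric := fields[0]
		if i := strings.IndexByte(metric, '{'); i >= 0 {
			metric = metric[:i] // strip the label set
		}
		if metric != name {
			continue
		}
		if v, err := strconv.ParseFloat(fields[len(fields)-1], 64); err == nil {
			total += v
		}
	}
	return total
}

// pollSuccessRatio returns successful polls / poll attempts for kind
// ("decision" or "activity"); ok is false when there were no attempts.
func pollSuccessRatio(text, kind string) (ratio float64, ok bool) {
	attempts := sumMetric(text, "cadence_worker_"+kind+"_poll_count")
	if attempts == 0 {
		return 0, false
	}
	return sumMetric(text, "cadence_worker_"+kind+"_poll_success_count") / attempts, true
}

func main() {
	// Made-up scrape output, for illustration only.
	scrape := `cadence_worker_activity_poll_count{domain="default"} 40
cadence_worker_activity_poll_success_count{domain="default"} 30`
	ratio, ok := pollSuccessRatio(scrape, "activity")
	fmt.Println(ratio, ok)
}
```

A persistently low ratio suggests that pollers are mostly returning empty, which is the situation in which a scale-down would be expected.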

How It Works

Load Generation

The sample starts multiple workflows, each of which executes activities in parallel. For each workflow:

Workflow starts are spaced by a configurable delay (workflowDelay) to create sustained load patterns
A configurable number of activities (activitiesPerWorkflow) is executed
Each activity takes 1-6 seconds to complete by default (configurable via minProcessingTime/maxProcessingTime)
Metrics about execution time are recorded
Configurable batch delays (batchDelay) within the workflow vary the load

Autoscaling Demonstration

The worker uses worker.NewV2 with AutoScalerOptions to:

Start with configurable poller goroutines (pollerInitCount)
Scale down to minimum pollers (pollerMinCount) when load is low
Scale up to maximum pollers (pollerMaxCount) when load is high
Automatically adjust based on task queue depth and processing time

Metrics Collection

The sample uses Tally with a Prometheus reporter for comprehensive metrics:

Real-time autoscaling metrics - Poller count changes, quota adjustments, scale events
Worker performance metrics - Task processing rates, poller utilization, queue depths
Standard Cadence metrics - All metrics automatically emitted by the Cadence Go client
Sanitized metric names - Prometheus-compatible format with proper character replacement
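Character replacement for metric names can be illustrated with a small function. Prometheus metric names must match [a-zA-Z_:][a-zA-Z0-9_:]*, so anything else is replaced with an underscore. This sketch is a stand-in written for illustration; the sanitizer actually used by the Tally Prometheus reporter may apply different rules.

```go
package main

import (
	"fmt"
	"strings"
)

// sanitizeMetricName maps name onto the Prometheus metric-name alphabet.
// Disallowed characters become '_', and a leading digit is prefixed with '_'.
func sanitizeMetricName(name string) string {
	var b strings.Builder
	for i, r := range name {
		switch {
		case r >= 'a' && r <= 'z', r >= 'A' && r <= 'Z', r == '_', r == ':':
			b.WriteRune(r)
		case r >= '0' && r <= '9':
			if i == 0 {
				b.WriteByte('_') // names may not start with a digit
			}
			b.WriteRune(r)
		default:
			b.WriteByte('_')
		}
	}
	return b.String()
}

func main() {
	for _, n := range []string{"cadence-worker.poller-count", "task list/size", "9lives"} {
		fmt.Printf("%q -> %q\n", n, sanitizeMetricName(n))
	}
}
```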

Production Considerations

Scaling

Adjust pollerMinCount, pollerMaxCount, and pollerInitCount based on your workload
Monitor worker performance and adjust autoscaling parameters
Use multiple worker instances for high availability

Monitoring

Configure Prometheus to scrape metrics regularly (the latest version of the Cadence server is already configured to do this)
Set up alerts for worker performance issues
Use Grafana dashboards to visualize autoscaling behavior
Monitor poller count changes to verify autoscaling is working

Security

Secure the Prometheus endpoint in production
Use authentication for metrics access
Consider using HTTPS for metrics endpoints

Testing

The sample includes unit tests for the configuration loading functionality. Run them whenever you change the configuration code.

Running Tests

# Run all tests
go test -v

# Run specific test
go test -v -run TestLoadConfiguration_SuccessfulLoading

# Run tests with coverage
go test -v -cover

Test Coverage

The tests cover:

Successful configuration loading - Complete YAML files with all fields
Missing file fallback - Graceful handling when config file doesn't exist
Default value application - Ensuring all fields have sensible defaults

Configuration Testing

The tests validate that the improved configuration system:

Handles embedded struct issues properly
Applies defaults correctly for missing fields
Provides clear error messages for configuration problems
Maintains backward compatibility

Troubleshooting

Common Issues

Worker Not Starting:
- Check Cadence server is running
- Verify domain exists
- Check configuration file
- Ensure you are using a compatible Cadence client version
Autoscaling Not Working:
- Verify worker.NewV2 is being used
- Check AutoScalerOptions.Enabled is true
- Monitor poller count changes in logs
- Ensure sufficient load is being generated
Configuration Issues:
- Verify configuration file path is correct
- Check YAML syntax in configuration file
- Review default values if config file is not found
Metrics Not Appearing:
- Verify worker is running (metrics are exposed automatically)
- Check metrics endpoint is accessible: http://127.0.0.1:8004/metrics
- Ensure Prometheus is configured to scrape the endpoint
- Check for metric name sanitization issues
Dashboard Not Loading:
- Verify Grafana is running
- Check dashboard URL is correct
- Ensure Prometheus data source is configured
