# Autoscaling Monitoring Sample
This sample demonstrates three advanced Cadence worker features:

- **Worker Poller Autoscaling** - Dynamic adjustment of worker poller goroutines based on workload
- **Integrated Prometheus Metrics** - Real-time metrics collection using Tally with a Prometheus reporter
- **Autoscaling Metrics** - Comprehensive autoscaling behavior metrics exposed via an HTTP endpoint
## Features

### Worker Poller Autoscaling

The worker uses `worker.NewV2` with `AutoScalerOptions` to enable true autoscaling behavior:

- `AutoScalerOptions.Enabled: true` - Enables the autoscaling feature
- `PollerMinCount: 2` - Minimum number of poller goroutines
- `PollerMaxCount: 8` - Maximum number of poller goroutines
- `PollerInitCount: 4` - Initial number of poller goroutines

The worker automatically adjusts the number of poller goroutines between the minimum and maximum based on the current workload.
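The exact scaling algorithm lives inside the Cadence Go client; as a rough mental model only (not the client's actual logic, which uses its own internal signals such as poll wait times and quotas), the poller count moves toward demand and is clamped to the configured bounds:

```go
package main

import "fmt"

// nextPollerCount is a simplified model of poller autoscaling: move the
// current count one step toward the observed backlog, then clamp it to
// the configured [min, max] range. Illustrative only — the real Cadence
// client's algorithm is more sophisticated.
func nextPollerCount(current, backlog, min, max int) int {
	target := current
	if backlog > current {
		target = current + 1 // scale up one poller at a time
	} else if backlog < current {
		target = current - 1 // scale down when demand drops
	}
	if target < min {
		target = min
	}
	if target > max {
		target = max
	}
	return target
}

func main() {
	// Start at PollerInitCount=4 with bounds [2, 8], as configured above.
	count := 4
	for _, backlog := range []int{10, 10, 10, 10, 10, 0, 0, 0} {
		count = nextPollerCount(count, backlog, 2, 8)
		fmt.Printf("backlog=%d pollers=%d\n", backlog, count)
	}
}
```

Under sustained backlog the count climbs to `PollerMaxCount` and stays there; when the backlog drains it falls back toward `PollerMinCount`.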
### Prometheus Metrics

The sample uses Tally with a Prometheus reporter to expose comprehensive metrics:

- **Real-time autoscaling metrics** - Poller count changes, quota adjustments, wait times
- **Worker performance metrics** - Task processing rates, poller utilization, queue depths
- **Standard Cadence metrics** - All metrics automatically emitted by the Cadence Go client
- **Sanitized metric names** - Prometheus-compatible metric names and labels
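Tally's Prometheus reporter performs the name sanitization internally; a minimal stdlib illustration of the idea (not Tally's actual implementation) replaces any character that is invalid in a Prometheus metric name with an underscore:

```go
package main

import (
	"fmt"
	"strings"
)

// sanitizeMetricName maps a metric name onto the Prometheus-legal
// character set [a-zA-Z_:][a-zA-Z0-9_:]*, replacing everything else
// (dots, dashes, spaces) with underscores. A sketch of the idea only.
func sanitizeMetricName(name string) string {
	var b strings.Builder
	for i, r := range name {
		valid := r == '_' || r == ':' ||
			(r >= 'a' && r <= 'z') || (r >= 'A' && r <= 'Z') ||
			(i > 0 && r >= '0' && r <= '9') // digits allowed except first char
		if valid {
			b.WriteRune(r)
		} else {
			b.WriteByte('_')
		}
	}
	return b.String()
}

func main() {
	fmt.Println(sanitizeMetricName("cadence.worker.poller-count"))
	// → cadence_worker_poller_count
}
```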
### Monitoring Dashboards

When running the Cadence server locally with Grafana, you can access the client dashboards at:

- Client Dashboards: http://localhost:3000/d/dehkspwgabvuoc/cadence-client

Note: Make sure to select a Domain in Grafana for the dashboards to display data. The dashboards will be empty until a domain is selected from the dropdown.
## Prerequisites

- **Cadence Server**: Running locally with Docker Compose
- **Prometheus**: Configured to scrape metrics from the sample
- **Grafana**: With Cadence dashboards, which are included in the default setup of recent Cadence server versions
## Quick Start

### 1. Start the Worker

```bash
./bin/autoscaling-monitoring -m worker
```

The worker automatically exposes metrics at http://127.0.0.1:8004/metrics.

### 2. Generate Load

```bash
./bin/autoscaling-monitoring -m trigger
```
## Configuration

The sample uses a custom configuration system that extends the base Cadence configuration. You can specify a configuration file using the `-config` flag:

```bash
./bin/autoscaling-monitoring -m worker -config /path/to/config.yaml
```
### Configuration File Structure

```yaml
# Cadence connection settings
domain: "default"
service: "cadence-frontend"
host: "localhost:7833"

# Prometheus configuration
prometheus:
  listenAddress: "127.0.0.1:8004"

# Autoscaling configuration
autoscaling:
  # Worker autoscaling settings
  pollerMinCount: 2
  pollerMaxCount: 8
  pollerInitCount: 4

# Load generation settings
loadGeneration:
  # Workflow-level settings
  workflows: 10              # Number of workflows to start
  workflowDelay: 1000        # Delay between starting workflows (milliseconds)

  # Activity-level settings (per workflow)
  activitiesPerWorkflow: 30  # Number of activities per workflow
  batchDelay: 2000           # Delay between activity batches within a workflow (milliseconds)

  # Activity processing time range (milliseconds)
  minProcessingTime: 1000
  maxProcessingTime: 6000
```
### Configuration Usage

The configuration values are used throughout the sample:

- **Worker Configuration** (`worker_config.go`): `pollerMinCount`, `pollerMaxCount`, `pollerInitCount` → `AutoScalerOptions`
- **Workflow Configuration** (`workflow.go`): `activitiesPerWorkflow` → number of activities to execute per workflow; `batchDelay` → delay between activity batches within a workflow
- **Activity Configuration** (`activities.go`): `minProcessingTime`, `maxProcessingTime` → activity processing time range
- **Prometheus Configuration** (integrated): `listenAddress` → metrics endpoint address (default: 127.0.0.1:8004)
### Default Configuration

If no configuration file is provided, or if the file cannot be read, the sample uses these defaults:

```yaml
domain: "default"
service: "cadence-frontend"
host: "localhost:7833"

prometheus:
  listenAddress: "127.0.0.1:8004"

autoscaling:
  pollerMinCount: 2
  pollerMaxCount: 8
  pollerInitCount: 4

loadGeneration:
  workflows: 10
  workflowDelay: 1000
  activitiesPerWorkflow: 30
  batchDelay: 2000
  minProcessingTime: 1000
  maxProcessingTime: 6000
```
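The fallback behavior can be sketched in Go. This is a hedged illustration of applying defaults to zero-valued fields, not the sample's actual `worker_config.go` (the struct and field names here are assumptions for the example):

```go
package main

import "fmt"

// Config mirrors a subset of the fields shown above; names illustrative.
type Config struct {
	Domain          string
	Host            string
	PollerMinCount  int
	PollerMaxCount  int
	PollerInitCount int
}

// applyDefaults fills in zero-valued fields, so a partial (or missing)
// YAML file still yields a usable configuration.
func applyDefaults(c Config) Config {
	if c.Domain == "" {
		c.Domain = "default"
	}
	if c.Host == "" {
		c.Host = "localhost:7833"
	}
	if c.PollerMinCount == 0 {
		c.PollerMinCount = 2
	}
	if c.PollerMaxCount == 0 {
		c.PollerMaxCount = 8
	}
	if c.PollerInitCount == 0 {
		c.PollerInitCount = 4
	}
	return c
}

func main() {
	// A file that only sets the host still gets full autoscaler defaults.
	cfg := applyDefaults(Config{Host: "localhost:7933"})
	fmt.Printf("%+v\n", cfg)
}
```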
## Load Pattern Examples

The sample supports various load patterns for testing autoscaling behavior.

### 1. Gradual Ramp-up (Default)

```yaml
loadGeneration:
  workflows: 10
  workflowDelay: 1000
  activitiesPerWorkflow: 30
```

Result: 10 workflows starting 1 second apart, each with 30 activities (300 total activities)
### 2. Burst Load

```yaml
loadGeneration:
  workflows: 25
  workflowDelay: 0
  activitiesPerWorkflow: 60
```

Result: 25 workflows all starting immediately (1500 total activities)
### 3. Sustained Load

```yaml
loadGeneration:
  workflows: 50
  workflowDelay: 2000
  activitiesPerWorkflow: 100
```

Result: 50 long-running workflows with 2-second delays between starts (5000 total activities)
### 4. Light Load

```yaml
loadGeneration:
  workflows: 1
  workflowDelay: 0
  activitiesPerWorkflow: 20
```

Result: A single workflow with 20 activities for minimal load testing
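The "Result" lines above follow from two simple formulas: total load is `workflows × activitiesPerWorkflow`, and the last workflow starts after `(workflows - 1) × workflowDelay` milliseconds. A small helper (illustrative, not part of the sample) makes it easy to check a pattern before running it:

```go
package main

import "fmt"

// totalActivities reproduces the arithmetic behind the "Result" lines:
// every workflow runs the same number of activities.
func totalActivities(workflows, activitiesPerWorkflow int) int {
	return workflows * activitiesPerWorkflow
}

// rampUpMillis is how long after the trigger starts the LAST workflow
// begins: (workflows - 1) gaps of workflowDelay milliseconds each.
func rampUpMillis(workflows, workflowDelayMs int) int {
	if workflows == 0 {
		return 0
	}
	return (workflows - 1) * workflowDelayMs
}

func main() {
	// Sustained Load pattern: 50 workflows, 2000 ms apart, 100 activities each.
	fmt.Println(totalActivities(50, 100)) // → 5000
	fmt.Println(rampUpMillis(50, 2000))   // → 98000
}
```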
## Monitoring

### Metrics Endpoints

- Prometheus Metrics: http://127.0.0.1:8004/metrics
  - Exposed automatically when running in worker mode only
  - Real-time autoscaling and worker performance metrics
  - Prometheus-compatible format with sanitized names
  - Note: the metrics server is not started in trigger mode
### Grafana Dashboard

Access the Cadence client dashboard at http://localhost:3000/d/dehkspwgabvuoc/cadence-client.
### Key Metrics to Monitor

**Worker Performance Metrics:**

- `cadence_worker_decision_poll_success_count` - Successful decision task polls
- `cadence_worker_activity_poll_success_count` - Successful activity task polls
- `cadence_worker_decision_poll_count` - Total decision task poll attempts
- `cadence_worker_activity_poll_count` - Total activity task poll attempts

**Autoscaling Behavior Metrics:**

- `cadence_worker_poller_count` - Number of active poller goroutines (key autoscaling indicator)
- `cadence_concurrency_auto_scaler_poller_quota` - Current poller quota for autoscaling
- `cadence_concurrency_auto_scaler_poller_wait_time` - Time pollers wait for tasks
- `cadence_concurrency_auto_scaler_scale_up_count` - Number of scale-up events
- `cadence_concurrency_auto_scaler_scale_down_count` - Number of scale-down events
## How It Works

### Load Generation

The sample creates multiple workflows that execute activities in parallel, with each workflow:

- Starting with configurable delays (`workflowDelay`) to create sustained load patterns
- Executing a configurable number of activities (`activitiesPerWorkflow`) per workflow
- Running activities that each take 1-6 seconds to complete (configurable via `minProcessingTime`/`maxProcessingTime`)
- Recording metrics about execution time
- Creating varying load patterns with configurable batch delays (`batchDelay`) within each workflow
### Autoscaling Demonstration

The worker uses `worker.NewV2` with `AutoScalerOptions` to:

- Start with a configurable number of poller goroutines (`pollerInitCount`)
- Scale down to the minimum poller count (`pollerMinCount`) when load is low
- Scale up to the maximum poller count (`pollerMaxCount`) when load is high
- Automatically adjust based on task queue depth and processing time
### Metrics Collection

The sample uses Tally with a Prometheus reporter for comprehensive metrics:

- **Real-time autoscaling metrics** - Poller count changes, quota adjustments, scale events
- **Worker performance metrics** - Task processing rates, poller utilization, queue depths
- **Standard Cadence metrics** - All metrics automatically emitted by the Cadence Go client
- **Sanitized metric names** - Prometheus-compatible format with proper character replacement
## Production Considerations

### Scaling

- Adjust `pollerMinCount`, `pollerMaxCount`, and `pollerInitCount` based on your workload
- Monitor worker performance and tune the autoscaling parameters accordingly
- Use multiple worker instances for high availability
### Monitoring

- Configure Prometheus to scrape metrics regularly (recent versions of the Cadence server setup are configured to do this)
- Set up alerts for worker performance issues
- Use Grafana dashboards to visualize autoscaling behavior
- Monitor poller count changes to verify autoscaling is working
### Security

- Secure the Prometheus endpoint in production
- Use authentication for metrics access
- Consider using HTTPS for metrics endpoints
## Testing

The sample includes unit tests for the configuration-loading functionality. Run these tests if you make any changes to the config:

### Running Tests

```bash
# Run all tests
go test -v

# Run a specific test
go test -v -run TestLoadConfiguration_SuccessfulLoading

# Run tests with coverage
go test -v -cover
```
### Test Coverage

The tests cover:

- **Successful configuration loading** - Complete YAML files with all fields
- **Missing file fallback** - Graceful handling when the config file doesn't exist
- **Default value application** - Ensuring all fields have sensible defaults
### Configuration Testing

The tests validate that the configuration system:

- Handles embedded structs properly
- Applies defaults correctly for missing fields
- Provides clear error messages for configuration problems
- Maintains backward compatibility
## Troubleshooting

### Common Issues

1. **Worker Not Starting**
   - Check that the Cadence server is running
   - Verify the domain exists
   - Check the configuration file
   - Ensure a compatible Cadence client version is in use

2. **Autoscaling Not Working**
   - Verify `worker.NewV2` is being used
   - Check that `AutoScalerOptions.Enabled` is true
   - Monitor poller count changes in the logs
   - Ensure sufficient load is being generated

3. **Configuration Issues**
   - Verify the configuration file path is correct
   - Check the YAML syntax in the configuration file
   - Review the default values if the config file is not found

4. **Metrics Not Appearing**
   - Verify the worker is running (metrics are exposed automatically)
   - Check that the metrics endpoint is accessible: http://127.0.0.1:8004/metrics
   - Ensure Prometheus is configured to scrape the endpoint
   - Check for metric name sanitization issues

5. **Dashboard Not Loading**
   - Verify Grafana is running
   - Check that the dashboard URL is correct
   - Ensure the Prometheus data source is configured