# ServerStatus

Yet another ServerStatus backend, using Prometheus as the data source.
## Quick Start

### Scraping
First, set up node-exporter on each target host, and Prometheus (or any Prometheus-compatible scraper such as vmagent) on the host you want to scrape metrics from.

Since the region, location, and virtualization type of a target host cannot be derived from the exported metrics, you should set these attributes as labels in the Prometheus scrape config:
```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: node
    scrape_interval: 15s
    scrape_timeout: 15s
    static_configs:
      - targets: ['1.1.1.1:9100']
        labels:
          hostname: "host-a"
          virt_type: "kvm"
          region: "FR"
          location: "Paris"
      - targets: ['2.2.2.2:9100']
        labels:
          hostname: "host-b"
          virt_type: "kvm"
          region: "JP"
          location: "Osaka"
    metric_relabel_configs:
      - action: labeldrop
        regex: (region|location)
```
We recommend adding `hostname` to the target labels to identify hosts, but the auto-generated `instance` label can also be used; the displayed hostname can be overridden later in the ServerStatus configuration.

The `labeldrop` action drops the listed labels from scraped metrics before they are stored, while metrics auto-generated by Prometheus, such as `up`, keep them. It is not an elegant approach, but it works well for our purpose: keeping this extra information at a lower storage cost.
For example, the auto-generated metrics keep all the labels:

```
up{hostname="host-a", instance="1.1.1.1:9100", job="node", location="Paris", region="FR", virt_type="kvm"} 1
```

while the metrics scraped from node-exporter have them dropped:

```
node_boot_time_seconds{hostname="host-a", instance="1.1.1.1:9100", job="node"} 3333
```
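With `hostname` kept on every series, a single host's metrics can be selected by that label, in the Prometheus UI or by the backend. As an illustrative sketch (this is a generic query over node-exporter metrics, not necessarily one this backend issues):

```promql
# Approximate CPU usage (%) for host-a over the last 5 minutes:
# 1 minus the average idle fraction across all cores, scaled to percent.
100 * (1 - avg(rate(node_cpu_seconds_total{hostname="host-a", mode="idle"}[5m])))
```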
### Querying
```jsonc
{
    "version": 1,
    "listen": "127.0.0.1:30000",
    "refresh_interval": 120,           # configuration refresh interval; the node list is reloaded automatically
    "scrape_interval": 5,              # how often we query the prometheus data source
    "log_path": "/path/to/logdir",
    "nodes": {
        "default_data_source": "prometheus_name",
        "id_label": "hostname",        # label name used to identify a host
        "mode": "AUTO",                # AUTO or STATIC; AUTO discovers hosts from queries, STATIC uses the list below
        "network_overwrites": {        # pre-aggregated metrics used to calculate the total network traffic
            "enable": true,
            "rx": "node_network_receive_bytes_total:30m_inc",
            "tx": "node_network_transmit_bytes_total:30m_inc",
            "align": "30m"
        },
        "list": [
            {
                "hostname": "host-a",
                "overwrites": {
                    "hostname": "DisplayNameForHostA",
                    "net_devices": ["eth4", "pppoe0"]
                }
            },
            {
                "hostname": "host-b",
                "overwrites": {
                    "hostname": "DisplayNameForHostB",
                    "net_devices": ["eth3", "eth4", "pppoe0", "pppoe1"],
                    "billing_date": "2023-09-15T00:00:00+08:00"  # network traffic resets on this day and hour of each month
                }
            }
        ],
        "global_matcher": [
            { "label": "job", "op": "=", "value": "node" }
        ]
    },
    "data_sources": [
        {
            "type": "prometheus",
            "name": "prometheus_name",
            "url": "https://127.0.0.1:9090"
        }
    ]
}
```
If you did not add `hostname` in the Prometheus configuration, you can set `id_label` to `instance` here, fill in `hostname` as `1.1.1.1:9100` or `2.2.2.2:9100` in the list, and put the replacement display name in the `overwrites` section.
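A minimal sketch of the relevant fields under that setup (other fields omitted; the display name is an assumed example):

```jsonc
{
    "nodes": {
        "id_label": "instance",              # identify hosts by the auto-generated instance label
        "list": [
            {
                "hostname": "1.1.1.1:9100",  # must match the instance label value
                "overwrites": {
                    "hostname": "DisplayNameForHostA"  # name actually shown in the UI
                }
            }
        ]
    }
}
```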
### Pre-aggregated network metrics
At the end of a month or billing cycle, calculating the total network traffic usage can become a time-consuming query. To mitigate this, you can use recording rules in vmalert or Prometheus to pre-aggregate the network traffic metrics. For instance, a rule that aggregates the increase in traffic over a 30-minute range reduces the number of data points to 1/120th of the original if your scrape interval is 15 seconds.

You can enable this feature in the `network_overwrites` section. Please refer to ./doc/vm/vmalert_rule.yml for instructions on how to add recording rules.
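As a sketch of what such a rule looks like (the authoritative version lives in ./doc/vm/vmalert_rule.yml; the record names below are the ones referenced by the `network_overwrites` example above, and the group name is an assumption):

```yaml
groups:
  - name: serverstatus-network   # assumed group name
    interval: 30m
    rules:
      # Pre-aggregate the 30-minute increase of the network counters so that
      # monthly traffic totals can be summed from far fewer data points.
      - record: node_network_receive_bytes_total:30m_inc
        expr: increase(node_network_receive_bytes_total[30m])
      - record: node_network_transmit_bytes_total:30m_inc
        expr: increase(node_network_transmit_bytes_total[30m])
```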
### systemd and reverse proxy config

Please refer to ./doc/ for details.