Statusgraph

A status page for your distributed system.

TLDR;

Try the UI (without colors):

$ docker run -it -p 8000:8000 quay.io/moolen/statusgraph:0.1.0 server

Overview

This is a webapp that let's you visualize your system: create nodes and edges to draw your system architecture and signify dependencies. Annotate your services with Metrics and Alerts via Prometheus and Alertmanager.

Conceptually, you want to know if your service is "running", i.e. it is in a binary state: red lamp vs. green lamp. This question is incredibly hard to answer. Statusgraph taks this approach: you define alerts via Prometheus which indicate a red/yellow lamp (service is dead / not available / has issues ..). Additionally, you can map metrics

Alert Example:

- alert: service_down
    expr: up == 0
    labels:
      severity: critical
      service_id: "{{ $labels.service_id }}" # this is known at alert-time
    annotations:
      description: Service {{ $labels.instance }} is unavailable.
      runbook: "http://example.com/foobar"

Requirements

alertmanager v0.20.0 and above
prometheus

use-cases

You can visualize many different aspects of your environment.

10.000ft view of your distributed system
self-contained system of a single team (a bunch of services, databases)
network aspects: CDN, DNS & Edge services
end-user view: edge services, blackbox tests
Data engineering pipeline: visualize DAGs / ETL Metrics

Components

Server

communicates with prometheus to map metrics to a particular service (think: availability, error rate)
asks alertmanager for active alerts

Server Configuration

contains the configuration for upstream
contains the mapping for alerts and metrics

upstream:
  prometheus:
    url: http://localhost:9090
  alertmanager:
    url: http://localhost:9093

mapping:
  # this defines how we select alerts to display
  # use a `labelSelector` to filter
  # and `map` to specify the lookup key in the alert struct
  alerts:
    label_selector:
      - severity: "critical"
      - severity: "warning"
        important: "true"

    # red & green lamp indicator
    # Use this if your alerts use a specific label for a service (e.g. app=frontend / app=backend ...)
    # this tells statusgraph to map alerts to nodes using the following labels/annotations
    service_labels:
      - "service_id"
    service_annotations:
      - "statusgraph-node"

  metrics:

    # green lamp indicator!
    # this helps statusgraph to find all existing services by fetching the label values
    # reference: https://prometheus.io/docs/prometheus/latest/querying/api/#querying-label-values
    service_labels:
      - 'service_id'

    queries:
      # just as an example
      - name: cpu wait
        query: sum(rate(node_pressure_cpu_waiting_seconds_total[1m])) by (service_id) * 100
        service_label: service_id

Roadmap

graph import & streaming

i want to import the graph configuration from different file formats (plantuml, dot..)
right now the graph configuration is static. This works for a logical representation. But computing environments are very dynamic, so i want to stream the graph configuration via an API
- do we need a hybrid approach? (cluster per dynamic-api AND static config?)
- which upstream API to spike? How do we determine the edges? kubernetes/$CLOUD?
- can we use traces (L3/4: tcp/udp/ip via eBPF, L7 via opentracing?) to determine the nodes and edges?

further customization

as a user i want to cross-reference other services (e.g. grafana) from the tooltip (e.g. link to dashboard, runbook etc.)

TODO

add direction arrow to edge
highlight adjacent nodes & edges
graph-config library
- implement config library with shapes, consider using draw.io shapes (AWS/GCP..)
Misc. optimizations
- metrics & alerts caching
- decouple client and upstream requests

Developing

Run Server

$ make binary
$ ./bin/statusgraph server --config ./config.yaml

Run Test Infra

$ cd hack
$ docker-compose up  -d

# test failure
$ docker-compose stop cart.svc

Run Client

$ cd client; npm install; npm run watch

You can access prometheus via localhost:9090, alertmanager via localhost:9093 and the backend (which serves the SPA too) via localhost:8000.

# README