MQ Exporter for Prometheus monitoring

This directory contains the code for a monitoring solution that exports queue manager data to a Prometheus data collection system. It also contains configuration files to run the monitor program

The monitor collects metrics published by an MQ V9 queue manager or the MQ appliance. Prometheus than calls the monitor program at regular intervals to pull those metrics into its database, where they can then be queried directly or used by other packages such as Grafana.

You can see data such as disk or CPU usage, queue depths, and MQI call counts. Channel status is also reported.

Example Grafana dashboards are included, to show how queries might be constructed. To use the dashboard, create a data source in Grafana called "MQ Prometheus" that points at your database server, and then import the JSON file. This dashboard was built using Grafana v5.3.1

There is also a script to start the collector so that it processes the statistics generated by the MQ Bridge for Salesforce, included from MQ V9.0.2

Building

You need to have the MQ client libraries installed first.
Set up an environment for compiling Go programs

  export GOPATH=~/go (or wherever you want to put it)
  export GOROOT=/usr/lib/golang  (or wherever you have installed it)
  mkdir -p $GOPATH/src
  cd $GOPATH/src

Clone this GitHub repository for the monitoring programs into your GOPATH. The repository contains the prereq packages at a suitable version in the vendor tree

  git clone https://github.com/ibm-messaging/mq-metric-samples ibm-messaging/mq-metric-samples

From the root of your GOPATH you can then compile the code

  cd $GOPATH
  export CGO_LDFLAGS_ALLOW='-Wl,-rpath.*'
  go build -o bin/mq_prometheus src/ibm-messaging/mq-metric-samples/cmd/mq_prometheus/*.go

Configuring MQ

It is convenient to run the monitor program as a queue manager service whenever possible.

This directory contains an MQSC script to define the service. In fact, the service definition points at a simple script which sets up any necessary environment and builds the command line parameters for the real monitor program. As the last line of the script is "exec", the process id of the script is inherited by the monitor program, and the queue manager can then check on the status, and can drive a suitable STOP SERVICE operation during queue manager shutdown.

Edit the MQSC script and the shell script to point at appropriate directories where the program exists, and where you want to put stdout/stderr. Ensure that the ID running the queue manager has permission to access the programs and output files.

If you cannot run the monitor as a service, for example when trying to monitor the MQ Appliance which does not support service definitions, or monitoring another platform without Go support, then you can run as an MQ client connecting remotely. Setting the ibmmq.client property to true forces client connections. Then all the usual MQ configuration comes into play (MQSERVER environment variable, use of CCDT files etc).

The monitor listens for calls from Prometheus on a TCP port. The default port, reserved for MQ's use in the Prometheus list, is 9157. If you want to use a different number, then use the -ibmmq.httpListenPort command parameter.

The monitor always collects all of the available queue manager-wide metrics. It can also be configured to collect statistics for specific sets of queues. The sets of queues can be given either directly on the command line with the -ibmmq.monitoredQueues flag, or put into a separate file which is also named on the command line, with the -ibmmq.monitoredQueuesFile flag. An example is included in the startup shell script.

Wildcard Patterns for Queues

The ibmmq.monitoredQueues parameter can include both positive and negative wildcards. For example ibmmq.monitoredQueues=A*,!AB*" will collect data on queues beginning with "AC" or "AD" but not "AB". The full rules for expansion can be seen near the bottom of the discover.go module in the mqmetric package.

The queue patterns are expanded at startup of the program and at regular intervals thereafter. So newly-defined queues will eventually be monitored if they match the pattern. The rediscovery interval is 1h by default, but can be modified by the rediscoverInterval parameter.

Channel Status

The monitor program can now process channel status, reporting that back into Prometheus.

The channels to be monitored are set on the command line, similarly to the queue patterns, with -ibmmq.monitoredChannels or -ibmmq.monitoredChannelFile. Unlike the queue monitoring, wildcards are handled automatically by the channel status API. So you do not need to restart this monitor in order to pick up newly-defined channels that match an existing pattern.

Another command line parameter is pollInterval. This determines how frequently the channel status is collected. You may want to have it collected at a different rate to the queue data, as it may be more expensive to extract the channel status. The default pollInterval is 0, which means that the channel status is collected every time Prometheus asks for the queue and queue manager statistics. Setting it to 1m means that a minimum time of one minute will elapse between asking for channel status even if the queue statistics are gathered more frequently.

A short-lived channel that connects and then disconnects in between collection intervals will leave no trace in the status or metrics.

Channel Metrics

A few of the responses from the DISPLAY CHSTATUS command have been selected as metrics. The key values returned are the status and number of messages processed.

The message count for SVRCONN channels is the number of MQI calls made by the client program.

There are actually two versions of the channel status returned. The channel_status metric has the value corresponding to one of the MQCHS_* values. There are about 15 of these possible values. There is also a channel_status_squash metric which returns one of only three values, compressing the full set into a simpler value that is easier to put colours against in Grafana. From this squashed set, you can readily see if a channel is stopped, running, or somewhere in between.

Channel Instances and Labels

Channel metrics are given labels to assist in distinguishing them. These can be displayed in Grafana or used as part of the filtering. When there is more than one instance of an active channel, the combination of channel name, connection name and job name will be unique (though see the z/OS section below for caveats on that platform).

The channel type (SENDER, SVRCONN etc) and the name of the remote queue manager are also given as labels on the metric.

Channel Dashboard Panels

The example Grafana dashboard shows how these labels and metrics can be combined to show some channel status. The Channel Status table panel demonstrates a couple of features. It uses the labels to select unique instances of channels. It also uses a simple number-to-text map to show the channel status as a word (and colour the cell) instead of a raw number.

The metrics for the table are selected and have '0' added to them. This may be a workround of a Grafana bug, or it may really be how Grafana is designed to work. But without that '+0' on the metric line, the table was showing multiple versions of the status for each channel. This table combines multiple metrics on the same line now.

Authentication

This monitor can be configured to authenticate to the queue manager, sending a userid and password.

The userid is configured using the -ibmmq.userid flag. The password can be set either by using the -ibmmq.password flag, or by passing it via stdin. That allows it to be piped from an external stash file or some other mechanism. Command line flags for controlling passwords are not recommended!

Configuring Prometheus

The Prometheus server has to know how to contact the MQ monitor. The simplest way is just to add a reference to the monitor in the server's configuration file. For example, by adding this block to /etc/prometheus/prometheus.yml.

  # Adding a reference to an MQ monitor. All we have to do is
  # name the host and port on which the monitor is listening.
  # Port 9157 is the reserved default port for this monitor.
  - job_name: 'ibmmq'
          scrape_interval: 15s

    static_configs:
          - targets: ['hostname.example.com:9157']

The server documentation has information on more complex options, including the ability to pull information on which hosts should be monitored from a variety of discovery tools.

Metrics

Once the monitor program has been started, and Prometheus refreshed to connect to it, you will see metrics being available in the prometheus console. All of the metrics are given the a prefix which by default is ibmmq. The name can be configured with the -namespace option on the command line.

More information on the metrics collected through the publish/subscribe interface can be found in the MQ KnowledgeCenter with further description in an MQDev blog entry

The queue and queue manager metrics shown in the Prometheus console are named after the descriptions that you can see when running the amqsrua sample program, but with some minor modifications to match the required style.

The channel metrics all begin with channel.

z/OS Support

Because the DIS QSTATUS and DIS CHSTATUS commands can be used on z/OS, the Prometheus monitor can now support showing some limited information from a z/OS queue manager. There is nothing special needed to configure it, beyond the client connectivity that allows an application to connect to the z/OS system.

The -ibmmq.useStatus parameter must be set to true to use the DIS QSTATUS command.

There is also support for using the RESET QSTATS command on z/OS. This needs to be explicitly enabled by setting the -ibmmq.resetQStats flag to true. While this option allows tracking of the number of messages put/got to a queue (which is otherwise unavailable from z/OS queue manager status queries), it should not be used if there are any other active monitoring solutions that are already using that command. Only one monitor program can reliably use RESET QSTATS on a particular queue manager, to avoid the information being split between them.

Channel Metrics on z/OS

On z/OS, there is no guaranteed way to distinguish between multiple instances of the same channel name. For example, multiple users of the same SVRCONN definition. On Distributed platforms, the JOBNAME attribute does that job; for z/OS, the channel start date/time is used in this package as a discriminator, and used as the jobname label in the metrics. That may cause the stats to be slightly wrong if two instances of the same channel are started simultaneously from the same remote address. The sample dashboard showing z/OS status includes counts of the unique channels seen over the monitoring period.

# README