AIX metrics in Prometheus with N(J)MON

Ok. What? Why?

You can skip my intro and go straight to the "Technical solution" section.

Yes, let's start with that. AIX was born before I was: 1986. How is that relevant, you might ask? Well, it turns out people still use it, and AIX is actually still actively used in thousands of computer labs.

As with most things, even though we run on (sorry to those hurt by this term) "old-school" stuff, most people also want to use some new technologies or features, such as monitoring.

And this is where Nigel Griffiths steps in with nmon.

nmon is short for Nigel's performance Monitor for Linux on POWER, x86, x86_64, Mainframe & now ARM (Raspberry Pi)

http://nmon.sourceforge.net/pmwiki.php?n=Main.HomePage

Kudos to Nigel (aka Mr nmon) for his work and support on this project.

Note: I believe I should write nmon, but for readability I used NMON.

About NMON, NIMON and NJMON

Before I go into further detail, just a heads-up regarding the different versions. I've already mentioned NMON. NIMON collects the same performance stats, but sends them straight to InfluxDB in Line Protocol format. NJMON again collects the same stats, but outputs them as JSON.
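To make the difference between the two output formats a bit more tangible, here is a minimal Python sketch that renders one and the same sample both ways. The measurement, tag and field names are made up for illustration, and the JSON layout is deliberately simplified; it is not the actual schema that NJMON emits.

```python
import json

# Hypothetical sample: these names are invented, not real NIMON/NJMON output.
sample = {
    "measurement": "cpu_total",
    "tags": {"host": "aix01"},
    "fields": {"user": 12.5, "sys": 3.0},
    "timestamp_ns": 1588000000000000000,
}


def to_line_protocol(s):
    """Render a sample as InfluxDB Line Protocol: measurement,tags fields timestamp."""
    tags = ",".join(f"{k}={v}" for k, v in sorted(s["tags"].items()))
    fields = ",".join(f"{k}={v}" for k, v in sorted(s["fields"].items()))
    return f'{s["measurement"]},{tags} {fields} {s["timestamp_ns"]}'


def to_json(s):
    """Render the same sample as a (simplified, assumed) JSON document."""
    doc = {s["measurement"]: s["fields"], "host": s["tags"]["host"]}
    return json.dumps(doc, sort_keys=True)


print(to_line_protocol(sample))
# cpu_total,host=aix01 sys=3.0,user=12.5 1588000000000000000
print(to_json(sample))
# {"cpu_total": {"sys": 3.0, "user": 12.5}, "host": "aix01"}
```

The data is the same in both cases; only the wire format differs, which is why the Telegraf data_format setting has to match the tool you pick.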

New meets old

As you might know, I do a lot of things with Kubernetes, so you might wonder how I ended up with AIX. Well, long story short: I'm (trying to) set up a pretty neat observability stack in Kubernetes with Prometheus and Thanos. Yet even though we can scrape k8s clusters, static targets with Telegraf, etc. - we always have some older stuff that really wants to join the cool club ;)

Just to be sure, I'm kidding a little bit. To be honest, it is quite important to be able to support older systems. Old does not per se mean bad. It's just sometimes more complicated.

Supported solution: NIMON - InfluxDB

So there is an easy solution to combine "old + new" - which is InfluxDB. This is supported by using NIMON. Yet I don't want to use InfluxDB. To be honest the arguments are somewhat simple:

  • With Prometheus I can scale better (when using Thanos or a similar solution).
  • By integrating Thanos with Prometheus, we can have near unlimited retention for a very, very low cost.

I'm not saying InfluxDB is bad or anything, it just did not fit in the stack we wanted to use. Now I could opt for a different approach:

  • Run a Prometheus stack + InfluxDB stack - both to be used as data sources in Grafana
    • I dislike this because it requires maintaining two stacks and I'm lazy. Seriously though - you want to focus on one thing if you can.
    • Retention costs for InfluxDB are also higher compared to Prometheus + Thanos.
  • Run the Prometheus stack + InfluxDB stack - yet scrape InfluxDB into Prometheus
    • It's becoming spaghetti this way. Just run.

Actual solution: Using Telegraf

With Telegraf you can do A LOT of cool things. You can see Telegraf as a Swiss Army knife for processing metrics. It has numerous input and output plugins. It's a magic funnel.

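With Telegraf we can take data in through input plugins, for instance socket_listener and influxdb_listener, which both come up later in this post.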

And we can also output data. For instance acting like a Prometheus client: https://github.com/influxdata/telegraf/tree/master/plugins/outputs/prometheus_client

Push vs Pull

Before I go to the actual technical solution and which plugins you can use:

It is essential to understand that InfluxDB is "PUSH" based and Prometheus is "PULL" based. To InfluxDB we need to send data, whereas Prometheus scrapes data from the clients. It's somewhat easier to implement a push method than to serve an endpoint that gets scraped.

NIMON can deal with InfluxDB by sending data, but there is nothing to be scraped.
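The gap between the two models can be bridged by something that accepts pushes and remembers the latest values until someone pulls them. Here is a toy Python sketch of that idea. The class and metric names are hypothetical, and this is only a mental model of what a bridge has to do, not how Telegraf is implemented.

```python
class PushToPullBridge:
    """Toy bridge: accepts pushed samples and holds them until they are scraped."""

    def __init__(self):
        self._latest = {}  # metric name -> most recent value

    def push(self, name, value):
        """Push side: a client sends a sample whenever it has one."""
        self._latest[name] = float(value)

    def scrape(self):
        """Pull side: return the current state as simple 'name value' lines."""
        return "".join(f"{name} {value}\n" for name, value in sorted(self._latest.items()))


bridge = PushToPullBridge()
bridge.push("cpu_user", 10)
bridge.push("cpu_user", 12.5)  # a newer sample overwrites the older one
print(bridge.scrape(), end="")
# cpu_user 12.5
```

Note that the scraper only ever sees the most recent value, which is also why the timing between pushes and scrapes matters.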

Technical solution

I'm assuming you already have NJMON or NIMON running. If not, please do prepare this and use the official resources of NMON for how-to's: http://nmon.sourceforge.net/pmwiki.php?n=Site.Njmon

We want to install Telegraf. I could copy/paste the steps, but I would recommend following the official guidelines here: https://docs.influxdata.com/telegraf/v1.14/introduction/installation/

Note: I linked to the latest Telegraf version at the time of writing. If you are reading this at the end of 2020 or later, please check for the latest version or updates yourself.

  • It is not required to install Telegraf on the same server as NMON - Read more about this in the "About" sections at the end of this post.
  • I'm using NJMON myself, but it should work with NIMON too. Again read the "about" section.

Now with either NIMON or NJMON and Telegraf present, we can start configuring both (again; I'm using NJMON in my examples).

Configuring Telegraf

With a default Telegraf install we can find our config files in /etc/telegraf

  • We can remove/move the telegraf.conf in this directory
  • Create an empty telegraf.conf (or just run > telegraf.conf)
  • Copy the following contents and write/quit:
    # Receive the JSON that NJMON pushes over TCP
    [[inputs.socket_listener]]
      service_address = "tcp://:8080"
      data_format = "json"

    # Expose the collected metrics for Prometheus to scrape
    [[outputs.prometheus_client]]
      listen = ":9273"

Explanation:

We use the plugin: inputs.socket_listener. I will go into more detail about that later. For this listener, we say tcp:// - which means over TCP and :8080 - which means that we will listen on port 8080. You are free to use whatever port you like (as long as it is a free port ;) )

The data_format is pretty important. It defines what type of data to listen for (and therefore parse). In our case it is json. If we use NIMON, this data_format should be influx.
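It helps to have a rough idea of what the json data format does with a nested document. As far as I understand the Telegraf documentation, nested keys are flattened into field names joined with an underscore, numbers become floats, and strings and booleans are dropped unless you configure them explicitly. The Python sketch below approximates that behaviour; it is not Telegraf's code, and the payload is made up rather than real NJMON output.

```python
import json


def flatten(value, prefix=""):
    """Approximate how a nested JSON document turns into flat numeric fields."""
    fields = {}
    if isinstance(value, dict):
        for key, child in value.items():
            fields.update(flatten(child, f"{prefix}_{key}" if prefix else key))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            fields.update(flatten(child, f"{prefix}_{index}" if prefix else str(index)))
    elif isinstance(value, bool):
        pass  # booleans are skipped (checked before int, since bool is an int subclass)
    elif isinstance(value, (int, float)):
        fields[prefix] = float(value)
    return fields  # strings and nulls fall through and are dropped


# Hypothetical payload, only for illustration.
payload = json.loads('{"cpu_total": {"user": 12.5, "sys": 3}, "hostname": "aix01"}')
print(flatten(payload))
# {'cpu_total_user': 12.5, 'cpu_total_sys': 3.0}
```

So if a string value such as a hostname goes missing in Prometheus, this default behaviour is the first thing I would check.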

For our output we are using the prometheus client with the plugin outputs.prometheus_client

You can see this output plugin as an HTTP page: a page with metrics that is reachable via (in this case) :9273/metrics. Not much more to say about this - as I told you: it's a magic funnel. For more details, just visit the GitHub URLs I've posted for each plugin in the Telegraf section above.
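Because the page is plain text, it is easy to read it from a script as well. Below is a minimal Python sketch that parses such a page into a dictionary. It assumes the simplest form of the format (one "series value" pair per line, no trailing timestamps) and uses a few lines of example output as input instead of a live endpoint.

```python
def parse_metrics(text):
    """Parse Prometheus text exposition lines into {series: value}.

    Assumes 'series value' lines without trailing timestamps.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and # HELP / # TYPE comments
        series, value = line.rsplit(None, 1)
        samples[series] = float(value)
    return samples


example = """\
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 1.0266e-05
go_gc_duration_seconds_sum 0.000769818
go_gc_duration_seconds_count 52
"""

for series, value in parse_metrics(example).items():
    print(series, value)
```

To run this against a live Telegraf, you could fetch the text with urllib.request.urlopen("http://localhost:9273/metrics") and pass the decoded body to parse_metrics.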

Configuring NJMON

And this is fairly easy. We pass on the following parameters: -i localhost -p 8080 and that's it. A full command would be: njmon -s 30 -c 2880 -i localhost -p 8080

Explanation:

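With -i we define the host to which we push our data, in this case our Telegraf agent. The -p is the port; I've used 8080 in my examples, so change this accordingly. In the full command, -s 30 takes a snapshot every 30 seconds and -c 2880 stops after 2880 snapshots, which works out to 24 hours.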
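If you want to check the Telegraf side before NJMON is involved, you can push a hand-made JSON document to the same host and port. The Python sketch below does that. I'm assuming here that the listener reads one JSON document per line, and the metric name is invented for the test.

```python
import json
import socket


def encode_sample(payload):
    """Serialise one JSON document, newline-terminated (assumed framing)."""
    return (json.dumps(payload) + "\n").encode("utf-8")


def send_sample(host, port, payload):
    """Push one document over TCP, the same direction NJMON pushes with -i and -p."""
    data = encode_sample(payload)
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall(data)
    return len(data)


# Example call (needs something listening on that port, e.g. Telegraf):
# send_sample("localhost", 8080, {"test_metric": 1})
```

After sending, the value should show up on the Telegraf metrics page once the next flush has happened.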

Result & checking the data

When everything is done, we can issue the njmon command or restart it via a service. We also restart Telegraf. Depending on your system, that could be something like: service telegraf restart

When checking the status of the Telegraf service you should see something along the lines of:

[outputs.prometheus_client] Listening on http://[::]:9273/metrics
[inputs.socket_listener] Listening on tcp://[::]:8080

If we do a curl request to our Telegraf scrape endpoint we should see metrics:

root@vpnetje:~# curl localhost:9273/metrics
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 1.0266e-05
go_gc_duration_seconds{quantile="0.25"} 1.1672e-05
go_gc_duration_seconds{quantile="0.5"} 1.2621e-05
go_gc_duration_seconds{quantile="0.75"} 1.3616e-05
go_gc_duration_seconds{quantile="1"} 7.8672e-05
go_gc_duration_seconds_sum 0.000769818
go_gc_duration_seconds_count 52
*SNIP*

NIMON vs NJMON

I've gone down the NJMON path. Both should be more or less the same, yet I noticed some inconsistency in the data. I did not test this thoroughly, but these should be your options IF you want to use NIMON over NJMON:

  • Use influxdb_listener with data_format "influx"
  • Use socket_listener with data_format "influx"

More details about data formats in Telegraf here: https://github.com/influxdata/telegraf/tree/master/plugins/parsers/influx & https://github.com/influxdata/telegraf/blob/master/docs/DATA_FORMATS_INPUT.md

Possible options with Prometheus

Nigel has made a nice picture explaining the different setups that are possible with Prometheus. Note the word Prometheus, as there are various other options to use NIMON/NJMON with. These are intentionally not covered here as we focus on Prometheus.

("influxdb.png")

About scrape timings

This becomes somewhat advanced quickly, so I won't go into detail about it in this post, but when playing with the settings you should think about the following:

  • You've got 2 applications handling data periodically: NMON pushes on its own interval and Prometheus scrapes on its own interval
  • When the timings are off, you could end up scraping the same sample twice (i.e. scraping more often than new data arrives)
  • I would recommend setting the Prometheus scrape interval to 1.5 to 2 times the interval for NMON
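As a quick worked example of that rule of thumb, the Python snippet below turns an NMON interval into a range for the Prometheus scrape interval. The function name is my own; the 30 seconds matches the -s 30 used in the njmon command earlier.

```python
import math


def scrape_interval_range(nmon_interval_s):
    """Return (low, high) scrape intervals in seconds: 1.5x to 2x the NMON interval."""
    low = math.ceil(nmon_interval_s * 1.5)  # round up so we never go below 1.5x
    high = nmon_interval_s * 2
    return low, high


low, high = scrape_interval_range(30)
print(f"scrape_interval between {low}s and {high}s")
# scrape_interval between 45s and 60s
```

Treat the result as a starting point and adjust it after looking at the actual data in Prometheus.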

About Grafana Dashboards

Any existing dashboards for Grafana for AIX metrics via NMON do not work with Prometheus. Very simply put: All the dashboards are made with InfluxDB as data source and also query InfluxDB. When using metrics in Prometheus, it is just different. In other words: You'll have to rewrite those.

About a dedicated Telegraf "server"

In my example we install Telegraf on the same server as NMON - this is not required. We can set up a dedicated machine for Telegraf (or just a machine that is not AIX :D ) and make sure we have firewall rules in place for the ports we use. In NMON we can define the host with the -i param and just as easily push the data to a remote Telegraf agent.
