Configuration Troubleshooting

Here are solutions for some known issues and pitfalls during the configuration process.

Plugins Timeout

The dockerized plugins started by run.sh are not killed by the monitoring daemon after the global timeout (service_check_timeout): the daemon would kill run.sh instead of the plugin process inside the container. Therefore we implemented a global minimum default of 120 seconds at the container level, with the ability to raise it by explicitly setting --timeout to a value higher than 120 seconds.
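
For example, a check that legitimately needs more time can be started like this (the plugin and its options are purely illustrative; only --timeout matters here):

$ ./run.sh check_netapp AutosizeMode -H filer --timeout=300

With --timeout=300 the container-level limit is raised from the default of 120 seconds to 300 seconds.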

Storefiles are not found

Symptoms:

You have changed the store-dir from its default by setting --storedir=/var/my/dir on the command-line, but the checks do not find the files in this directory.

Example:

$ ./run.sh check_netapp AutosizeMode -H filer --storedir=/tmp
No store for host 'filer' and object 'volume'. You may want to check the collector-checks. |

$ ls -l /tmp/filer/
-rw-r--r--  1 ila  wheel  83149 May 13 18:02 volume.store
...

Reason:

The directory (/tmp in the above example) exists twice: once on the host and once inside the container. The two directories have the same path and name, but they are not the same directory.

Solution:

Change the mapping in run.sh or create an additional one.
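
As a sketch (the exact docker run line in your run.sh may look different), an additional mapping is just another -v option that maps the host directory to the same path inside the container:

# hypothetical excerpt from run.sh: map /var/my/dir to the same path inside the container
docker run ... -v /var/my/dir:/var/my/dir ...

With such a mapping, --storedir=/var/my/dir refers to the same files on the host and inside the container.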

perf-object-counter-list-info failed / perf-object-instance-list-info-iter failed

Symptoms:

$ perl get_netapp_perfdata.pl -H sim812  --mode=7m -o lif
perf-object-counter-list-info failed: can't get instance names for lif

$ perl get_netapp_perfdata.pl -H sim821n1 --mode=cm -o ifnet
perf-object-instance-list-info-iter failed: Object "ifnet" was not found.

Reason:

The perf-object lif is new in DataONTAP 8.2.1. It replaces ifnet, which can be used for any filer running DataONTAP 8.1.x or older plus 7m-filers running even newer versions of DataONTAP. Therefore you can use get_netapp_*.pl --object=ifnet together with PerfIf for cmode-filers older than 8.2.1 and for all 7m-filers.

The new get_netapp_*.pl --object=lif together with PerfLif is only for cluster-mode filers with DataONTAP 8.2.1 or later.
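
Assuming sim812 is a 7m-filer and sim821n1 a cmode-filer running DataONTAP 8.2.1 or later, the failing commands from above should therefore work with the matching objects:

$ perl get_netapp_perfdata.pl -H sim812  --mode=7m -o ifnet
$ perl get_netapp_perfdata.pl -H sim821n1 --mode=cm -o lif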

UNKNOWN or out of bound results from Collector-Checks

When installing or configuring collector-checks, the collector may run a bit later than the dependent check-scripts. This temporarily results in several UNKNOWNs (typically right after the checks have been installed). To avoid that, you can run the collector once from the command line before you restart Nagios.
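
For example (getter name, host and object are only illustrative; the restart command depends on your system):

$ ./get_netapp_cm.pl -H filer -o volume
$ sudo systemctl restart nagios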

The check-scripts' --max_age and the check_interval of the collectors must be configured consistently: e.g. if you set the checks' max_age to 2 minutes but collect only every 30 minutes, the checks will return UNKNOWN most of the time.

For performance-collectors, the checks' --delta must also fit into the getters' check_interval. For more details see Configuring Collector-Checks for Performance-Checks.
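
A consistent setup could look like the following sketch; the service names, intervals and command definitions are examples only and have to be adapted to your configuration:

# collector service: fetches the data, e.g. every 5 minutes
define service {
    service_description   collect_netapp_volume
    check_interval        5
    ...
}

# dependent check service: its --max_age should be clearly larger than the
# collector's check_interval (e.g. 15 minutes for a 5-minute collector);
# for performance-checks the --delta must also fit into that interval
define service {
    service_description   check_netapp_volume_usage
    ...
}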

The collectors (a.k.a. getters) collect and process the data in memory; once everything has been collected, the store-phase starts and all data is written to disk immediately. The store-duration is measured by the getter and printed as perf-data to stdout, e.g. store_duration=0.158s.

Both the getters and the checks use file-locking, if supported by the underlying file-system.

Reschedule Next Service Check

If you want to get the latest result from a monitored device, you have to hit "reschedule next service check" on both the corresponding collectors (there may be more than one) and on the check itself, in that order and with a small delay in between!

Missing Environment Variables

Symptoms:

The plugin complains either with …

missing 'hostname' (environment-variable 'NAGIOS_HOSTNAME')

… or with …

Your monitoring-daemon did not set the required environment-variables (NAGIOS_SERVICEDESC, NAGIOS_COMMANDFILE)

Reasons and Solutions:

If a check terminates on the command-line with one of the messages above, this is due to the rm_ack feature, which needs these environment-variables to be set by Nagios. This is not a bug. Either ignore it on the command-line or set --rm_ack=off.

This means that if you are testing plugins on the command-line, you can simply ignore these warnings and run the same command again (which will usually make the error disappear). Another solution for the command-line is to set --rm_ack=off.
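
For example, when testing on the command-line (the check-name and options are only illustrative):

$ ./check_netapp_pro.pl Usage -H filer -o volume --rm_ack=off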

If you get such an error-message from your monitoring-system: make sure that it exports the required environment variables. In particular, Icinga uses ICINGA_HOSTNAME etc. instead of NAGIOS_HOSTNAME.

Newer versions of Nagios (and probably all other Nagios-compatible systems) do not export these variables by default anymore. They must be activated with enable_environment_macros as described below.

Background

These environment-variables are exported by the monitoring system (e.g. Nagios, Icinga, ...) when it executes the check; they tell the plugin the name of the host and the service-description. Host and service-description are needed by the check to reset the correct service acknowledgement in case of a reason change (--rm_ack).

Some monitoring systems do not export these environment-variables by default, or export them in a non-Nagios-standard way, but offer a configuration-setting to change this. Please ask your monitoring-system's support how to change these settings. If you want to find the setting yourself: in Nagios XI you would have to set enable_environment_macros=1 (nagios.cfg). Please inform yourself about the performance implications of changing this value!
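
The corresponding line in nagios.cfg would then look like this (a sketch; the file location depends on your installation, and the daemon has to be restarted afterwards):

# nagios.cfg
enable_environment_macros=1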

For the curious: you cannot check this on the command line, not even after an su <monitoring-user>! Again: these variables are exported by the monitoring-daemon when the check is run.

This error occurs occasionally because these values are only needed in case of a reason-change which, by its nature, happens occasionally.

For more background I recommend reading the blog articles about rm_ack.

Setting --rm_ack=never or --rm_ack=off is not a generally recommended solution, since it may result in masked alarms. But there are situations where these settings are acceptable. This blog entry may help to fully understand the possible consequences.

Escape HTML-Tags

Problem

The output in the GUI contains HTML-tags like colors, <br> or </br>.

Solution

Configure cgi.cfg so that HTML-tags are not escaped any more:

# ESCAPE HTML TAGS
# This option determines whether HTML tags in host and service
# status output is escaped in the web interface.  If enabled,
# your plugin output will not be able to contain clickable links.

escape_html_tags=0

CRITICAL exit and no output

This can be a consequence of modules that are not in @INC. The following may help:

/usr/local/lib/perl/5.10.1$ sudo ln -s /usr/local/nagios/plugins/check_netapp_pro/ILanti/

/usr/local/lib/perl/5.10.1$ ls -l

drwxr-xr-x 19 root root 4096 2013-10-29 11:34 auto
lrwxrwxrwx  1 root root   46 2013-10-29 14:13 ILanti -> /usr/local/nagios/plugins/check_netapp_pro/ILanti/
-rw-rw-r--  1 root root 5287 2013-10-29 11:34 perllocal.pod
drwxrwxr-x  2 root root 4096 2013-10-29 11:26 version
-r--r--r--  1 root root 6619 2013-09-03 01:49 version.pm
-r--r--r--  1 root root 9852 2013-08-16 14:55 version.pod
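
An alternative sketch, assuming the modules live under /usr/local/nagios/plugins/check_netapp_pro/ as shown above, is to extend the Perl search path instead of creating a symlink. Whether this works for checks executed by the daemon depends on how the monitoring-user's environment is set up:

$ export PERL5LIB=/usr/local/nagios/plugins/check_netapp_pro:$PERL5LIB
$ ./check_netapp_pro.pl Usage -H filer -o volume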

Store file (myfiler.volume) is out of date!

Check whether the getter has been run recently, e.g. within the last 15 minutes (depending on the delta you have set with --delta). Also consider that some getters depend on other getters' data (e.g. vol_snapshot depends on volume). See also the section Collector Checks vs. Stand-Alone Plugins.
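
For example, compare the modification time of the store-file with the getter's schedule (the store-directory below is only a placeholder; use your --storedir):

$ ls -l <storedir>/myfiler/volume.store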

Vserver and node-name are n/a on DataONTAP 7.x filer

First of all: DataONTAP 7.x is not fully supported by check_netapp_pro, although some checks do work. E.g. Usage can check the aggregates-usage if you set the vserver-name to the host-name:

$ ./get_netapp_7m.pl -H 10.25.0.22 -o aggregate --explore
Existing data for object 'aggregate'
Node: 10.25.0.22
     Instance: aggr0 (uuid n/a)
                aggregate-name = aggr0
                home-name = n/a
                nodes = n/a
                ...
Explore done - now configure your nagios ...

$  ./check_netapp_pro.pl Usage -H 10.25.0.22 -o aggregate -s 10.25.0.22
NETAPP_PRO USAGE WARNING - 1 aggregate checked, 0 critical and 1 warning
aggr0: 842.1GiB (WARNING)
 | aggr0=904163696640B;808456046182.4;1039443487948.8;0;1154937208832

vol_snapshot-Getter Depends on volume-Store

Problem:

$ ./get_netapp_cm.pl -H filer ... -o vol_snapshot
No store for host 'filer' and object 'volume'. You may want to check the collector-checks.

Solution:

The getter for the volume-snapshots depends on the getter for the volumes, so the following should work if run in this order:

$ ./get_netapp_cm.pl -H filer ... -o volume
$ ./get_netapp_cm.pl -H filer ... -o vol_snapshot

Getter stops because of duplicate instances

Under some configurations the getters for at least the snapmirror and snapmirror-destination objects stop working because they see instances they have already collected.

Example error message: Instance 'NETAPP01-SVM01:NETAPP01_SVM01_xxxx01_vol' already exists - can not continue!

Background: https://blog.netapp-monitoring.info/2020/02/12/star-setup-stop-collectors/

Solution:

Consider using the --skip_duplicates switch (introduced in v5.2.1).
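
For example (host and object-name are only illustrative; re-run the getter call that reported the error and add the switch):

$ ./get_netapp_cm.pl -H filer -o snapmirror --skip_duplicates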