Here are solutions to some known issues and pitfalls that can occur during configuration.
The dockerized plugins started by run.sh won't be killed by the monitoring daemon after the global timeout (service_check_timeout); the daemon would kill run.sh instead of the plugin process. Therefore a global minimum default of 120 seconds is implemented on the container level, which is raised if --timeout is explicitly set to a value higher than 120 seconds.
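For example (the timeout value is a placeholder; the check name is the one used in the example further below), an explicitly set timeout above 120 seconds is passed through to the container:

```shell
$ ./run.sh check_netapp AutosizeMode -H filer --timeout=300
```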
You have changed the store-dir from its default by setting --storedir=/var/my/dir on the command-line, but the checks do not find the files in this directory.
Example:
$ ./run.sh check_netapp AutosizeMode -H filer --storedir=/tmp
No store for host 'filer' and object 'volume'. You may want to check the collector-checks. |
$ ls -l /tmp/filer/
-rw-r--r-- 1 ila wheel 83149 May 13 18:02 volume.store
...
The directory (/tmp in the above example) exists twice: once on the host and once inside the container. The two directories have the same path and name but are not the same directory.
Change the mapping in run.sh or create an additional one.
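A minimal sketch of such a mapping, assuming run.sh wraps a docker run call (the image name is hypothetical):

```shell
# Bind-mount the store-dir so that /var/my/dir refers to the same
# directory on the host and inside the container.
# The image name "check_netapp_pro" is a placeholder - use your own.
docker run --rm \
    -v /var/my/dir:/var/my/dir \
    check_netapp_pro "$@"
```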
$ perl get_netapp_perfdata.pl -H sim812 --mode=7m -o lif
perf-object-counter-list-info failed: can't get instance names for lif
$ perl get_netapp_perfdata.pl -H sim821n1 --mode=cm -o ifnet
perf-object-instance-list-info-iter failed: Object "ifnet" was not found.
The perf-object lif is new in DataONTAP 8.2.1. It replaces ifnet, which can be used for any filer running DataONTAP 8.1.x or older, plus 7m-filers running even newer versions of DataONTAP. Therefore you can use get_netapp_*.pl --object=ifnet together with PerfIf for cmode-filers older than 8.2.1 and for all 7m-filers. The new get_netapp_*.pl --object=lif together with PerfLif is only for cluster-mode filers with DataONTAP 8.2.1 or later.
When installing or configuring collector-checks, the collector may run somewhat later than the dependent check-scripts. This temporarily (typically right after installing the checks) results in several UNKNOWNs. To avoid that, run the collector from the command line before you restart Nagios.
The check-scripts' --max_age and the collectors' check_interval must be configured consistently. E.g. if you set the checks' max_age to 2 minutes but collect only every 30 minutes, the checks will return UNKNOWN most of the time.
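As an illustration, a consistent pairing could look like this in the Nagios object configuration: the collector runs every 5 minutes, and the dependent check accepts stores up to 10 minutes old. All service names, command names and values below are hypothetical, and the "10m" value syntax for --max_age is an assumption; check your installation's conventions:

```
# Sketch - names and values are examples only.
define service {
    service_description  NetApp collector volume
    check_command        get_netapp_cm!-o volume
    check_interval       5            ; collect every 5 minutes
}
define service {
    service_description  NetApp volume usage
    check_command        check_netapp_pro_usage!--max_age=10m
    check_interval       5            ; max_age (10m) > collector interval (5m)
}
```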
For performance-collectors, the checks' --delta must also fit the getters' check_interval. For more details see Configuring Collector-Checks for Performance-Checks.
The collectors (a.k.a. getters) collect and process the data in memory; once everything is there, the store-phase starts and all data is immediately written to disk.
The store-duration is monitored by the getter and printed as perf-data to stdout. E.g. store_duration=0.158s
Both getters and checks use file-locking, if supported by the underlying file-system.
If you want the latest result from a monitored device, you have to hit "reschedule next service check" on both the responsible collectors (there may be more than one) and the check itself, in that order and with a small delay in between!
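The same two-step reschedule can be scripted via the monitoring daemon's external command file, using the documented SCHEDULE_FORCED_SVC_CHECK command. The sketch below only prints the command strings; the command-file path, host and service names are hypothetical:

```shell
# Sketch: force fresh results via Nagios' external command file.
# Path, host and service names are placeholders - adjust to your setup.
CMDFILE=/usr/local/nagios/var/rw/nagios.cmd
NOW=$(date +%s)

# 1. Reschedule the responsible collector-check(s) first ...
COLLECT="[${NOW}] SCHEDULE_FORCED_SVC_CHECK;filer;NetApp collector volume;${NOW}"
printf '%s\n' "$COLLECT"        # normally: printf '%s\n' "$COLLECT" >> "$CMDFILE"

sleep 1                         # small delay so the collector can write its store

# 2. ... then reschedule the dependent check itself
CHECK="[${NOW}] SCHEDULE_FORCED_SVC_CHECK;filer;NetApp volume usage;${NOW}"
printf '%s\n' "$CHECK"          # normally: >> "$CMDFILE"
```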
The plugin complains either with …
missing 'hostname' (environment-variable 'NAGIOS_HOSTNAME')
… or with …
Your monitoring-daemon did not set the required environment-variables (NAGIOS_SERVICEDESC, NAGIOS_COMMANDFILE)
If a check terminates on the command-line with one of the messages above, this is due to the rm_ack feature, which needs these environment-variables set by Nagios. This is not a bug. When testing plugins on the command-line you can simply ignore the warning and run the same command again (which will mostly make the error disappear), or set --rm_ack=off.
If you get such an error-message from your monitoring-system, make sure that it exports the required environment variables. Icinga in particular uses ICINGA_HOSTNAME etc. instead of NAGIOS_HOSTNAME.
Newer versions of Nagios (and probably all other Nagios-compatible systems) no longer export these variables by default. They must be activated with enable_environment_macros as described below.
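For Icinga 2, the required variables can alternatively be exported per command via the env attribute of a CheckCommand object. The following is only a sketch; the command name, plugin path and the exact set of variables your plugin version needs are assumptions:

```
// Icinga 2 sketch - all names are examples.
object CheckCommand "check_netapp_pro_usage" {
  command = [ PluginDir + "/check_netapp_pro.pl", "Usage" ]
  env = {
    NAGIOS_HOSTNAME    = "$host.name$"
    NAGIOS_SERVICEDESC = "$service.name$"
  }
}
```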
These environment-variables are exported by the monitoring system (e.g. Nagios, Icinga, ...) when it executes the check; they tell the check the name of the host and the service-description. Both are needed to reset the correct service acknowledgement in case of a reason change (--rm_ack).
Some monitoring systems do not export these environment-variables by default, or export them in a non-Nagios-standard way, but have a configuration-setting to change this. Please ask your monitoring-system's support how to change these settings. In case you want to find the setting yourself: in Nagios XI you would set enable_environment_macros=1 in nagios.cfg. Please inform yourself about the performance implications of changing this value!
For the curious: you cannot check this on the command line, not even after an su <monitoring-user>! Again: the variables are exported by the monitoring-daemon when the check is run.
This error occurs occasionally because these values are only needed in case of a reason-change which, by its nature, happens occasionally.
For more background I recommend reading the blog articles about rm_ack.
Setting --rm_ack=never or --rm_ack=off is not a generally recommended solution, since it may result in masked alarms. But there are situations where these settings are acceptable. This blog entry may help to fully understand the possible consequences.
The output in the GUI contains HTML-tags like colors, <br> or </br>.
Configure cgi.cfg
so that HTML-tags are not escaped any more:
# ESCAPE HTML TAGS
# This option determines whether HTML tags in host and service
# status output is escaped in the web interface. If enabled,
# your plugin output will not be able to contain clickable links.
escape_html_tags=0
This could be a consequence of the modules not being found in @INC. The following symlink may help:
/usr/local/lib/perl/5.10.1$ sudo ln -s /usr/local/nagios/plugins/check_netapp_pro/ILanti/
/usr/local/lib/perl/5.10.1$ ls -l
drwxr-xr-x 19 root root 4096 2013-10-29 11:34 auto
lrwxrwxrwx 1 root root 46 2013-10-29 14:13 ILanti -> /usr/local/nagios/plugins/check_netapp_pro/ILanti/
-rw-rw-r-- 1 root root 5287 2013-10-29 11:34 perllocal.pod
drwxrwxr-x 2 root root 4096 2013-10-29 11:26 version
-r--r--r-- 1 root root 6619 2013-09-03 01:49 version.pm
-r--r--r-- 1 root root 9852 2013-08-16 14:55 version.pod
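As an alternative to the symlink, Perl's standard PERL5LIB environment variable can extend the module search path (@INC) directly. The path below is the plugin directory from the example above; adjust it to your installation:

```shell
# Make "use ILanti::..." resolve without symlinking into the Perl lib dir:
# PERL5LIB entries are prepended to Perl's @INC search path.
export PERL5LIB=/usr/local/nagios/plugins/check_netapp_pro
```

For the monitoring daemon, the variable must of course be set in the environment the checks actually run in (e.g. the daemon's init/service configuration), not just in an interactive shell.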
Check if the getter has been run within the last e.g. 15 minutes (depends on the delta you have set with --delta). Also consider that some getters depend on other getters' data (e.g. vol_snapshot depends on volume). See also section Collector Checks vs. Stand-Alone Plugins.
First of all: DataONTAP 7.x is not fully supported by check_netapp_pro, although some checks do work. E.g. Usage can check the aggregates-usage if you set the vserver-name to the host-name:
$ ./get_netapp_7m.pl -H 10.25.0.22 -o aggregate --explore
Existing data for object 'aggregate'
Node: 10.25.0.22
Instance: aggr0 (uuid n/a)
aggregate-name = aggr0
home-name = n/a
nodes = n/a
...
Explore done - now configure your nagios ...
$ ./check_netapp_pro.pl Usage -H 10.25.0.22 -o aggregate -s 10.25.0.22
NETAPP_PRO USAGE WARNING - 1 aggregate checked, 0 critical and 1 warning
aggr0: 842.1GiB (WARNING)
| aggr0=904163696640B;808456046182.4;1039443487948.8;0;1154937208832
$ ./get_netapp_cm.pl -H filer ... -o vol_snapshot
No store for host 'filer' and object 'volume'. You may want to check the collector-checks.
The getter for the volume-snapshots depends on the getter for the volumes. So the following should work if typed in this order:
$ ./get_netapp_cm.pl -H filer ... -o volume
$ ./get_netapp_cm.pl -H filer ... -o vol_snapshot
Under some configurations the getters for at least the snap-mirror and snapmirror-destination objects stop working because they see instances they have already collected.
Example error message: Instance 'NETAPP01-SVM01:NETAPP01_SVM01_xxxx01_vol' already exists - can not continue!
Background: https://blog.netapp-monitoring.info/2020/02/12/star-setup-stop-collectors/
Consider using the --skip_duplicates switch (introduced in v5.2.1).
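With v5.2.1 or later, the affected getter can then be called with the switch. The object name below is illustrative; use the object that fails in your setup:

```shell
$ ./get_netapp_cm.pl -H filer ... -o snapmirror --skip_duplicates
```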