Troubleshooting

Plugins Timeout

The dockerized plugins started by run.sh are not killed by the monitoring daemon after the global timeout (service_check_timeout); the daemon would kill run.sh instead of the plugin process. We therefore implemented a global default minimum of 120 seconds at the container level, which is raised automatically if an explicit --timeout higher than 120 seconds is set.
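The rule above can be sketched as a small shell computation; the variable names are illustrative and not actual run.sh internals:

```shell
# Sketch of the container-level timeout rule described above:
# the effective container timeout is max(120, explicit --timeout).
plugin_timeout=300   # value of an explicitly set --timeout, in seconds
min_timeout=120      # global container-level minimum

if [ "$plugin_timeout" -gt "$min_timeout" ]; then
  effective_timeout=$plugin_timeout
else
  effective_timeout=$min_timeout
fi
echo "effective timeout: ${effective_timeout}s"
```

With no explicit --timeout (or one of 120 seconds or less), the container-level minimum of 120 seconds applies.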

Storefiles are not found

Symptoms:

You have changed the store directory from its default by setting --storedir=/var/my/dir on the command line, but the checks do not find the files in this directory.

Example:

$ ./run.sh check_netapp AutosizeMode -H filer --storedir=/tmp
No store for host 'filer' and object 'volume'. You may want to check the collector-checks. |

$ ls -l /tmp/filer/
-rw-r--r--  1 ila  wheel  83149 May 13 18:02 volume.store
...

Reason:

The directory (/tmp in the above example) exists twice: once on the host and once inside the container. The two directories share the same path and name but are not the same directory, so files written to one are not visible in the other.

Solution:

Change the mapping in run.sh or create an additional one.
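As a sketch, a mapping in run.sh boils down to a Docker -v option that binds a host directory to the same path inside the container; /var/my/dir is the example directory from above, the image name and the rest of the command line are hypothetical and abbreviated:

```
# Illustrative excerpt of a docker invocation as run.sh might issue it;
# only the -v options matter here. Each -v maps a host directory to the
# same path inside the container, so both sides see the same files.
docker run --rm \
  -v /tmp:/tmp \
  -v /var/my/dir:/var/my/dir \
  check_netapp_pro ...
```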

perf-object-counter-list-info failed / perf-object-instance-list-info-iter failed

Symptoms:

$ perl get_netapp_perfdata.pl -H sim812  --mode=7m -o lif
perf-object-counter-list-info failed: can't get instance names for lif

$ perl get_netapp_perfdata.pl -H sim821n1 --mode=cm -o ifnet
perf-object-instance-list-info-iter failed: Object "ifnet" was not found.

Reason:

The perf-object lif is new in DataONTAP 8.2.1. It replaces ifnet, which can be used with any filer running DataONTAP 8.1.x or older, as well as with 7m-filers running even newer DataONTAP versions. Therefore you can use get_netapp_*.pl --object=ifnet together with PerfIf for cmode-filers older than 8.2.1 and for all 7m-filers.

The new get_netapp_*.pl --object=lif together with PerfLif is only for cluster-mode filers with DataONTAP 8.2.1 or later.

UNKNOWN or out of bound results from Collector-Checks

When installing or configuring collector-checks, the collector may run a bit later than the dependent check-scripts. This temporarily results in several UNKNOWNs (typically right after the checks have been installed). To avoid this, run the collector once from the command line before you restart Nagios.

The check-scripts' --max_age and the collectors' check_interval must be configured consistently. E.g. if you set the checks' --max_age to 2 minutes but collect only every 30 minutes, the checks will return UNKNOWN most of the time.
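The relation can be sanity-checked as follows; the values and variable names are illustrative, not actual configuration keys:

```shell
# Sanity-check sketch for the relation described above: a check's
# --max_age must exceed the collector's check_interval, otherwise the
# store file is usually older than the check tolerates.
collector_interval=30   # collector runs every 30 minutes
check_max_age=2         # check accepts store files up to 2 minutes old

if [ "$check_max_age" -le "$collector_interval" ]; then
  echo "misconfigured: checks will return UNKNOWN most of the time"
else
  echo "ok"
fi
```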

For performance-collectors, the checks' --delta must also fit the getter's check_interval. For more details see Configuring Collector-Checks for Performance-Checks.

The collectors (a.k.a. getters) collect and process the data in memory; once everything is there, the store phase starts and all data is immediately written to disk. The store duration is monitored by the getter and printed as perf-data to stdout, e.g. store_duration=0.158s

Both getters and checks use file-locking, if supported by the underlying file-system.
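The locking idea can be sketched with flock(1), assuming the file system supports advisory locks; the paths are illustrative and this is not the plugin's actual code:

```shell
# Minimal sketch of advisory file-locking with flock(1): the getter
# would hold an exclusive lock while writing the store; a reader would
# take a shared lock (flock -s) instead.
store=/tmp/demo_volume.store
(
  flock -x 9                     # exclusive lock on fd 9 while writing
  printf 'demo data\n' > "$store"
) 9>"$store.lock"
cat "$store"
```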

Reschedule Next Service Check

If you want the latest result for a monitored device, you have to trigger "reschedule next service check" on both the responsible collectors (there may be more than one) and the check itself, in that order, with a small delay in between!

missing 'hostname' (environment-variable 'NAGIOS_HOSTNAME')

If a check terminates on the command line with missing 'hostname' (as argument or environment-variable 'NAGIOS_HOSTNAME'), this is due to the rm_ack feature, which needs these environment variables set by Nagios. This is not a bug. Either ignore it on the command line or set --rm_ack=never.
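The lookup can be sketched as follows; this is illustrative shell, not the plugin's actual code, and the host name is made up:

```shell
# Sketch of the fallback described above: prefer an explicitly given
# host name, otherwise fall back to the environment variable that the
# monitoring daemon exports when it runs the check.
NAGIOS_HOSTNAME=filer01          # normally exported by Nagios itself
cli_hostname=""                  # no -H/--hostname given on the command line
hostname=${cli_hostname:-$NAGIOS_HOSTNAME}
if [ -z "$hostname" ]; then
  echo "missing 'hostname' (environment-variable 'NAGIOS_HOSTNAME')" >&2
fi
echo "$hostname"
```

On the command line neither source is set by default, which is why the message appears there.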

If you get such an error message from your monitoring system: make sure that it exports the required environment variables. Icinga in particular uses ICINGA_HOSTNAME etc. instead of NAGIOS_HOSTNAME.

Background

These environment variables are exported by the monitoring system (e.g. Nagios, Icinga, ...) when it executes the check; they tell the check the name of the host and the service-description. Both are needed by the check to reset the correct service acknowledgement in case of a reason change (--rm_ack).

Some monitoring systems do not export these environment variables by default, or export them in a non-Nagios-standard way, but provide a configuration setting to change this. Please ask your monitoring system's support how to change these settings. In case you search for the setting yourself: in Nagios XI you would have to set enable_environment_macros=1 (nagios.cfg). Please inform yourself about the performance implications of changing this value!
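For Nagios XI, the setting mentioned above is a single line in nagios.cfg:

```
# nagios.cfg: enable export of the environment macros (NAGIOS_HOSTNAME
# etc.) to checks. Mind the performance implications noted above.
enable_environment_macros=1
```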

For the curious: you cannot check this on the command line, not even after a su <monitoring-user>! Again: these variables are exported by the monitoring daemon when the check is run.

This error occurs only occasionally because these values are only needed in case of a reason change, which, by its nature, happens occasionally.

For more background I recommend reading the blog articles about rm_ack.

Setting --rm_ack=never or --rm_ack=off is not a generally recommended solution, since it may result in masked alarms. But there are situations, where these settings are acceptable. This blog entry may help to fully understand the possible consequences.

Escape HTML-Tags

Problem

The output in the GUI contains HTML-tags like colors, <br> or </br>.

Solution

Configure cgi.cfg so that HTML-tags are not escaped any more:

# ESCAPE HTML TAGS
# This option determines whether HTML tags in host and service
# status output is escaped in the web interface.  If enabled,
# your plugin output will not be able to contain clickable links.

escape_html_tags=0

CRITICAL exit and no output

This can be a consequence of Perl modules not being found in @INC. The following symlink may help:

/usr/local/lib/perl/5.10.1$ sudo ln -s /usr/local/nagios/plugins/check_netapp_pro/ILanti/

/usr/local/lib/perl/5.10.1$ ls -l

drwxr-xr-x 19 root root 4096 2013-10-29 11:34 auto
lrwxrwxrwx  1 root root   46 2013-10-29 14:13 ILanti -> /usr/local/nagios/plugins/check_netapp_pro/ILanti/
-rw-rw-r--  1 root root 5287 2013-10-29 11:34 perllocal.pod
drwxrwxr-x  2 root root 4096 2013-10-29 11:26 version
-r--r--r--  1 root root 6619 2013-09-03 01:49 version.pm
-r--r--r--  1 root root 9852 2013-08-16 14:55 version.pod
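Instead of symlinking the module directory into Perl's library path, you can extend the search path via the PERL5LIB environment variable; the path below is the one used in the listing above:

```shell
# Alternative to the symlink: add the plugin's base directory to Perl's
# module search path. perl appends PERL5LIB to @INC, so ILanti::* modules
# under this directory become loadable without touching system paths.
export PERL5LIB=/usr/local/nagios/plugins/check_netapp_pro
echo "$PERL5LIB"
```

Set this in the environment of the monitoring daemon (or the wrapper that starts the checks) so it is in effect when the plugins run.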

Store file (myfiler.volume) is out of date!

Check whether the getter has run within the required interval, e.g. the last 15 minutes (this depends on the --delta you have set). Also consider that some getters depend on other getters' data (e.g. vol_snapshot depends on volume). See also section Collector Checks vs. Stand-Alone Plugins
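The age check can be done on the command line, for example with find; the store path follows the <storedir>/<host>/<object>.store pattern seen earlier, and the host name and 15-minute limit are illustrative:

```shell
# Sketch: report whether a store file was modified within the last
# 15 minutes. The touch below stands in for a successful getter run.
store=/tmp/myfiler/volume.store
mkdir -p /tmp/myfiler && touch "$store"
if [ -n "$(find "$store" -mmin -15 2>/dev/null)" ]; then
  echo "store is fresh"
else
  echo "store is out of date - (re)run the getter"
fi
```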

Vserver and node-name are n/a on DataONTAP 7.x filer

First of all: DataONTAP 7.x is not fully supported by check_netapp_pro, but some checks work. E.g. Usage can check the aggregate usage if you set the vserver name to the host name:

$ ./get_netapp_7m.pl -H 10.25.0.22 -o aggregate --explore
Existing data for object 'aggregate'
Node: 10.25.0.22
     Instance: aggr0 (uuid n/a)
                aggregate-name = aggr0
                home-name = n/a
                nodes = n/a
                ...
Explore done - now configure your nagios ...

$  ./check_netapp_pro.pl Usage -H 10.25.0.22 -o aggregate -s 10.25.0.22
NETAPP_PRO USAGE WARNING - 1 aggregate checked, 0 critical and 1 warning
aggr0: 842.1GiB (WARNING)
 | aggr0=904163696640B;808456046182.4;1039443487948.8;0;1154937208832

vol_snapshot-Getter Depends on volume-Store

Problem:

$ ./get_netapp_cm.pl -H filer ... -o vol_snapshot
No store for host 'filer' and object 'volume'. You may want to check the collector-checks.

Solution:

The getter for the volume-snapshots depends on the getter for the volumes. So the following should work if typed in this order:

$ ./get_netapp_cm.pl -H filer ... -o volume
$ ./get_netapp_cm.pl -H filer ... -o vol_snapshot

Getter stops because of duplicate instances

Under some configurations the getters for at least the snapmirror and snapmirror-destination objects stop working because they see instances they have already collected.

Example error message: Instance 'NETAPP01-SVM01:NETAPP01_SVM01_xxxx01_vol' already exists - can not continue!

Background: https://blog.netapp-monitoring.info/2020/02/12/star-setup-stop-collectors/

Solution:

Consider using the switch --skip_duplicates (introduced in v5.2.1).

DiskPaths: Storefile not found or out of date

Problem:

Although you have a running getter and an up-to-date store file for the disk object in place, you see this error message:

No store (type: store) for host 'filer' and object 'disk'. You may want to check the collector-checks.

or

Store file (filer.disk) is out of date!

Solution:

The DiskPaths check is not supported on all ONTAP versions and requires a dedicated ZAPI getter to collect its data. Please consult the documentation of the DiskPaths check (--help) for further details.

Volume getter does not get all volumes

Problem:

The newer, universal getter get_netapp with the volume object sometimes returns fewer volumes than the older Perl-based get_netapp_cm.pl -o volume.

Reason:

The modern RESTful API does not return all volumes if the official volume endpoint is used, and the universal get_netapp uses the REST API in preference to the older ZAPI by default.

Solution:

As long as the ONTAP version still supports the ZAPI you can force the getter to use it with --api=zapi.

If you think you need the volumes not returned by the official RESTful API endpoint, please consider sending us a short report. We are aware of another, private endpoint which could return a complete list.

Authentication errors

Problem:

One or both of the following occurs with a user account:

  • The unigetter (get_netapp) returns with an authentication error ("Please check the credentials (--user, --pass or --authfile)"), although the credentials are proven ok.

  • Some getters return 0 instances although instances are available on the filer.

These effects disappear if you use an admin account (--user=admin --pass=<admin-password>).

Explanation:

The API detection is confused by a missing or otherwise wrong configuration of the monitoring user on the filer.

Workaround:

Disable the api-detection and explicitly set the api (e.g. --api=rest --disable_api_detection)

Solution:

Check your filer's configuration (see Typescript Monitoring User for cdot). If you find a capability missing in the typescript, please report it together with the ONTAP version to the developers.