check_netapp_ems (Event Management System log)

Checks the rate of specific events in the Event Management System log.

Usage

$ check_netapp_ems event-rate -H <host> [...] [--help] 

Description

This plugin reads the event log and counts the number of events within a given lookbehind-period.

A typical usage scenario is counting the number of autosize events within the last few hours. A high rate of such events can indicate that volumes are getting too small.

ONTAP Documentation

EMS Reference for ONTAP 9.10.1

Important Parameters

--name Name of the EMS events whose rate should be calculated. If omitted, all events are counted. A string prefixed with a tilde (~string) is treated as a regular expression. See examples below.

--lookbehind Time period for calculating the rate of matching EMS events. Must be a positive integer followed by a time unit: s(econd), min(ute), h(our), d(ay), w(eek). Defaults to 1h.

--rate Rate used for presenting the result in the message and the thresholds. Can be per_second, per_minute, per_hour, per_day or per_week.

--warning / --critical Thresholds for the rate. Each threshold is written as a plain number without any unit. The unit of the thresholds is taken from the --rate parameter.

--verbose Shows the list of events that are taken into account when calculating the rate.

Examples:

--rate=per_second --warning=3 → warns if more than 3 events per second

--rate=per_week --warning=3 → warns if more than 3 events per week

--severity=<string or regex> Severity filter: counts only events with the given log level. If prefixed with ~, the value is treated as a regular expression. Neither the string nor the regex is case sensitive. For example, --severity=ALERT can also be written as --severity=alert. See examples below.

The available severity-strings can be listed with ontap> event catalog show -severity

For all other parameters consult --help on the command line.
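
The --lookbehind format described above (a positive integer followed by s, min, h, d or w) can be illustrated with a small Python sketch. This is not the plugin's actual code; the function name parse_lookbehind and the table UNIT_SECONDS are hypothetical and exist only for this illustration.

```python
import re

# Seconds per time unit, following the units listed for --lookbehind.
UNIT_SECONDS = {"s": 1, "min": 60, "h": 3600, "d": 86400, "w": 604800}

def parse_lookbehind(value: str) -> int:
    """Convert a string such as '1h' or '30min' into seconds.

    Hypothetical helper for illustration, not part of the plugin.
    """
    match = re.fullmatch(r"(\d+)(s|min|h|d|w)", value)
    if match is None:
        raise ValueError(f"invalid lookbehind value: {value!r}")
    count = int(match.group(1))
    if count <= 0:
        raise ValueError("lookbehind must be a positive integer")
    return count * UNIT_SECONDS[match.group(2)]

print(parse_lookbehind("1h"))     # 3600
print(parse_lookbehind("30min"))  # 1800
print(parse_lookbehind("1d"))     # 86400
```

With this reading, --lookbehind=1d and --lookbehind=24h describe the same period.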

Calculation

The lookbehind period is measured backwards from the latest matching event. All matching events within this period are counted, and the count is divided by the number of seconds in the period. This per-second rate is then converted according to --rate and finally displayed as events per time unit in the check's output.
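
The calculation can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the plugin's implementation: the function name event_rate, the use of epoch seconds for timestamps, the per_minute default and the handling of events exactly on the window boundary are all assumptions.

```python
# Seconds per unit for each value of --rate.
SECONDS_PER_UNIT = {
    "per_second": 1,
    "per_minute": 60,
    "per_hour": 3600,
    "per_day": 86400,
    "per_week": 604800,
}

def event_rate(timestamps, lookbehind_seconds, rate="per_minute"):
    """Return the rate of events within the lookbehind period.

    timestamps: times (in seconds) of the matching events only.
    Hypothetical helper; boundary handling is an assumption.
    """
    if not timestamps:
        return 0.0
    window_end = max(timestamps)               # latest matching event
    window_start = window_end - lookbehind_seconds
    count = sum(1 for t in timestamps if t > window_start)
    per_second = count / lookbehind_seconds    # events per second
    return per_second * SECONDS_PER_UNIT[rate]

# 267 events spread over less than an hour, plus one event that is too old.
events = [10_000 + i * 10 for i in range(267)] + [1_000]
print(round(event_rate(events, 3600, "per_minute"), 2))  # 4.45
print(round(event_rate(events, 3600, "per_hour"), 2))    # 267.0
```

Note that changing the rate unit only rescales the result; the set of counted events depends on the lookbehind period alone.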

Examples

Simple Examples

./check_netapp_ems event-rate -H sim96
Rate of EMS events during the last hour: 4.45/minute
...

A first, probably not very useful example. It just calculates the number of events (any event!) per minute within the last hour.


./check_netapp_ems event-rate -H filer --name=wafl.vol.autoSize.done
Rate of wafl.vol.autoSize.done EMS events during the last hour: 0.01/minute
...

Monitors the number of wafl.vol.autoSize.done events.


./check_netapp_ems event-rate -H filer --name=wafl.vol.autoSize.done --rate=per_day
Rate of wafl.vol.autoSize.done EMS events during the last hour: 14.40/day
...

Same as above but displays the rate as number of autosize-events per day.

The calculation is still based on the last hour (the default value for --lookbehind.) See the next example on how to change that.


./check_netapp_ems event-rate -H filer --name=wafl.vol.autoSize.done --rate=per_day --lookbehind=1d
Rate of wafl.vol.autoSize.done EMS events during the last 24 hours: 13.82/day
...

Thresholds

./check_netapp_ems event-rate -H filer --name=wafl.vol.autoSize.done --rate=per_day --lookbehind=1d --warning=10 --critical=20
Rate of wafl.vol.autoSize.done EMS events during the last 24 hours: 13.82/day
...

This will result in a WARNING state. The actual value of 13.82 is compared against 10 for warning and 20 for critical (with a greater-than operator).


./check_netapp_ems event-rate -H filer --name=wafl.vol.autoSize.done --rate=per_day --lookbehind=1d --warning=5 --critical=1 --comparison=lt
Rate of wafl.vol.autoSize.done EMS events during the last 24 hours: 13.82/day
...

Probably not a useful check for this event, but it illustrates the principle: this results in an OK state. With --comparison=lt, the actual value of 13.82 is tested with a less-than operator against 5 for warning and 1 for critical, and it is below neither.
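
The two threshold examples can be reproduced with a short Python sketch. It assumes that the critical threshold is evaluated before the warning threshold and that the default comparison is called gt; the function name check_state is hypothetical and not taken from the plugin.

```python
import operator

# Comparison operators; "lt" appears in the example above,
# the name "gt" for the default is an assumption.
COMPARISONS = {"gt": operator.gt, "lt": operator.lt}

def check_state(value, warning=None, critical=None, comparison="gt"):
    """Map a rate to OK, WARNING or CRITICAL (hypothetical helper)."""
    violates = COMPARISONS[comparison]
    if critical is not None and violates(value, critical):
        return "CRITICAL"
    if warning is not None and violates(value, warning):
        return "WARNING"
    return "OK"

print(check_state(13.82, warning=10, critical=20))                  # WARNING
print(check_state(13.82, warning=5, critical=1, comparison="lt"))   # OK
print(check_state(0.5, warning=5, critical=1, comparison="lt"))     # CRITICAL
```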


Severity Filter (Loglevel)

./check_netapp_ems event-rate -H filer --rate=per_day --lookbehind=1d --warning=10 --critical=20 --severity=EMERGENCY
Rate of wafl.vol.autoSize.done EMS events during the last 24 hours: 13.82/day
...

Warns if the rate of events with the highest severity level, EMERGENCY, exceeds 10 per day, calculated over the last 24 hours.


./check_netapp_ems event-rate -H filer --rate=per_day --lookbehind=1d --warning=10 --critical=20 --severity='~^(ALERT|EMERGENCY)$'
Rate of wafl.vol.autoSize.done EMS events during the last 24 hours: 13.82/day
...

Same as above, but also consider ALERT-level messages. Please note the ~ in front of the regex!
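
How such a severity filter could behave is sketched below in Python. The function severity_matches is hypothetical. It assumes that a plain string must equal the severity as a whole and that a regex is applied as an unanchored search; only the case insensitivity and the role of the leading ~ are taken from the description above.

```python
import re

def severity_matches(pattern: str, severity: str) -> bool:
    """Return True if an event's severity passes the filter.

    Hypothetical helper: '~' marks a regex, anything else is a plain
    string. Both variants ignore case.
    """
    if pattern.startswith("~"):
        return re.search(pattern[1:], severity, re.IGNORECASE) is not None
    return pattern.lower() == severity.lower()

print(severity_matches("ALERT", "alert"))                      # True
print(severity_matches("~^(ALERT|EMERGENCY)$", "emergency"))   # True
print(severity_matches("~^(ALERT|EMERGENCY)$", "ERROR"))       # False
```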


Advanced Examples

Matching a Name (regex)

Using a regular expression (regex) allows you to monitor events that are similar but not identical. For example, to monitor any raid event:

./check_netapp_ems event-rate -H sim96 --name="~^raid\." --rate=per_hour
Rate of ~^raid\. EMS events during the last hour: 157.27/hour

Extending the regex to raid.rg.media_scrub narrows the count to media-scrub events only:

./check_netapp_ems event-rate -H sim96 --name="~^raid\.rg\.media_scrub" --rate=per_hour
Rate of ~^raid\.rg\.media_scrub EMS events during the last hour: 127.67/hour

Setting --name to a plain string (no tilde in front) counts only events whose name matches exactly:

./check_netapp_ems event-rate -H sim96 --name=raid.rg.media_scrub.done --rate=per_hour
Rate of raid.rg.media_scrub.done EMS events during the last hour: 37.44/hour

The ^ sign in front of the expression anchors it to the beginning of the text and ensures that only events whose name starts with raid are counted. Omitting the ^ would make the regex also match a name like sometext.raider.something, which is probably not what you intended.
Also do not forget to escape characters with a special meaning in a regex, such as the dot (which would otherwise match any character). Especially on the command line you should also quote the whole regex, as seen in the example above.
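
The effect of the anchor and of the escaped dot can be checked with Python's re module. The event names in this sketch are made up for illustration, and it assumes that the plugin applies the regex as an unanchored search, which is what the remark about omitting the ^ implies.

```python
import re

# Illustrative event names, not taken from a real EMS log.
names = [
    "raid.rg.media_scrub.done",
    "sometext.raider.something",
    "raidXvol.info",
]

def matching(regex):
    """Return the names in which the regex is found anywhere."""
    return [name for name in names if re.search(regex, name)]

print(matching(r"raid"))     # all three names
print(matching(r"^raid."))   # ['raid.rg.media_scrub.done', 'raidXvol.info']
print(matching(r"^raid\."))  # ['raid.rg.media_scrub.done']
```

Only the anchored pattern with the escaped dot restricts the match to names that begin with the literal text raid followed by a dot.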

./check_netapp_ems event-rate -H filer --name=netif.linkerrors --rate=per_hour --lookbehind=1h

Combined with --warning and --critical thresholds, this alerts you when the driver reports an excessive link error rate. Link errors include cyclic redundancy check (CRC) errors, runt frames, fragments, jabber, and alignment errors.

Consider changing --rate to per_second and reducing --lookbehind.

A regex also works for --name: --name="~^netif\.linkerrors\." See also Matching a Name (regex) above.