Checks the rate of specific events in the Event Management System log.
$ check_netapp_ems event-rate -H <host> [...] [--help]
This plugin reads the event log and counts the number of events within a given lookbehind-period.
A typical usage scenario is counting the number of autosize events within the last few hours. A high rate of such events can be a sign that volumes are becoming too small.
See also: EMS Reference for ONTAP 9.10.1
--name
Name of the EMS events whose rate should be calculated. If omitted, all events are counted. A string prefixed with a tilde (~string) is matched as a regular expression. See examples below.
--lookbehind
Time period for calculating the rate of matching EMS events. Must be a positive integer followed by a time unit: s(econd), min(ute), h(our), d(ay), w(eek). Defaults to 1h.
--rate
Rate used for presenting the result in the message and the thresholds. Can be per_second, per_minute, per_hour, per_day or per_week.
--warning / --critical
Thresholds for the rate. Each threshold is written as a plain number without a unit; the unit is taken from the --rate parameter.
--verbose
Shows the list of events that are taken into account when calculating the rate.
Examples:
--rate=per_second --warning=3 → warns if more than 3 events per second
--rate=per_week --warning=3 → warns if more than 3 events per week
--severity=<string or regex>
Severity filter: counts only events with the given log level. If prefixed with ~, the value is matched as a regular expression. Both string and regex matching are case-insensitive. For example, --severity=ALERT can also be written as --severity=alert. See examples below.
The available severity-strings can be listed with ontap> event catalog show -severity
For all other parameters consult --help on the command line.
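The filter rules for --name and --severity described above can be sketched in a few lines of Python. This is only an illustration, not the plugin's actual code: the function name filter_matches is hypothetical, Python's re module stands in for whatever regex engine the plugin uses, and treating --name as case sensitive is an assumption (the text above only states case-insensitivity for --severity).

```python
import re

def filter_matches(pattern, value, ignore_case=False):
    """Tilde-prefixed patterns are regexes; anything else must match exactly."""
    flags = re.IGNORECASE if ignore_case else 0
    if pattern.startswith("~"):
        # Unanchored search: the regex may match anywhere unless it uses ^ or $.
        return re.search(pattern[1:], value, flags) is not None
    if ignore_case:
        return pattern.lower() == value.lower()
    return pattern == value

# --severity style: case-insensitive (as documented above)
print(filter_matches("alert", "ALERT", ignore_case=True))                     # True
print(filter_matches("~^(alert|emergency)$", "EMERGENCY", ignore_case=True))  # True

# --name style: exact string versus regex (case sensitivity is an assumption)
print(filter_matches(r"~^raid\.", "raid.rg.media_scrub.done"))                # True
print(filter_matches("raid.rg", "raid.rg.media_scrub.done"))                  # False
```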
The lookbehind period starts from the latest matching event. All matching events within this period are counted, and the count is divided by the number of seconds in the period. This per-second rate is then converted according to --rate and finally displayed as events per time unit in the check's output.
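The calculation described above can be written down as a minimal Python sketch. The function names parse_lookbehind and event_rate are hypothetical and do not come from the plugin; the event count of 267 is an assumed value, chosen only because it yields a rate like the one in the first example below.

```python
import re

# Seconds per time unit; spellings follow the --lookbehind and --rate options.
UNIT_SECONDS = {"s": 1, "min": 60, "h": 3600, "d": 86400, "w": 604800}
RATE_SECONDS = {"per_second": 1, "per_minute": 60, "per_hour": 3600,
                "per_day": 86400, "per_week": 604800}

def parse_lookbehind(text):
    """Turn a string such as '1h' or '30min' into a number of seconds."""
    m = re.fullmatch(r"([1-9]\d*)(s|min|h|d|w)", text)
    if m is None:
        raise ValueError(f"invalid lookbehind: {text!r}")
    return int(m.group(1)) * UNIT_SECONDS[m.group(2)]

def event_rate(event_count, lookbehind, rate):
    """Events per second over the period, rescaled to the requested unit."""
    per_second = event_count / parse_lookbehind(lookbehind)
    return per_second * RATE_SECONDS[rate]

# 267 matching events (assumed) within one hour, shown per minute
print(f"{event_rate(267, '1h', 'per_minute'):.2f}/minute")  # prints 4.45/minute
```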
./check_netapp_ems event-rate -H sim96
Rate of EMS events during the last hour: 4.45/minute
...
A first, probably not very useful example. It just calculates the number of events (any event!) per minute within the last hour.
./check_netapp_ems event-rate -H filer --name=wafl.vol.autoSize.done
Rate of wafl.vol.autoSize.done EMS events during the last hour: 0.01/minute
...
Monitors the number of wafl.vol.autoSize.done events.
./check_netapp_ems event-rate -H filer --name=wafl.vol.autoSize.done --rate=per_day
Rate of wafl.vol.autoSize.done EMS events during the last hour: 14.40/day
...
Same as above, but displays the rate as the number of autosize events per day.
The calculation is still based on the last hour (the default value for --lookbehind). See the next example on how to change that.
./check_netapp_ems event-rate -H filer --name=wafl.vol.autoSize.done --rate=per_day --lookbehind=1d
Rate of wafl.vol.autoSize.done EMS events during the last 24 hours: 13.82/day
...
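The difference between the last two examples is worth spelling out: --rate only rescales the result, while --lookbehind changes the period the events are taken from. The rescaling part is plain unit conversion, sketched here in Python with a hypothetical helper convert_rate (not part of the plugin):

```python
# Seconds per --rate unit
SECONDS = {"per_second": 1, "per_minute": 60, "per_hour": 3600,
           "per_day": 86400, "per_week": 604800}

def convert_rate(value, from_unit, to_unit):
    """Rescale a rate from one --rate unit to another."""
    return value / SECONDS[from_unit] * SECONDS[to_unit]

# 0.01 events/minute, as in the earlier example, expressed per day
print(f"{convert_rate(0.01, 'per_minute', 'per_day'):.2f}/day")  # prints 14.40/day
```

Both figures describe the same events from the last hour. Only with --lookbehind=1d does the underlying data change, which is why the result above differs from the 14.40/day shown earlier.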
./check_netapp_ems event-rate -H filer --name=wafl.vol.autoSize.done --rate=per_day --lookbehind=1d --warning=10 --critical=20
Rate of wafl.vol.autoSize.done EMS events during the last 24 hours: 13.82/day
...
This will result in a WARNING state. The actual value of 13.82 is compared against 10 for warning and 20 for critical (with a greater-than operator).
./check_netapp_ems event-rate -H filer --name=wafl.vol.autoSize.done --rate=per_day --lookbehind=1d --warning=5 --critical=1 --comparison=lt
Rate of wafl.vol.autoSize.done EMS events during the last 24 hours: 13.82/day
...
Probably not a useful example for this value, but it explains the principle well: This would lead to an OK status. The actual value of 13.82 is compared with a less-than operator with 5 for warning and 1 for critical.
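The two threshold examples above can be reproduced with a short Python sketch. The function check_status is hypothetical and covers only the two comparison modes shown in this document (the default greater-than and --comparison=lt); --help lists what the plugin actually accepts.

```python
import operator

# Only the two comparison modes that appear in the examples above.
COMPARISONS = {"gt": operator.gt, "lt": operator.lt}

def check_status(value, warning, critical, comparison="gt"):
    """Return 'CRITICAL', 'WARNING' or 'OK' for a rate and its thresholds."""
    violates = COMPARISONS[comparison]
    if violates(value, critical):   # critical is checked first
        return "CRITICAL"
    if violates(value, warning):
        return "WARNING"
    return "OK"

print(check_status(13.82, warning=10, critical=20))                  # prints WARNING
print(check_status(13.82, warning=5, critical=1, comparison="lt"))   # prints OK
```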
./check_netapp_ems event-rate -H filer --rate=per_day --lookbehind=1d --warning=10 --critical=20 --severity=EMERGENCY
Rate of wafl.vol.autoSize.done EMS events during the last 24 hours: 13.82/day
...
Warn if more than 10 events with the highest level of EMERGENCY are found in the last 24h.
./check_netapp_ems event-rate -H filer --rate=per_day --lookbehind=1d --warning=10 --critical=20 --severity='~^(ALERT|EMERGENCY)$'
Rate of wafl.vol.autoSize.done EMS events during the last 24 hours: 13.82/day
...
Same as above, but also considers ALERT-level messages. Please note the ~ in front of the regex!
Using a regular expression (regex) allows you to monitor similar but not identical events. For example, to monitor any raid event:
./check_netapp_ems event-rate -H sim96 --name="~^raid\." --rate=per_hour
Rate of ~^raid\. EMS events during the last hour: 157.27/hour
Extending the regex to raid.rg.media_scrub narrows the count to media-scrub events only:
./check_netapp_ems event-rate -H sim96 --name="~^raid\.rg\.media_scrub" --rate=per_hour
Rate of ~^raid\.rg\.media_scrub EMS events during the last hour: 127.67/hour
Setting --name to a plain string (no tilde in front) counts only events whose name matches exactly:
./check_netapp_ems event-rate -H sim96 --name=raid.rg.media_scrub.done --rate=per_hour
Rate of raid.rg.media_scrub.done EMS events during the last hour: 37.44/hour
Using the ^ sign in front of the expression anchors it to the beginning of the text and ensures that only events whose name starts with raid are counted. Omitting the ^ would make the regex also match a name like sometext.raider.something, which is probably not what you intended.
Also do not forget to escape regex-active characters like the dot (which would otherwise match any character). Especially on the command line you should also quote the whole regex, as seen in the examples above.
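The effect of anchoring and escaping can be tried out in Python before putting a regex into a check. This is only a sketch: Python's re.search stands in for the plugin's regex engine, and the event name raidx.volume.full is made up for illustration.

```python
import re

names = [
    "raid.rg.media_scrub.done",
    "sometext.raider.something",
    "raidx.volume.full",          # made-up name, for illustration only
]

def matching(pattern):
    """Return the names in which the regex is found anywhere."""
    return [n for n in names if re.search(pattern, n)]

print(matching(r"raid"))     # all three names: no anchor, no dot
print(matching(r"^raid."))   # first and third: the unescaped dot also matches 'x'
print(matching(r"^raid\."))  # only raid.rg.media_scrub.done
```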
./check_netapp_ems event-rate -H filer --name=netif.linkerrors --rate=per_hour --lookbehind=1h
This will warn you when the driver detects an excessive link error rate. Link errors include cyclic redundancy check (CRC) errors, runt frames, fragments, jabber, and alignment errors.
You may consider changing the rate to --rate=per_second and reducing the --lookbehind.
For --name a regex will also work: --name="~^netif\.linkerrors\."
See also Matching a Name (regex) above.