Monitoring Dynamics AX with Nagios

Since writing about how I monitor Dynamics AX batch processes from a more technical perspective, I have been asked how to set up monitoring for a complete Dynamics AX system.
Many years ago, Scott introduced me to Nagios, an open-source product designed to be able to monitor anything. At the office, we use it to monitor practically everything, and to fulfil Sarbanes-Oxley requirements, such as monitoring temperature or backup reliability. If we could, we'd monitor the coffee machine with Nagios.
Naturally, since we've built our monitoring around Nagios, we use it to monitor our Dynamics AX environment as well. Much of the information here could be adapted for other systems, but the focus of this article is simple monitoring for Dynamics AX.
Background
Firstly, there are some fundamentals that need to be covered. We run Nagios on a Linux server (running the Slackware Linux distribution). Nagios, in a nutshell, simply runs commands designed to return a status and a single line of information per service or host. Nagios can cope with an enormous number of hosts, and each host can contain many services. A more detailed explanation of the core functionality can be found in the Nagios documentation itself.
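To make the "status and a single line" idea concrete, here is a minimal sketch in Python of what a check command boils down to. The exit codes (0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN) are the standard Nagios plug-in convention. The function names and the "higher is worse" threshold logic are my own illustration, not code taken from any of the plug-ins mentioned in this article.

```python
# Exit codes Nagios expects from any check command.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(value, warn, crit):
    """Map a measurement onto a status, assuming higher values are worse."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def run_check(value, warn, crit, what="usage"):
    """Return (exit_code, status_line) for one service (hypothetical helper)."""
    status = classify(value, warn, crit)
    return status, f"{LABELS[status]} - {what} at {value:.0f}%"


# Example: CPU load of 85% against warning/critical thresholds of 50/80.
code, line = run_check(85, 50, 80, "CPU load")
print(line)  # the single line Nagios would display
print(code)  # a real plug-in would pass this to sys.exit()
```

A real plug-in would print the line and exit with the code; everything else in Nagios (notifications, escalations, dependencies) is driven from those two pieces of information.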
We monitor availability of hosts (servers, network equipment and network appliances) by pinging, and specific traits of those hosts such as resource utilisation, temperature/health, connectivity and service availability. In many cases this extends to service operation, such as not just connecting to an LDAP service, but performing a query too.
Beyond this, we use Nagios’ ability to define dependencies to highlight potential problems with other systems when one system becomes unavailable. We also use the escalation configuration, whereby ongoing (or critical) faults escalate to sending an SMS to the appropriate people using Štěpán Roh’s smsd and a Siemens MC35T GSM modem.
To monitor Windows servers, we use NRPE-NT, the Windows variant of the Nagios Remote Plug-in Executor. This simply runs as a Windows service, and securely allows us to execute any service check plug-ins we like. All of the check plug-ins we use, including our own, are available on the Nagios Exchange website.
Configuring NRPE-NT for the server
Since the core Nagios system will actually be asking NRPE-NT on the server to execute check commands, let's examine the configuration of NRPE-NT first.
Firstly, you should install NRPE-NT on each server you're monitoring. Unzip the installation set from their website anywhere you like, and from the command prompt run nrpe_nt -i to install the service. Beyond that, everything is configured in the file “nrpe.cfg” (which you can simply edit with Notepad), but be aware that any change to this file requires you to restart the “Nagios Remote Plugin Executor for NT/W2K” service before it comes into effect.
We monitor the server itself using the basic NRPE-NT plug-ins, which allow us to monitor simple resources such as CPU utilisation, memory, disk space, and individual services.
Here is an example basic configuration for a server:
server_port=5666
# If you have multiple addresses, set the listening address here
server_address=172.16.90.34
# For security, I recommend you set this. This defines the
# IP-address of your Nagios server. You can put more than one
# address here, separated by commas.
allowed_hosts=172.16.90.30,172.17.0.30
# This security feature disables accepting command arguments.
# Read the NRPE-NT manual very carefully before enabling this
# as there are security implications.
dont_blame_nrpe=0
# This enables debugging, which can generate a lot of logs.
# In a normal production environment, you would leave this off.
debug=0
# This defines how long NRPE-NT will wait for a command to run.
# If the command has not finished by the time this setting, in
# seconds, has elapsed, then NRPE-NT will kill the check command
# and return a bad status to Nagios.
command_timeout=30
# This is the command 'nt_check_disk_c' which checks the
# available storage disk space on drive 'C:'. We want warnings at
# 70% disk-usage, and critical alerts at 90% disk-usage.
command[nt_check_disk_c]=c:\nrpe_nt\Plugins\diskspace_nrpe_nt.exe c: 70 90
# For an SQL server, you'll need a few more of those lines!
#command[nt_check_disk_d]=c:\nrpe_nt\Plugins\diskspace_nrpe_nt.exe d: 70 90
#command[nt_check_disk_e]=c:\nrpe_nt\Plugins\diskspace_nrpe_nt.exe e: 70 90
# etc...
# This command checks CPU usage. Warnings at 50%, critical alerts
# at 80%. In multiple CPU/core systems, this is the total usage
# from all logical CPUs in the system.
command[nt_cpuload]=c:\nrpe_nt\Plugins\cpuload_nrpe_nt.exe 50 80
# This command checks memory load. Because of cache, it can be
# normal for seemingly high memory use. Because of this, we check
# for warnings at 95% and critical alerts at 99% memory
# utilisation.
command[nt_memload]=c:\nrpe_nt\Plugins\memload_nrpe_nt.exe 95 99
# I use terminal services to maintain the server, so let's
# monitor that service. As a convention, I name the command after
# the service's executable name, but the check plug-in looks at
# the name of the service as Windows displays it.
command[nt_service_termsvcs]=c:\nrpe_nt\Plugins\service_nrpe_nt.exe "Terminal Services"
Configuring NRPE-NT for the AOS
Both Axapta 3.0 and Dynamics AX 4.0 can be monitored quite closely. Via NRPE-NT, I monitor the service itself, along with monitoring the number of clients (Axapta 3.0) and sessions (Dynamics AX 4.0). Monitoring client/session counters is done through the wincheck_counter plug-in, which grabs data from Windows Performance Monitor counters.
Note that if you're running Dynamics AX 4.0 and you connect to the server using remote desktop, you won't see any instances available for AX. So long as you can see them in Performance Monitor when directly using the server's console, don't worry as the wincheck_counter plug-in will see them too.
If you're using Axapta 3.0, you can add the following to your NRPE-NT configuration:
command[nt_service_axaos]=c:\nrpe_nt\Plugins\service_nrpe_nt.exe "Axapta Object Server"
# This check returns the number of clients on the server.
# Here I'm referring to 'LIVE' as the server's instance name,
# with warnings when there are 40 clients, and critical alerts
# at 50 clients. You'll need to change this, obviously.
command[axaos_clients]=c:\nrpe_nt\Plugins\wincheck_counter.exe "Navision Axapta Object Server" -P "Clients" -f "%.0f clients online" -I "LIVE" -w 40 -c 50
If you're using Dynamics AX 4.0, add the following to your NRPE-NT configuration:
# This checks the AOS service for the
# instance called 'LIVE'. The double dollar-signs are
# intentional!
command[nt_service_axaos]=c:\nrpe_nt\Plugins\service_nrpe_nt.exe "Dynamics Server$$01-LIVE"
# This checks the number of sessions online. This includes
# clients, workers, servers, .NET/COM connectors, etc.
# Make sure the instance number matches the instance you're
# monitoring (here it's "01"). This line will generate
# warnings at 40 sessions, and critical alerts at 50 sessions.
command[axaos_clients]=c:\nrpe_nt\Plugins\wincheck_counter.exe "Microsoft Dynamics AX Object Server" -P "ACTIVE SESSIONS" -f "%.0f clients online" -I "01" -w 40 -c 50
Configuring NRPE-NT for the SQL server
Monitoring the SQL server is much the same. You'll need different services, obviously, but you can also monitor the server for adverse performance conditions if you're adventurous.
Below is a simple example based on a Microsoft SQL Server 2005 installation that just monitors services.
# Check that both SQL Server and the
# SQL Server Agent are running.
command[nt_service_sqlservr]=c:\nrpe_nt\Plugins\service_nrpe_nt.exe "SQL Server (MSSQLSERVER)"
command[nt_service_sqlagent]=c:\nrpe_nt\Plugins\service_nrpe_nt.exe "SQL Server Agent (MSSQLSERVER)"
Configuring NRPE-NT with batch monitoring
To monitor batch processes in Dynamics AX 4.0, you can use my plug-in from my previous article. This requires the .NET Business Connector to be installed, and obviously configured to point to your live environment. For convenience, we run these on the same machine that runs the batch processor.
To make maintenance easier for myself, I name these after the class name in Dynamics AX, but keep in mind that there is a limit of 31 characters for a command's name in NRPE-NT, and any commands with longer names seem to disappear. Also be aware that the names are case-sensitive.
Here's an example, monitoring two of the most common batch jobs:
# This is based on running them every 5 minutes, therefore if the
# job has stalled we want warnings after 10 minutes (600 seconds)
# and critical alerts after an hour (3600 seconds). We check the
# jobs have run in the company 'foo'.
command[axbatch_EventJobCUD]=c:\nrpe_nt\Plugins\checkdaxbatch.exe EventJobCUD 600 3600 foo
command[axbatch_EventJobDueDate]=c:\nrpe_nt\Plugins\checkdaxbatch.exe EventJobDueDate 600 3600 foo
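Because commands with over-long names seem to vanish without any error, it can be worth checking nrpe.cfg before restarting the service. The following is a rough sketch in Python, assuming the 31-character limit described above and the `command[name]=...` line format used throughout this article. The function name and the second sample command are hypothetical, made up purely for illustration.

```python
import re

# Limit reported above for NRPE-NT command names.
MAX_NAME_LENGTH = 31

# Matches lines such as: command[nt_cpuload]=c:\path\plugin.exe 50 80
# Commented-out lines (starting with '#') deliberately do not match.
COMMAND_LINE = re.compile(r"^\s*command\[([^\]]+)\]\s*=")


def find_long_command_names(config_text, limit=MAX_NAME_LENGTH):
    """Return the command names in nrpe.cfg text that exceed the limit."""
    too_long = []
    for line in config_text.splitlines():
        match = COMMAND_LINE.match(line)
        if match and len(match.group(1)) > limit:
            too_long.append(match.group(1))
    return too_long


# The first line is from this article; the second is an invented bad example.
sample = "\n".join([
    r"command[axbatch_EventJobCUD]=c:\nrpe_nt\Plugins\checkdaxbatch.exe EventJobCUD 600 3600 foo",
    r"command[axbatch_AVeryLongHypotheticalBatchClassName]=c:\nrpe_nt\Plugins\checkdaxbatch.exe X 600 3600 foo",
])
print(find_long_command_names(sample))
```

In practice you would read the text from your own nrpe.cfg rather than a sample string, and shorten any names it reports.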
Configuring Nagios
I won't go into detail on configuring Nagios itself, since Nagios comes with its own manual and example configuration. I will show some basic elements you will need to configure, to point you in the right direction.
Initially you'll need some command definitions to make calls to NRPE-NT on your servers. You'll need the standard Nagios plug-ins for the following to work:
# Run a named command via NRPE. The value
# of $USER1$ is defined in the Nagios example config!
define command {
command_name check_via_nrpe
command_line $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
}
# When checking batch processes, sometimes it can take
# longer than expected. To simplify configuration, this
# command definition can be used which adds a prefix to
# the NRPE-NT command and increases the normal timeout.
define command {
command_name check_axbatch_job
command_line $USER1$/check_nrpe -H $HOSTADDRESS$ -c axbatch_$ARG1$ -t 60
}
# To check the RPC port for Dynamics AX/Axapta, this will
# attempt a simple TCP connection to the standard port.
define command {
command_name check_axapta
command_line $USER1$/check_tcp -H $HOSTADDRESS$ -p 2712
}
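For readers unfamiliar with what a plain TCP check actually proves, here is a small Python sketch of the same idea as the check_axapta command: open a connection to the port and report success or failure. This is only an illustration and not how the real check_tcp plug-in is implemented; the function name and the default timeout are my own assumptions.

```python
import socket


def tcp_port_check(host, port=2712, timeout=10.0):
    """Return (exit_code, status_line) after attempting a TCP connection."""
    try:
        # create_connection raises OSError on refusal, timeout or DNS failure.
        with socket.create_connection((host, port), timeout=timeout):
            return 0, f"TCP OK - connected to {host}:{port}"
    except OSError as error:
        return 2, f"TCP CRITICAL - {host}:{port} unreachable ({error})"
```

Calling `tcp_port_check("172.16.90.34")` would test the AOS host used in this article. Note that a successful connection only shows that something is listening on the port, which is why the service and session checks via NRPE-NT are still worth having.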
Beyond this, you'll need to write a host definition for each server:
define host {
host_name aos_server
alias AOS Server
address 172.16.90.34
# ...
# Many other definitions are missing. Please RTFM!
}
Each host contains several services which you'll also need to define. I always define several basic checks for the server in NRPE-NT, as you've seen above, so a server will usually contain these definitions:
# Check memory
define service {
host_name aos_server
service_description Memory
check_command check_via_nrpe!nt_memload
}
# Check CPU
define service {
host_name aos_server
service_description CPU
check_command check_via_nrpe!nt_cpuload
}
# Check disk C:
define service {
host_name aos_server
service_description Disk C:
check_command check_via_nrpe!nt_check_disk_c
}
# Check terminal services service
define service {
host_name aos_server
service_description RDP
check_command check_via_nrpe!nt_service_termsvcs
}
You'll need to extend these definitions or use templates, as described in the Nagios documentation.
You can see the correlation between the command names defined in NRPE-NT and the check commands used in Nagios itself. Continue in this manner for the remaining checks, except for the RPC port on the AOS, which is checked directly:
define service {
host_name aos_server
service_description RPC
check_command check_axapta
}
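If the link between a service's check_command and the command definitions is not obvious, the following Python sketch mimics how Nagios turns one into the other: the text before the first "!" selects the command, and the remaining "!"-separated parts become $ARG1$, $ARG2$ and so on. This is a simplified illustration and not Nagios code. The command lines are copied from the definitions above, while the function name and the value used for $USER1$ are assumptions (your resource file defines the real plug-in path).

```python
# Command lines copied from the command definitions shown earlier.
COMMANDS = {
    "check_via_nrpe": "$USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$",
    "check_axbatch_job": "$USER1$/check_nrpe -H $HOSTADDRESS$ -c axbatch_$ARG1$ -t 60",
    "check_axapta": "$USER1$/check_tcp -H $HOSTADDRESS$ -p 2712",
}


def expand(check_command, host_address, user1="/usr/local/nagios/libexec"):
    """Expand a check_command value into the command line that would run.

    The default for user1 is an assumed plug-in directory.
    """
    name, *args = check_command.split("!")
    line = COMMANDS[name]
    macros = {"$USER1$": user1, "$HOSTADDRESS$": host_address}
    for index, arg in enumerate(args, start=1):
        macros[f"$ARG{index}$"] = arg
    for macro, value in macros.items():
        line = line.replace(macro, value)
    return line


print(expand("check_via_nrpe!nt_memload", "172.16.90.34"))
print(expand("check_axbatch_job!EventJobCUD", "172.16.90.34"))
```

The second call shows why the batch commands in nrpe.cfg carry the axbatch_ prefix: the Nagios service only needs to name the class, and the command definition adds the prefix and the longer timeout.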
For the SQL server, continue in the same way, with the names changed to match its NRPE-NT commands.
Conclusion
Monitoring can become a complex beast, so I strongly recommend you read up on Nagios. Monitoring systems like this is worth the learning curve and time invested, because it's not a great idea to rely on users to report system downtime or potential problems.