Note: While this article may still be relevant, there have been a number of updates since it was written in September 2015, including some rebranding. Some links and text have been updated to reflect those changes, and anyone interested in this content is encouraged to check out the newer features of Checkmk by tribe29.
Checkmk (OMD) is an open source performance and fault monitoring tool based on Nagios core, capable of both agent-based and agent-free monitoring. A library of plug-ins is available to monitor many types of applications, but sometimes you might need to write your own local checks. The intent of this article is to guide you through the steps necessary to create and deploy a custom health check.
There are several reasons to write your own checks:
- You want to define your check parameters in WATO rather than locally on each target host.
- You want to exploit currently unused information already sent by an agent (for example, Windows' numerous performance counters)
- You want to implement SNMP based checks.
- You want your check to be easily ported to other installations of Checkmk.
- You want your check to become an official part of Checkmk.
- You are simply interested in how Checkmk works.
So, what do you need to know?
First, local checks require that the Checkmk agent is installed on the host to be monitored (in most cases). The agent scans a local directory on the target host and executes any scripts it finds. By default, this directory is /usr/lib/check_mk_agent/local on Linux. On Windows, the scripts are run from the “local” folder under the same installation path as the check_mk_agent.exe file; you'll have to create that folder manually. On Linux, you can find the local checks path by running the following command (the second line is its output):
    check_mk_agent | head | grep Local
    LocalDirectory: /usr/lib/check_mk_agent/local
Next, you can write the health check in the programming language of your choice, so long as you satisfy the language's dependencies and the output requirements of Checkmk. Remember, the script is run on the host where it is deployed. This example will be written in Bash.
The agent expects a response in a timely manner. This means your check should finish before the next scheduled check, which is typically one or two minutes later by default. Don't fret, though: as of version 1.2.3i1 there is a convention for checks that run longer. Under your local check folder, simply create a new folder named with the number of seconds to cache your response. For our example, we installed the script in a folder named 1800 under the local check folder.
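The caching convention amounts to a subfolder whose name is the cache interval in seconds. As a minimal sketch, assuming the default Linux local directory and a hypothetical helper named install_cached_check (it is not part of Checkmk), installing a script this way could look like the following:

```shell
#!/bin/bash
# Sketch only: install a local check into a cache-interval subfolder.
# install_cached_check is a hypothetical helper, not a Checkmk command.
install_cached_check() {
    local script="$1"      # path to the check script
    local local_dir="$2"   # the agent's local directory
    local interval="$3"    # cache time in seconds, used as the folder name

    # The folder name must be a plain number of seconds.
    case "$interval" in
        ''|*[!0-9]*)
            echo "interval must be a whole number of seconds" >&2
            return 1
            ;;
    esac

    mkdir -p "$local_dir/$interval" || return 1
    cp "$script" "$local_dir/$interval/" || return 1
    # The agent can only run the script if it is executable.
    chmod 755 "$local_dir/$interval/$(basename "$script")"
}
```

A call such as `install_cached_check ./script.sh /usr/lib/check_mk_agent/local 1800` would then place the script where the agent caches its output for 30 minutes. Writing to that directory usually requires root privileges.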
Finally, you need to output four columns of text, separated by a single space. The fourth column is the only one that allows additional spaces.
Column 1: Nagios status – 0 for OK, 1 for Warning, 2 for Critical, and 3 for Unknown.
Column 2: Check Name – This is the service name of the check.
Column 3: Performance data – You don't have to provide performance data, but if you do, this is the column. You set the key=value pairs here, along with the warn, crit, min and max values. Warn, crit, min and max values are optional.
Column 4: Check output, which will be displayed in the status detail column.
There are some limitations, though. On Windows, the output must contain only ASCII characters; UTF-8 and Unicode are not allowed. In versions before 1.1.5, you can only send one performance variable. Lastly, warning and critical levels have to be configured and handled on the host itself.
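Because the levels are handled on the host, the script itself has to turn a measured value into a status and print the four columns. The following is a minimal sketch of that idea. The function names, the check name DemoCheck, and the sample values are invented for illustration, and the comparison assumes integer values where higher is worse:

```shell
#!/bin/bash
# Sketch only: derive the status on the host and print one local check line.
# Function names and sample values are hypothetical.

# Map an integer value to a Nagios status (higher values are worse).
status_from_thresholds() {
    local value="$1" warn="$2" crit="$3"
    if [ "$value" -ge "$crit" ]; then
        echo 2
    elif [ "$value" -ge "$warn" ]; then
        echo 1
    else
        echo 0
    fi
}

# Print the four columns, separated by single spaces.
# Only the last column (the text) may contain spaces.
local_check_line() {
    local status="$1" name="$2" perfdata="$3" text="$4"
    echo "$status $name $perfdata $text"
}

value=420
status=$(status_from_thresholds "$value" 300 600)
local_check_line "$status" "DemoCheck" "milliseconds=$value;300;600" "Response took $value ms"
# prints: 1 DemoCheck milliseconds=420;300;600 Response took 420 ms
```

Keeping the status logic in a small function makes the thresholds easy to change in one place, since they cannot be adjusted centrally for a local check.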
Here's an excerpt from our code showing the output for the success and failure cases:
    echo "$status IPSEndToEnd milliseconds=$finalMs;300000;600000|retries=$COUNT $statustxt - IPS End to End test successful"

    # Status check failed
    status=2
    statustxt=CRITICAL
    echo "$status IPSEndToEnd milliseconds=$finalMs;300000;600000|retries=$COUNT $statustxt - IPS End to End test failed to complete"
Our use case required us to test the end-to-end health of our Intrusion Prevention product (IPS). The transactional nature of the check drove our decision to script this as a local check. We also decided it wasn't necessary to run the check on the target host because of the distributed nature of the application.
Our check will:
- Simulate an attempted intrusion against a client VM
- Track the response in a back-end database
- Provide “response time in milliseconds” and “number of retries” as performance data to Checkmk
- Cache the results for 30 minutes (1800 seconds) using the /local/1800/script.sh path. This is where we use the convention from earlier to make the check run every 30 minutes instead of on the standard check interval.
Now, let's look at the code that puts this check to work.
First, we initialize some variables that will be used to set up our query:
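    #!/bin/bash
    currentTime=$(/bin/date +%s)
    startSearch=$((currentTime * 1000))    # search window start, in milliseconds
    endSearch=$((startSearch + 900000))    # window end, 15 minutes later
    target="SOMEHOST"
    api_key="0ur-Un1qu3-Gu1d-h3r3"
    host="some.database.ctl.io"
    # URL-encoded query; the shell variables are expanded in place
    # rather than sent as literal %24-encoded names
    query="hostName%3a${target}+AND+created%3a%3c${startSearch}+TO+${endSearch}%3e"
    numRetries=60
    port=1234
    database="somewhereWeStoreData"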
Next, we attack our target VM:
    # Attack the test VM
    #echo "Attacking $target on port $port"
    echo "suspicous-string" | /usr/bin/nc "$target" "$port"
Now that we've set up the attack, we watch our database. We expect one or more new records to appear in our database within 15 minutes of the attack in the worst case. This indicates that:
- The IPS detected the event
- Backend systems registered the event
- Notification was sent to the client
The code to watch the database for new messages is:
    COUNT=0
    while [ "$COUNT" -lt "$numRetries" ]; do
        # Query the database; discard curl's error output
        RESPONSE=$(/usr/bin/curl -s -i "https://$host/v1/$database?query=$query" -u "$api_key:" 2>/dev/null)
        # A non-zero "count" means at least one matching record exists
        if echo "$RESPONSE" | grep -q '"count":[1-9]'; then
            finalDate=$(/bin/date +%s)
            finalMs=$((finalDate * 1000 - startSearch))
            status=0
            statustxt=OK
            echo "$status IPSEndToEnd milliseconds=$finalMs;300000;600000|retries=$COUNT $statustxt - IPS End to End test successful"
            exit 0
        fi
        COUNT=$((COUNT+1))
        sleep 15
    done

    # Status check failed: report the time spent waiting
    finalDate=$(/bin/date +%s)
    finalMs=$((finalDate * 1000 - startSearch))
    status=2
    statustxt=CRITICAL
    echo "$status IPSEndToEnd milliseconds=$finalMs;300000;600000|retries=$COUNT $statustxt - IPS End to End test failed to complete"
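The test inside the loop is the part most worth trying on its own, since it decides whether the check succeeds. Here is a minimal sketch, assuming the response body contains a field written exactly as "count":<number> with no whitespace. The helper name has_records and the sample JSON bodies are invented, because the real response format is not shown here:

```shell
#!/bin/bash
# Sketch only: decide whether a response reports at least one record.
# has_records and the sample bodies below are hypothetical.
has_records() {
    # Succeeds when "count" is followed by a number that does not start
    # with 0, so multi-digit counts such as 12 are accepted as well.
    printf '%s' "$1" | grep -Eq '"count":[1-9][0-9]*'
}

if has_records '{"count":0,"rows":[]}'; then echo "match"; else echo "no match"; fi
# prints: no match
if has_records '{"count":12,"rows":[]}'; then echo "match"; else echo "no match"; fi
# prints: match
```

Anchoring the pattern on a leading digit from 1 to 9 avoids depending on how many digits the count has or on which character follows it.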
That's all there is to it, besides installing it in the agent's local folder. After a short time you should begin seeing performance data for the newly-configured check.
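Before waiting for the data to show up, it can help to confirm that the script prints a line in the expected four-column shape. The following sketch uses a hypothetical helper, is_valid_local_line, whose pattern only covers the layout described in this article (a status from 0 to 3, a name without spaces, a performance data column without spaces, and free text):

```shell
#!/bin/bash
# Sketch only: sanity-check one line of local check output.
# is_valid_local_line is a hypothetical helper, not a Checkmk tool.
is_valid_local_line() {
    printf '%s\n' "$1" | grep -Eq '^[0-3] [^ ]+ [^ ]+ .+$'
}

line="0 IPSEndToEnd milliseconds=1200;300000;600000|retries=0 OK - IPS End to End test successful"
if is_valid_local_line "$line"; then
    echo "output format looks valid"
else
    echo "output format looks wrong" >&2
fi
```

The same function could be fed the real script's output, for example `is_valid_local_line "$(./script.sh)"`, keeping in mind that this particular check can take up to 15 minutes to finish.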
There are many other components of the Checkmk Project to consider. For more detailed information, visit the new official Homepage of Checkmk.