I’ve been keeping an eye on Reconnoiter for a while now, and decided to start getting my feet wet.
Figuring out how to get the complex event processing working is a daunting task, as there is no documentation on how it works or how to use it with the Reconnoiter distribution. To understand what’s available, you have to dig into the Java code in the Reconnoiter source tree and into the Esper documentation. To me, this is the biggest barrier to adoption of Reconnoiter. Without monitoring and alerting, all you can do is graphing and trending, which, although slick, you can get with other products (e.g. Graphite) and which doesn’t really showcase the power of Reconnoiter.
My goal is to put together basic functionality that handles the most common monitoring/alerting use cases. To me, this means:
- Triggering events when a metric is unavailable (both in the case where noitd can’t retrieve the data from the target, and if noitd fails to deliver a metric to stratcond). (Included in queries below)
- Being able to trigger events when thresholds are over/under a given value for a metric. (Included in queries below)
- An alerting system that receives messages from the AMQP stream and acts intelligently with them. Typically this is where systems like Nagios have shined. I’ll start with just a simple AMQP to e-mail gateway and add more bells and whistles as we go (flap detection, hysteresis, maintenance windows, acknowledgements, etc.). Done! See below!
- A dashboard with the current status of selected metrics.
Here’s what I have so far:
<queries master="iep">
<statement id="6cc613a4-7f9c-11de-973f-db7e8ccb2e5c" provides="CheckDetails-ddl">
<epl>create window CheckDetails.std:unique(uuid).win:keepall() as NoitCheck</epl>
</statement>
<statement id="76598f5e-7f9c-11de-9f5b-ebb4dcb2494e" provides="CheckDetails">
<requires>CheckDetails-ddl</requires>
<epl>insert into CheckDetails select * from NoitCheck</epl>
</statement>
<!--
Create a stream of checks that are unavailable (noitd tried to gather data and the
target didn't respond)
-->
<statement id="ba189f08-7f99-11de-9013-733772d37479" provides="UnavailableStream">
<requires>CheckDetails</requires>
<epl>insert into UnavailableStream
select p.* as delta, cds.target as target, cds.module as module,
cds.name as name, p.s.uuid as uuid
from pattern [ every
s=NoitStatus(availability='A') ->
( n0 = NoitStatus(uuid=s.uuid, availability='U')
and not NoitStatus(uuid=s.uuid, availability='A'))
].std:lastevent() as p
inner join CheckDetails as cds on cds.uuid = p.s.uuid
</epl>
</statement>
<!--
Create a stream of alerts on the noit.alerts.status AMQP topic of the
noit.alerts exchange.
-->
<query id="ce6bf8d2-3dd7-11de-a45c-a7df160cba9e" topic="status">
<epl>select * from NoitStatus</epl>
</query>
<!--
Emit the unavailable metrics on the noit.alerts.unavailable AMQP
topic of the noit.alerts exchange
-->
<query id="f4329df0-89a7-4299-ba0d-23caa51213ef" topic="unavailable">
<requires>UnavailableStream</requires>
<epl>select * from UnavailableStream</epl>
</query>
<!--
The following query will emit when a metric hasn't been seen in 5 minutes.
This is a bit of a big hammer, it's probably better to have queries specific to certain
metrics rather than complaining about everything at once. Also, it won't start
emitting until it's seen a metric at least once. I think. It will push messages
into the noit.alerts.out_to_lunch AMQP topic on the noit.alerts exchange
-->
<query id="a4d8d60b-b6c9-473d-87a8-af54554ee05e" topic="out_to_lunch">
<epl>
select * from NoitMetricNumeric.std:groupwin(uuid,name).win:time(5 minutes).std:lastevent().std:size()
where size = 0 group by uuid, name
</epl>
</query>
<!--
The next statement populates the ThresholdChange stream with data for the check with uuid
1b4e28ba-2fa1-11d2-883f-b9b761bde3fb. It looks at the metric named average and determines the
state each time the check reports.
-->
<statement id="514d88d5-2212-42a8-bfdf-4d4fb4c148f1">
<epl>
insert into ThresholdChange
select *, case when value > 0.000821 then 'bad' else 'good' end as status from
NoitMetricNumeric(uuid='1b4e28ba-2fa1-11d2-883f-b9b761bde3fb',name='average')
</epl>
</statement>
<!-- The following query will trigger on each state change for the watched metrics -->
<query id="64368493-e8da-4ec6-925b-41616b49484b" topic="threshold">
<epl><![CDATA[
select tc.name as metric,tc.uuid as uuid,tc.noit as noit,tc.time as time,tc.value as value,
tc.status as status,cds.name as check,cds.module as module,cds.prefix as prefix,
cds.target as target from ThresholdChange.std:groupwin(uuid,name).win:length(2) tc,
CheckDetails cds where cds.uuid = tc.uuid and tc.status <> prev(tc.status)
]]>
</epl>
</query>
</queries>
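To make the intent of the last query more concrete, here is a rough emulation of it in plain PHP rather than Esper. It is only a sketch: the function name stateChanges and the sample events are made up for illustration, and it assumes that the first event seen for a check/metric pair is not reported because there is no previous status to compare it with.

```php
<?php
// Emulates the intent of the "threshold" query in plain PHP (not Esper):
// remember the last status per (uuid, metric name) and keep only the events
// whose status differs from the previous one for the same key.
function stateChanges(array $events): array
{
    $last = [];
    $changes = [];
    foreach ($events as $event) {
        $key = $event['uuid'] . '|' . $event['name'];
        // Assumption: the very first event for a key has nothing to compare
        // against, so it is skipped.
        if (isset($last[$key]) && $last[$key] !== $event['status']) {
            $changes[] = $event;
        }
        $last[$key] = $event['status'];
    }
    return $changes;
}

// Made-up sample data for one check and one metric.
$uuid = '1b4e28ba-2fa1-11d2-883f-b9b761bde3fb';
$events = [];
foreach (['OK', 'OK', 'WARNING', 'WARNING', 'OK'] as $i => $status) {
    $events[] = ['uuid' => $uuid, 'name' => 'average', 'time' => $i, 'status' => $status];
}
foreach (stateChanges($events) as $change) {
    echo "t={$change['time']} {$change['name']} changed to {$change['status']}\n";
}
```

Of the five sample events, only the two transitions (OK to WARNING, then WARNING back to OK) come out the other end, which is what keeps the AMQP topic quiet while a metric stays in the same state.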
I’ll keep updating this page as I figure out how to get the various pieces glued together. Cheers!
Update: Here’s a simple threshold alerting daemon that listens for AMQP events and sends e-mail when they occur. It’s a quick proof-of-concept hack, by no means production ready, but it does work.
<?php
require_once "Mail.php";
$recon = new Reconnoiter_Threshold_Alerting_Service();
$recon->runServer();
class Reconnoiter_Threshold_Alerting_Service
{
private $_config = array();
private $_status = array();
private $_alerts = array();
private $_amqp;
private $_queue;
public function __construct()
{
$this->_config = json_decode(file_get_contents('config.json'), true);
$this->_amqp = new AMQPConnection(
array(
'host' => $this->_config['amqp']['host'],
'vhost' => $this->_config['amqp']['vhost'],
'port' => $this->_config['amqp']['port'],
'login' => $this->_config['amqp']['login'],
'password' => $this->_config['amqp']['password']
)
);
$this->_amqp->connect();
$this->_queue = new AMQPQueue($this->_amqp);
$this->_queue->declare($this->_config['amqp']['queue']);
$this->_queue->bind(
$this->_config['amqp']['exchange'],
$this->_config['amqp']['routing_key']
);
}
public function runServer()
{
do {
$msg = null;
$msg = $this->_queue->get();
if ($msg['count'] !== -1) {
//echo "Received message\n";
$payload = json_decode($msg['msg'],true);
$payload = $payload['threshold'];
if ($payload) {
if (isset($this->_config['checks'][$payload['uuid']][$payload['metric']])) {
$this->_status[$payload['uuid']][$payload['metric']] = $payload;
}
}
} else {
//echo "No message, checking status\n";
$this->_checkStatus();
}
} while ($msg['count'] !== -1 || sleep(1) === 0);
}
private function _checkStatus()
{
foreach ($this->_status as $uuid => $metrics) {
foreach ($metrics as $metric => $data) {
if ($data['status'] !== 'OK') { // In bad state
//echo "Status $uuid not OK!\n";
$time = time();
if (isset($this->_alerts[$uuid][$metric])) {
// Already in bad state
if (
($time - $this->_alerts[$uuid][$metric]['initial_alert']) >
$this->_config['checks'][$uuid][$metric]['grace_period']
) {
// Over grace period
if (
!isset($this->_alerts[$uuid][$metric]['last_alert']) ||
($time - $this->_alerts[$uuid][$metric]['last_alert']) >
$this->_config['global']['alert_freq']
) {
// Hit alert frequency
//echo "Sending alert for $uuid $metric\n";
$this->_alerts[$uuid][$metric]['sent_alert'] = true;
$this->_sendAlert($uuid, $metric);
$this->_alerts[$uuid][$metric]['last_alert'] = $time;
} else {
//echo "Not sending alert yet for $uuid $metric haven't hit alert frequency timeout yet\n";
}
}
} else {
//echo "Switch to bad state for $uuid $metric\n";
$this->_alerts[$uuid][$metric]['sent_alert'] = false;
$this->_alerts[$uuid][$metric]['initial_alert'] = $time;
}
} else { // Recovery
//echo "Status $uuid OK!\n";
if (
isset($this->_alerts[$uuid][$metric]) && $this->_alerts[$uuid][$metric]['sent_alert']
) {
//echo "Recovery for $uuid $metric\n";
$this->_sendAlert($uuid, $metric); // Recovery...
}
unset($this->_alerts[$uuid][$metric]);
}
}
}
return true;
}
private function _sendAlert($uuid, $metric)
{
$data = $this->_status[$uuid][$metric];
$data['description'] = $this->_config['checks'][$uuid][$metric]['description'];
$from = $this->_config['smtp']['from'];
$to = implode(",", $this->_config['global']['contacts']); // separator first; the reversed order was removed in PHP 8
$subject = sprintf(
"ReConThreshold: %s:%s:%s:%s:%s",
$data['target'],
$data['module'],
$data['check'],
$metric, $data['status']
);
$body = print_r($data, true);
$host = $this->_config['smtp']['host'];
$port = $this->_config['smtp']['port'];
$username = $this->_config['smtp']['username'];
$password = $this->_config['smtp']['password'];
$headers = array('From' => $from,'To' => $to,'Subject' => $subject);
$smtp = Mail::factory(
'smtp',
array(
'host' => $host,
'port' => $port,
'auth' => true,
'username' => $username,
'password' => $password
)
);
$mail = $smtp->send($to, $headers, $body);
if (PEAR::isError($mail)) {
//echo $mail->getMessage() . "\n";
} else {
//echo "Message successfully sent!\n";
}
}
}
And the configuration file:
{
"global":
{
"alert_freq":600,
"contacts":["someemail@example.com"]
},
"amqp":{
"host":"127.0.0.1",
"vhost":"/",
"port":5672,
"login":"noit",
"password":"noit",
"queue":"mytest",
"exchange":"noit.alerts",
"routing_key":"noit.alerts.threshold"
},
"checks":
{
"1b4e28ba-2fa1-11d2-883f-b9b761bde3fb":
{
"average":{"grace_period":10,"description":"Average round-trip time"}
}
},
"smtp":
{
"from":"email@gmail.com",
"host":"ssl://smtp.gmail.com",
"port":465,
"username":"email@gmail.com",
"password":"test"
}
}
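The daemon above decodes config.json without checking the result, so a typo in the file only shows up later as odd behaviour. A minimal sketch of a stricter loader might look like the following. The function name loadConfig is hypothetical and not part of the daemon; the section and key names are the ones used in the configuration file above.

```php
<?php
// Hypothetical loader: decode the JSON and fail early if a section or key
// that the daemon reads is missing.
function loadConfig(string $path): array
{
    if (!is_readable($path)) {
        throw new RuntimeException("Cannot read $path");
    }
    $config = json_decode(file_get_contents($path), true);
    if (!is_array($config)) {
        throw new RuntimeException("Invalid JSON in $path: " . json_last_error_msg());
    }
    foreach (['global', 'amqp', 'checks', 'smtp'] as $section) {
        if (!isset($config[$section]) || !is_array($config[$section])) {
            throw new RuntimeException("Missing section: $section");
        }
    }
    if (!isset($config['global']['alert_freq'], $config['global']['contacts'])) {
        throw new RuntimeException("global needs alert_freq and contacts");
    }
    foreach ($config['checks'] as $uuid => $metrics) {
        foreach ($metrics as $metric => $settings) {
            if (!isset($settings['grace_period'], $settings['description'])) {
                throw new RuntimeException("Check $uuid/$metric needs grace_period and description");
            }
        }
    }
    return $config;
}

// Usage with a throwaway file so the example runs on its own.
$path = tempnam(sys_get_temp_dir(), 'recon');
file_put_contents($path, json_encode([
    'global' => ['alert_freq' => 600, 'contacts' => ['someemail@example.com']],
    'amqp'   => ['host' => '127.0.0.1'],
    'checks' => ['1b4e28ba-2fa1-11d2-883f-b9b761bde3fb' => [
        'average' => ['grace_period' => 10, 'description' => 'Average round-trip time'],
    ]],
    'smtp'   => ['host' => 'ssl://smtp.gmail.com'],
]));
$config = loadConfig($path);
echo "Loaded " . count($config['checks']) . " check(s)\n";
unlink($path);
```

In the daemon’s constructor this would replace the bare json_decode call, so that a broken file stops the daemon at startup instead of silently disabling alerts.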
So at this point, we’ve got all the basic pieces cobbled together for threshold alerting.
For each metric that you want threshold monitoring and e-mail notification for, you’ll need to add a statement to the IEP configuration in stratcond that looks like this:
<statement id="<REPLACE WITH UNIQUE QUERY IDENTIFIER>">
<epl>
insert into ThresholdChange
select *, case when value >
<REPLACE WITH THRESHOLD VALUE> then 'WARNING' else 'OK' end as status from
NoitMetricNumeric(uuid='<REPLACE WITH CHECK IDENTIFIER>',
name='<REPLACE WITH CHECK METRIC NAME>')
</epl>
</statement>
Some examples:
<statement id="asdf1">
<epl>
insert into ThresholdChange
select *, case when value > 0.005 then 'WARNING' else 'OK' end as status from
NoitMetricNumeric(uuid='1b4e28ba-2fa1-11d2-883f-b9b761bde3fb',name='average')
</epl>
</statement>
<statement id="asdf2">
<epl>
insert into ThresholdChange
select *, case when value > 0.008 then 'CRITICAL' else 'OK' end as status from
NoitMetricNumeric(uuid='1b4e28ba-2fa1-11d2-883f-b9b761bde3fb',name='average')
</epl>
</statement>
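If you have more than a handful of metrics, writing these statements by hand gets tedious. As a rough sketch, a small PHP helper could render them from a few parameters. The function name thresholdStatement is made up for this example, and the character whitelist is an assumption on my part: it is deliberately narrow so that nothing can break out of the quoted EPL strings, and you may need to widen it for your own metric names.

```php
<?php
// Hypothetical helper: renders one ThresholdChange statement for the IEP
// configuration, following the template shown above.
function thresholdStatement(
    string $id,
    string $uuid,
    string $metric,
    string $threshold,
    string $badStatus = 'WARNING'
): string {
    // Assumption: only simple tokens are allowed inside the EPL string literals.
    foreach ([$uuid, $metric, $badStatus] as $token) {
        if (!preg_match('/^[A-Za-z0-9_.:\-]+$/', $token)) {
            throw new InvalidArgumentException("Unsafe value: $token");
        }
    }
    // The threshold is passed as a string so it is emitted exactly as written.
    if (!preg_match('/^-?\d+(\.\d+)?$/', $threshold)) {
        throw new InvalidArgumentException("Not a plain number: $threshold");
    }
    $epl = "insert into ThresholdChange\n"
         . "select *, case when value > $threshold then '$badStatus' else 'OK' end as status from\n"
         . "NoitMetricNumeric(uuid='$uuid',name='$metric')";
    // CDATA keeps comparison operators in the EPL from upsetting the XML parser.
    return sprintf(
        "<statement id=\"%s\">\n<epl><![CDATA[\n%s\n]]></epl>\n</statement>\n",
        htmlspecialchars($id, ENT_QUOTES),
        $epl
    );
}

echo thresholdStatement('asdf1', '1b4e28ba-2fa1-11d2-883f-b9b761bde3fb', 'average', '0.005');
```

The output can be pasted into the queries section of the stratcond configuration; remember that each statement needs its own unique id.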
You may need to fiddle around a bit with the EPL statements. A good way to troubleshoot them is to run the IEP by hand (/usr/local/bin/run-iep.sh) and watch the debugging output. If you’re familiar with SQL, this shouldn’t be too hard. See the Esper documentation for more on EPL.
Basically, anything that inserts into the ThresholdChange Esper stream will cause a message to be sent over AMQP to the daemon whenever the state changes. The daemon sends an e-mail once a configured item has been in a non-OK state for longer than grace_period seconds, and keeps nagging every alert_freq seconds after that. Should the item recover, it also sends out an e-mail. Note that the daemon treats any status other than the exact string 'OK' as bad, so statements feeding it need to emit 'OK' for the healthy state. There’s probably a clever way to do all or most of this logic straight in Esper, but I don’t have mad Esper skills just yet.
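The grace period and nagging rules are easier to follow when pulled out of the daemon into a pure function. The sketch below mirrors the decisions made in _checkStatus() above, but the function decideAlert and its return convention are made up for illustration and are not part of the daemon.

```php
<?php
// Hypothetical helper: given one item's status, its alert bookkeeping (or
// null if it is not in a bad state yet) and the current time, decide what to
// do. Returns [action, newBookkeeping]; action is 'none', 'alert' or 'recovery'.
function decideAlert(string $status, ?array $alert, int $now, int $gracePeriod, int $alertFreq): array
{
    if ($status === 'OK') {
        // Only announce a recovery if an alert actually went out.
        $action = ($alert !== null && $alert['sent_alert']) ? 'recovery' : 'none';
        return [$action, null];
    }
    if ($alert === null) {
        // First time seen in a bad state: start the grace period clock.
        return ['none', ['initial_alert' => $now, 'sent_alert' => false]];
    }
    if ($now - $alert['initial_alert'] <= $gracePeriod) {
        return ['none', $alert]; // still within the grace period
    }
    if (isset($alert['last_alert']) && $now - $alert['last_alert'] <= $alertFreq) {
        return ['none', $alert]; // already nagged recently
    }
    $alert['sent_alert'] = true;
    $alert['last_alert'] = $now;
    return ['alert', $alert];
}

// Walk through a made-up timeline with grace_period=10 and alert_freq=600.
$timeline = [[0, 'WARNING'], [5, 'WARNING'], [11, 'WARNING'], [100, 'WARNING'], [612, 'WARNING'], [620, 'OK']];
$state = null;
$actions = [];
foreach ($timeline as [$t, $status]) {
    [$action, $state] = decideAlert($status, $state, $t, 10, 600);
    $actions[] = $action;
    echo "t=$t status=$status action=$action\n";
}
```

In this timeline the first e-mail goes out at t=11 (just past the grace period), the reminder at t=612 (more than alert_freq seconds after the first one), and the recovery notice at t=620.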
To make the daemon work, you’ll need the Mail and Net_SMTP PEAR packages, as well as the PECL amqp extension that provides the AMQPConnection and AMQPQueue classes.