Approach

Panto is based on an Agent/Server architecture.

Server

The Server is the main component of the solution. It is responsible for managing the configuration, communicating with the Agents, receiving and storing the Results, testing the States and triggering the Alerts.
These functions can be spread across different hosts or deployed redundantly, to allow better scalability.

Agent

The Agent is a small application that launches the Probes and sends the Results to the Server. The Agent is configured to register with a Server, from which it retrieves its full configuration.
In case of communication issues, the Agent spools the Results to be sent, until the Server is reachable again.
There are two types of Agents:

  • local: the Agent runs on the Target;
  • remote: the Agent runs on the Server or at any other remote location.
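
The spooling behaviour described above can be sketched as follows. This is a minimal illustration, not Panto's actual implementation: the class name SpoolingSender, the in-memory queue and the use of ConnectionError to signal an unreachable Server are all assumptions made for the example.

```python
from collections import deque
from typing import Callable, Deque, Dict


class SpoolingSender:
    """Buffers Results while the Server is unreachable (hypothetical helper)."""

    def __init__(self, send: Callable[[Dict], None]) -> None:
        # Assumed transport: a callable that raises ConnectionError on failure.
        self._send = send
        self._spool: Deque[Dict] = deque()

    def submit(self, result: Dict) -> None:
        """Queue a Result, then try to flush the spool in order."""
        self._spool.append(result)
        self.flush()

    def flush(self) -> int:
        """Send spooled Results oldest first; stop at the first failure."""
        sent = 0
        while self._spool:
            try:
                self._send(self._spool[0])
            except ConnectionError:
                break  # Server still unreachable: keep the rest spooled
            self._spool.popleft()  # drop a Result only after it was sent
            sent += 1
        return sent

    @property
    def pending(self) -> int:
        """Number of Results waiting to be sent."""
        return len(self._spool)
```

Sending the oldest Result first keeps the timeseries in order once the Server is reachable again. A real Agent would probably spool to disk so that Results survive a restart.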

Target

A Target is whatever you want to monitor. It can be anything, for example:

  • a host (bare-metal server, VPS, storage device, ...);
  • a service (website, database cluster, ...);
  • a network device (router, switch, firewall, ...);
  • an application (API, website, ...).

We recommend identifying a Target by a unique FQDN, but any string works, since the Target itself is identified by a UUID. The Target must nevertheless be reachable by an Agent: it either runs a local Agent or receives data scraping requests from a remote Agent.
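
To illustrate the difference between the free-form name and the UUID, here is a small sketch. The Target class and its field names are hypothetical and only show the idea: the name is a label, the UUID is the identity.

```python
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Target:
    """Hypothetical model of a Target: a label plus a unique identifier."""

    name: str  # ideally a unique FQDN, but any string is accepted
    id: uuid.UUID = field(default_factory=uuid.uuid4)  # the actual identity


# Two Targets may share a name and still be distinct objects to monitor.
a = Target("db1.example.com")
b = Target("db1.example.com")
```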

From collecting metrics to triggering an Alert

Probe
A Probe is the specific process that tests a behaviour or gathers metrics on a Target. A Probe is always spawned by an Agent, whether it is a local or a remote Agent.
Result
The Result is the output of a Probe. It consists of a timestamp (when the Probe was executed), and a dictionary of metrics with their values (int, float, string or boolean).
It is stored in the timeseries database.
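
As an illustration, a Result could be modelled as below. The class, the field names and the metric names are assumptions chosen for the example. Only the general shape (a timestamp plus a dictionary of typed metrics) comes from the description above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, Union

# The four value types a metric may take, as listed above.
MetricValue = Union[int, float, str, bool]


@dataclass(frozen=True)
class Result:
    """Hypothetical shape of a Probe Result."""

    timestamp: datetime              # when the Probe was executed
    metrics: Dict[str, MetricValue]  # metric name -> value


# Example with made-up metric names, one per supported value type.
example = Result(
    timestamp=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
    metrics={
        "http_500_count": 3,
        "free_memory_pct": 42.5,
        "status": "up",
        "reachable": True,
    },
)
```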
State
A State is the classification of the Result of a Probe. It can have 6 different values:
  • Unknown: when the State can't be determined, either because there is no value, or because the value can't be compared to the configuration;
  • OK: when the Result is considered normal;
  • Warning: when the Result is considered worth attention;
  • Critical: when the Result shows an issue that needs a reaction;
  • Missing Data: when the Agent hasn't sent any Result for a while;
  • Error: when a Probe had an error and failed to return a Result.
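
The six values listed above map naturally onto an enumeration. The sketch below is illustrative only: the class name and the string values are assumptions, not Panto's internal representation.

```python
from enum import Enum


class State(Enum):
    """Hypothetical enumeration of the six State values."""

    UNKNOWN = "Unknown"
    OK = "OK"
    WARNING = "Warning"
    CRITICAL = "Critical"
    MISSING_DATA = "Missing Data"
    ERROR = "Error"
```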

To determine the State of a Probe based on its Result, the configuration has 3 different algorithms:
  • Threshold: the value of one metric is compared to a fixed reference (e.g. free memory is above 10%);
  • Trend: the trend of the last X values of the metric is compared to a fixed reference (e.g. the derivative of the count of HTTP/500 over the last 25 occurrences is below 2);
  • History: the value of one metric is compared to the value of the same metric taken Y seconds ago (e.g. the count of successful checkouts compared to the same count 24 hours ago).
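
The three algorithms above can be sketched as small functions. The function names and signatures are hypothetical. In particular, computing the trend as the average change per occurrence is one possible reading of "derivative", and expressing the history comparison as a ratio is likewise an assumption.

```python
from typing import Optional, Sequence


def threshold_ok(value: float, minimum: float) -> bool:
    """Threshold: compare one value to a fixed reference."""
    return value > minimum


def trend_slope(values: Sequence[float]) -> Optional[float]:
    """Trend: average change per occurrence over the last X values."""
    if len(values) < 2:
        return None  # not enough data; would map to the Unknown State
    return (values[-1] - values[0]) / (len(values) - 1)


def history_ratio(current: float, past: float) -> Optional[float]:
    """History: current value relative to the value taken Y seconds ago."""
    if past == 0:
        return None  # ratio undefined; would map to the Unknown State
    return current / past


# Usage, following the examples given in the list above:
memory_ok = threshold_ok(12.0, minimum=10.0)   # free memory above 10%
errors_ok = trend_slope([0, 2, 4, 6]) < 3      # slope of the HTTP/500 count
checkouts = history_ratio(90, 100)             # today's count vs. 24 hours ago
```

Each function returns None when no comparison is possible, which corresponds to the Unknown State rather than to OK or Critical.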

Alert
An Alert may occur once the State of a Probe has been calculated from a Result. There are different algorithms to evaluate the Alert conditions:
  • State change: if the current State differs from the previous one (an option can force the Alert to happen when the State is worse than the previous one);
  • State recurrence: if the current State is not OK and lasts for more than X occurrences;
  • State flap: if the State of the Probe changes more than X times in the last Y seconds.
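
The Alert conditions above can be sketched as follows. The function names are hypothetical and States are represented as plain strings for brevity. Reading "lasts" as the same non-OK State repeated consecutively is an assumption of this sketch.

```python
from typing import Sequence, Tuple


def state_changed(previous: str, current: str) -> bool:
    """State change: the current State differs from the previous one."""
    return current != previous


def state_recurs(states: Sequence[str], occurrences: int) -> bool:
    """State recurrence: latest State is not OK and repeated more than X times.

    `states` is ordered oldest first.
    """
    if not states or states[-1] == "OK":
        return False
    run = 0
    for state in reversed(states):
        if state != states[-1]:
            break
        run += 1
    return run > occurrences


def state_flaps(
    history: Sequence[Tuple[float, str]],
    now: float,
    max_changes: int,
    window: float,
) -> bool:
    """State flap: more than X State changes within the last Y seconds.

    `history` holds (unix_timestamp, state) pairs, ordered oldest first.
    """
    recent = [state for ts, state in history if now - ts <= window]
    changes = sum(1 for a, b in zip(recent, recent[1:]) if a != b)
    return changes > max_changes
```

Only changes between consecutive States inside the window are counted, so a Probe that stays Critical for the whole window does not count as flapping.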

When an Alert occurs, a notification is sent to the contact through the configured channel:
  • an email;
  • an SMS;
  • a webhook (Slack, Flowdock, Mattermost...);
  • an API call (PagerDuty).