Panto is based on an Agent/Server architecture.
The Server is the main component of the solution. It is responsible
for managing the configuration, communicating with the Agents, receiving and storing the
Results, testing the States and triggering the Alerts.
These functions can be spread across different hosts or deployed redundantly.
The Agent is a small application that launches the Probes and
sends the Results to the Server. The Agent is configured to register on a
Server, where it gets its full configuration.
In case of communication issues, the Agent spools the Results to be sent, until
the Server is reachable again.
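The spooling behaviour described above can be sketched as follows. This is a minimal illustration, not Panto's actual implementation: the `Spool` class, the `send` callable and the use of `ConnectionError` to signal an unreachable Server are all assumptions made for the example.

```python
from collections import deque
from typing import Callable, Deque, Dict


class Spool:
    """Hypothetical in-memory spool: keeps Results until they are delivered."""

    def __init__(self, send: Callable[[Dict], None]) -> None:
        self._send = send  # assumed callable that delivers one Result to the Server
        self._pending: Deque[Dict] = deque()

    def submit(self, result: Dict) -> None:
        """Queue a new Result, then try to deliver everything that is pending."""
        self._pending.append(result)
        self.flush()

    def flush(self) -> int:
        """Send pending Results in order and stop at the first failure.

        Returns the number of Results delivered.
        """
        delivered = 0
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                break  # Server unreachable: keep the remaining Results spooled
            self._pending.popleft()  # drop a Result only after a successful send
            delivered += 1
        return delivered

    def __len__(self) -> int:
        return len(self._pending)
```

Results are removed from the queue only after a successful send, so an outage in the middle of a flush neither loses nor reorders them.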
There are two types of Agents:
- local, meaning the Agent runs on the Target;
- remote, meaning the Agent runs on the Server or on any remote location.
A Target is the thing you want to monitor. It can be anything, like:
- a host (bare metal server, VPS, storage device, ...);
- a service (website, database cluster, ...);
- a piece of network equipment (router, switch, firewall, ...);
- an application (API, website, ...).
We recommend using a unique FQDN to identify a Target, but you can use any string,
as the Target is identified by a UUID. Nevertheless, the Target has to be
reachable by the Agent. It can run an Agent or it can receive data scraping
requests from a remote Agent.
From collecting metrics to triggering an Alert
A Probe is the specific process that tests a behaviour or gathers metrics on a Target.
A Probe is always spawned by an Agent, whether it is a local or a remote Agent.
The Result is the output of a Probe. It consists of a timestamp (when the Probe
was executed), and a dictionary of metrics with their values (int, float,
string or boolean).
It is stored in the timeseries database.
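As an illustration of the structure described above, a Result could be modelled like this. It is a sketch only: the `Result` class, its field names and the metric names are hypothetical and are not taken from Panto's code.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, Union

# The value types listed above: int, float, string or boolean.
MetricValue = Union[int, float, str, bool]


@dataclass
class Result:
    """Hypothetical shape of a Probe Result: a timestamp plus named metrics."""

    timestamp: datetime  # when the Probe was executed
    metrics: Dict[str, MetricValue] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Reject values outside the four supported types.
        for name, value in self.metrics.items():
            if not isinstance(value, (int, float, str, bool)):
                raise TypeError(
                    f"metric {name!r} has unsupported type {type(value).__name__}"
                )


# Example with made-up metric names.
result = Result(
    timestamp=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
    metrics={"free_memory_percent": 42.5, "swap_enabled": True, "kernel": "6.1"},
)
```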
The State is the classification of the Result of a Probe. It can have 6 different values:
- Unknown: when the State can't be determined, either because there's no value,
or because the value can't be compared to the configuration;
- OK: when the Result is considered as normal;
- Warning: when the Result is considered as worth attention;
- Critical: when the Result shows an issue that needs a reaction;
- Missing Data: when the Agent hasn't sent any Result for a while;
- Error: when a Probe had an error and failed to return a Result.
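The six values listed above map naturally onto an enumeration. The following is a minimal sketch; the `State` class and its member names are assumptions for illustration, and only the labels come from the list.

```python
from enum import Enum


class State(Enum):
    """Hypothetical enumeration of the six States; values are the labels above."""

    UNKNOWN = "Unknown"
    OK = "OK"
    WARNING = "Warning"
    CRITICAL = "Critical"
    MISSING_DATA = "Missing Data"
    ERROR = "Error"


# A label read from storage or configuration can be turned back into a member.
current = State("Missing Data")
```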
To determine the State of a Probe based on its Result, the configuration has 3 options:
- Threshold: the value of one metric is compared to a fixed reference (e.g. free memory
is above 10%);
- Trend: the trend of the last X values of the metric is compared to a fixed reference
(e.g. the derivative of the count of HTTP/500 over the last 25 occurrences is below 2);
- History: the value of one metric is compared to the value of the same metric taken Y
seconds ago (e.g. the count of successful checkouts compared to the same count 24 hours
ago).
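The three kinds of comparison can be sketched as small functions. This is an illustration under stated assumptions, not Panto's implementation: the function names are hypothetical, the "derivative" is taken as the average change per sample, and the History comparison is assumed to be a tolerated relative drop.

```python
import operator
from typing import Callable, Sequence

Comparison = Callable[[float, float], bool]


def threshold(value: float, reference: float, compare: Comparison) -> bool:
    """Threshold: compare one metric value to a fixed reference."""
    return compare(value, reference)


def trend(values: Sequence[float], reference: float, compare: Comparison) -> bool:
    """Trend: compare the average change per sample to a fixed reference.

    Assumption: the derivative is (last - first) / (number of steps).
    """
    if len(values) < 2:
        raise ValueError("a trend needs at least two values")
    slope = (values[-1] - values[0]) / (len(values) - 1)
    return compare(slope, reference)


def history(current: float, past: float, max_drop_ratio: float) -> bool:
    """History: compare the current value to the one taken Y seconds ago.

    Assumption: the test passes while the value has not dropped by more
    than max_drop_ratio (0.2 means 20%) relative to the past value.
    """
    if past == 0:
        return True  # nothing to drop from
    return (past - current) / past <= max_drop_ratio


# Free memory (35%) is above 10%.
assert threshold(35.0, 10.0, operator.gt)
# The HTTP/500 count grows by 1 per sample on average, which is below 2.
assert trend([0, 1, 2, 3, 4], 2.0, operator.lt)
# Checkouts went from 100 to 90, a 10% drop, within a 20% tolerance.
assert history(90, 100, 0.2)
```

How a passing or failing comparison is mapped to OK, Warning or Critical is left out on purpose, since that depends on the configuration.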
An Alert may occur when the State of a check has been calculated from a Result.
There are different algorithms to evaluate the Alert:
- State change: if the current State differs from the previous one (an option
can force the Alert to happen when the State is worse than the previous one);
- State recurrence: if the current State is not OK and lasts for more than X
- State flap: if the Probe's State changes more than X times in the
last Y seconds.
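The three algorithms above can be sketched over a list of past States. This is a hedged illustration rather than Panto's code: the function names and the `(timestamp, state)` tuple layout are hypothetical, States are plain strings, and the recurrence duration is assumed to be expressed in seconds.

```python
from typing import Sequence, Tuple

# (timestamp in seconds, State label), oldest first. Hypothetical layout.
History = Sequence[Tuple[float, str]]


def state_changed(history: History) -> bool:
    """State change: the latest State differs from the previous one."""
    return len(history) >= 2 and history[-1][1] != history[-2][1]


def state_recurs(history: History, min_duration: float) -> bool:
    """State recurrence: the latest State is not OK and has lasted longer
    than min_duration (assumed to be in seconds)."""
    if not history or history[-1][1] == "OK":
        return False
    now, current = history[-1]
    started = now
    # Walk backwards to find when the current State began.
    for ts, state in reversed(history):
        if state != current:
            break
        started = ts
    return now - started > min_duration


def state_flaps(history: History, max_changes: int, window: float) -> bool:
    """State flap: more than max_changes changes in the last `window` seconds."""
    if not history:
        return False
    now = history[-1][0]
    recent = [state for ts, state in history if now - ts <= window]
    changes = sum(1 for a, b in zip(recent, recent[1:]) if a != b)
    return changes > max_changes
```

In this sketch only transitions between samples that both fall inside the window count as changes.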
When an Alert occurs, it sends a notification to the contact, based on the configuration:
- an email;
- an SMS;
- a webhook (Slack, Flowdock, Mattermost...);
- an API (PagerDuty).