Panto is based on an Agent/Server architecture.
Server
The Server is the main component of the solution. It is responsible
for managing the configuration, communicating with the Agents, receiving and storing the
Results, testing the States and triggering the Alerts.
These different functions can be spread across different hosts or deployed redundantly, to allow
better scalability.
Agent
The Agent is a small application that launches the Probes and
sends the Results to the Server. The Agent is configured to register with a
Server, from which it retrieves its full configuration.
In case of communication issues, the Agent spools the Results to be sent, until
the Server is reachable again.
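The spooling behaviour described above can be sketched as follows. This is only an illustration, not Panto's actual implementation: the class name SpoolingSender, the send callable and the use of ConnectionError to signal an unreachable Server are all assumptions made for the example.

```python
from collections import deque
from typing import Callable, Deque, Dict


class SpoolingSender:
    """Buffers Results while the Server is unreachable (hypothetical helper)."""

    def __init__(self, send: Callable[[Dict], None]) -> None:
        # Assumed transport: raises ConnectionError when the Server is down.
        self._send = send
        self._spool: Deque[Dict] = deque()

    def submit(self, result: Dict) -> None:
        """Queue a Result, then try to send everything that is spooled."""
        self._spool.append(result)
        self.flush()

    def flush(self) -> int:
        """Send spooled Results oldest first; stop at the first failure."""
        sent = 0
        while self._spool:
            try:
                self._send(self._spool[0])
            except ConnectionError:
                break  # Server still unreachable: keep the rest spooled
            self._spool.popleft()
            sent += 1
        return sent

    @property
    def pending(self) -> int:
        """Number of Results waiting to be sent."""
        return len(self._spool)
```

A Result is removed from the spool only after it has been sent successfully, so Results are delivered in order once the Server is reachable again.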
There are two types of Agents:
- local: the Agent runs on the Target;
- remote: the Agent runs on the Server or at any remote location.
Target
A Target is the thing you want to monitor. It can be anything, such as:
- a host (bare-metal server, VPS, storage device, ...);
- a service (website, database cluster, ...);
- a piece of network equipment (router, switch, firewall, ...);
- an application (API, website, ...).
We recommend using a unique FQDN to identify a Target, but you can use any string,
as the Target is identified by a UUID. Nevertheless, the Target has to be
reachable by the Agent. It can run an Agent or it can receive data scraping
requests from a remote Agent.
From collecting metrics to triggering an Alert
Probe
A Probe is the specific process that tests a behaviour or gathers metrics on a Target.
A Probe is always spawned by an Agent, whether it is a local or a remote Agent.
Result
The Result is the output of a Probe. It consists of a timestamp (when the Probe
was executed), and a dictionary of metrics with their values (int, float,
string or boolean).
It is stored in the timeseries database.
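As an illustration, the shape of a Result could be modelled as below. This is a minimal sketch based only on the description above; the class layout, the MetricValue alias and the metric names are hypothetical and do not reflect Panto's actual data model or storage format.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, Union

# A metric value is an int, a float, a string or a boolean.
MetricValue = Union[int, float, str, bool]


@dataclass(frozen=True)
class Result:
    """Output of one Probe execution (hypothetical model)."""

    timestamp: datetime  # when the Probe was executed
    metrics: Dict[str, MetricValue] = field(default_factory=dict)


# Example with made-up metric names.
example = Result(
    timestamp=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
    metrics={
        "free_memory_percent": 42.5,
        "http_status": 200,
        "reachable": True,
        "version": "1.2.3",
    },
)
```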
State
A State is the classification of the Result of a Probe. It can have 6 different
values:
- Unknown: when the State can't be determined, either because there's no value,
or because the value can't be compared to configuration;
- OK: when the Result is considered as normal;
- Warning: when the Result is considered as worth attention;
- Critical: when the Result shows an issue that needs a reaction;
- Missing Data: when the Agent hasn't sent any Result for a while;
- Error: when a Probe had an error and failed to return a Result.
To determine the State of a Probe based on its Result, the configuration has 3
different algorithms:
- Threshold: the value of one metric is compared to a fixed reference (e.g. free memory
is above 10%);
- Trend: the trend of the last X values of the metric is compared to a fixed reference
(e.g. the derivative of the count of HTTP/500 over the last 25 occurrences is below 2);
- History: the value of one metric is compared to the value of the same metric taken Y
seconds ago (e.g. the count of successful checkouts compared to the same count 24 hours
ago).
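The three algorithms above can be sketched as small functions. This is an illustration under stated assumptions, not Panto's actual code: the function names and parameters are hypothetical, the threshold example assumes that higher values are worse, the trend is approximated as the average change per sample over the window, and the history comparison is expressed as a ratio.

```python
from enum import Enum
from typing import Optional, Sequence


class State(Enum):
    """Subset of the States used by this sketch."""

    UNKNOWN = "unknown"
    OK = "ok"
    WARNING = "warning"
    CRITICAL = "critical"


def threshold_state(value: Optional[float], warning: float, critical: float) -> State:
    """Threshold: compare one value to fixed references (higher is worse here)."""
    if value is None:
        return State.UNKNOWN  # no value, so the State can't be determined
    if value >= critical:
        return State.CRITICAL
    if value >= warning:
        return State.WARNING
    return State.OK


def trend(values: Sequence[float]) -> Optional[float]:
    """Trend: average change per sample over the last X values."""
    if len(values) < 2:
        return None  # not enough values to compute a trend
    return (values[-1] - values[0]) / (len(values) - 1)


def history_ratio(current: float, previous: Optional[float]) -> Optional[float]:
    """History: ratio of the current value to the value taken Y seconds ago."""
    if previous is None or previous == 0:
        return None  # no usable reference value
    return current / previous
```

For instance, threshold_state(95.0, warning=80.0, critical=90.0) returns State.CRITICAL, trend([0, 1, 3, 6]) returns 2.0, and history_ratio(50, 100) returns 0.5. The trend and history values would then be compared to a configured reference in the same way as a threshold.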
Alert
An Alert may occur when the State of a check has been calculated from a Result.
There are different algorithms to evaluate the Alert conditions:
- State change: if the current State differs from the previous one (an option
can force the Alert to happen when the State is worse than the previous one);
- State recurrence: if the current State is not OK and lasts for more than X
occurrences;
- State flap: if the State of the Probe changes more than X times in the
last Y seconds.
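The Alert conditions above can be sketched as follows. This is a simplified illustration, not Panto's implementation: States are represented as plain strings, the function names and parameters are hypothetical, "lasts for more than X occurrences" is interpreted as a run of identical consecutive States, and the flap check only counts changes between States recorded inside the time window.

```python
from typing import Sequence, Tuple


def state_changed(previous: str, current: str) -> bool:
    """State change: the current State differs from the previous one."""
    return previous != current


def state_recurs(states: Sequence[str], max_occurrences: int) -> bool:
    """State recurrence: the latest State is not OK and has lasted
    for more than max_occurrences consecutive occurrences."""
    if not states or states[-1] == "ok":
        return False
    run = 0
    for state in reversed(states):
        if state != states[-1]:
            break
        run += 1
    return run > max_occurrences


def state_flaps(
    history: Sequence[Tuple[float, str]],
    now: float,
    max_changes: int,
    window: float,
) -> bool:
    """State flap: more than max_changes State changes in the last window seconds.

    history is a chronological list of (timestamp, state) pairs.
    """
    recent = [state for ts, state in history if now - window <= ts <= now]
    changes = sum(1 for a, b in zip(recent, recent[1:]) if a != b)
    return changes > max_changes
```

For example, state_recurs(["ok", "critical", "critical", "critical"], 2) returns True because the critical State has lasted for 3 occurrences.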
When an Alert occurs, it sends a notification to the contact through one of the
following channels, based on the configuration:
- an email;
- an SMS;
- a webhook (Slack, Flowdock, Mattermost...);
- an API (PagerDuty).