The Centreon Business Activity Monitoring (Centreon BAM) module is a software tool for modeling IT services and applications for end users. It reports the status of a service in real time and tracks any changes, allowing you to measure results against business-oriented service level agreements (SLAs) with internal or external users.
Centreon BAM aggregates raw sets of values, or indicators, against which to measure business performance. Each of these key performance indicators (KPIs), collected by the Centreon monitoring system, can be a Centreon service, a logical rule combining multiple services, or another business-centric monitoring indicator, called a Business Activity (BA).
You can use a BA as a KPI for another BA to create an impact tree and model the IT services or applications for analysis.
The changing state of a BA determines its quality of service (QoS), which reflects how well the application performs for its users. Based on this QoS rating, you can define the BA’s operating levels and the SLA.
If a BA fails, the malfunction(s) that led to the drop in QoS can be analyzed and the SLA failure diagnosed.
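The impact tree described above can be pictured as a small recursive data structure. The following is only an illustrative sketch, not Centreon BAM code: the `Kpi` class, the `failing_leaves` helper, and the service names are all hypothetical. It shows how a BA used as a KPI of another BA lets you walk down the tree to find the malfunctions behind a drop in QoS.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Kpi:
    """Hypothetical tree node: a monitored service (leaf) or a nested BA."""
    name: str
    ok: bool = True  # only meaningful for leaf services
    children: List["Kpi"] = field(default_factory=list)  # non-empty for a BA


def failing_leaves(node: Kpi) -> List[str]:
    """Return the names of all failing leaf KPIs found under `node`."""
    if not node.children:
        return [] if node.ok else [node.name]
    found: List[str] = []
    for child in node.children:
        found.extend(failing_leaves(child))
    return found


# A BA ("database") used as a KPI of another BA ("web-shop").
database = Kpi("database", children=[Kpi("db-ping"), Kpi("db-disk", ok=False)])
web_shop = Kpi("web-shop", children=[Kpi("http-check"), database])

print(failing_leaves(web_shop))  # ['db-disk']
```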
The following is an example of how Centreon BAM models an application:
Building a BA and its related KPIs should be simple and methodical. Ideally, you should first include KPIs that are directly related to the overall function of the BA and then gradually add other KPIs that would have a potential impact on BA status.
All KPIs added to a BA must initially be monitored one at a time by the system to determine the operational status of the BA. They can then be incorporated and weighted to reflect the general state of the BA with a blocking, major, minor or null impact on the QoS of a Business Activity.
For instance, if a server ping fails, the impact is weighted as “blocking”, whereas a partition that is 98% full will only have a minor impact, since the service still functions.
This computational logic allows the QoS to reflect the actual status (Availability / Unavailability / Warning state) much more accurately.
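As a rough illustration of this weighting logic, the sketch below starts from a QoS of 100% and subtracts the impact of each failing KPI, never going below 0%. The numeric weights and the `compute_qos` function are assumptions chosen for the example; they are not taken from Centreon BAM, where the actual impact values are configurable.

```python
# Hypothetical impact weights, in QoS percentage points (assumed values).
IMPACT_WEIGHTS = {"blocking": 100.0, "major": 50.0, "minor": 10.0, "null": 0.0}


def compute_qos(failing_impacts):
    """Return the QoS (0-100) given the impact levels of the failing KPIs."""
    total_impact = sum(IMPACT_WEIGHTS[level] for level in failing_impacts)
    return max(0.0, 100.0 - total_impact)


print(compute_qos([]))                     # 100.0 (no KPI is failing)
print(compute_qos(["minor"]))              # 90.0  (e.g. a nearly full partition)
print(compute_qos(["blocking", "minor"]))  # 0.0   (e.g. the server ping fails)
```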
Here is an example of the logic for determining the QoS of an application:
- If QoS = 100%, the IT application/service is available.
- If QoS > 0% and QoS < 100%, the IT application/service is in a Warning or at-risk state.
- If QoS = 0%, the IT application/service is unavailable.
The BA thresholds are:
- Warning threshold: 99.99%
- Critical threshold: 0%
Actual cases differ and QoS thresholds should be adjusted accordingly. However, we recommend using these thresholds when creating your first BA.
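The mapping from a QoS value to a BA state can be sketched as a small function. This is a minimal illustration with a hypothetical `ba_state` helper, and the handling of values exactly on a threshold is an assumption. The thresholds are parameters, so the recommended starting values can be adjusted later.

```python
def ba_state(qos, warning=99.99, critical=0.0):
    """Classify a QoS percentage against the BA thresholds.

    Boundary handling is an assumption: QoS at or below `critical` is
    Critical, QoS at or above `warning` is OK, anything between is Warning.
    """
    if qos <= critical:
        return "Critical"
    if qos >= warning:
        return "OK"
    return "Warning"


print(ba_state(100.0))  # OK
print(ba_state(50.0))   # Warning
print(ba_state(0.0))    # Critical
```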
The first step is to create a list of indicators for the BA and then sort them into several categories:
- KPIs known to have a blocking impact
- KPIs whose impact cannot be measured.
At this stage, consider only the Critical states of KPIs that have a blocking impact. This will make it easier to list and incorporate all the KPIs required for the IT service/application to function properly.
We recommend working with the other intermediate or critical states only after you have gained sufficient experience using the BA with the KPIs.
Calibrating KPIs and calculating the SLA
Using the product on a daily basis helps you follow how the QoS evolves over time and identify the actual threshold below which the application ceases to function.
Whenever an application operates in a Warning state or is unavailable, whether during an update that has a negative impact on the application or after an incident in production, a new KPI may have to be added. Existing KPIs with low impact should also be reviewed in light of the new information available.
When it comes to adjusting the Warning and Critical thresholds of a BA, consider the example of a QoS curve oscillating between 80% and 100%, with an ultimately acceptable level of availability. Some components could cause a drop without really signifying a malfunction. The IT manager could then adjust the BA warning level downwards from 99.99% to 80%. This will unclutter the real-time BA monitoring screens and the Warning alert would then make much more sense.
Conversely, operation may be unsatisfactory even though no Critical alert has been raised. If the application is effectively unusable whenever the QoS falls under 10%, the Critical threshold of the BA should be raised from 0% to 10%.
This computational method refines the availability measurement and makes it more relevant and its QoS value more useful.
It is important not to associate the Warning and Critical thresholds of a BA with the SLA values.
The final value of the SLA is linked to the time spent in the OK, Warning, and Critical states (uptime/downtime), which is visible in the Reporting screens. Consider the following example:
- Warning BA setting: 80%
- Critical BA setting: 10%
- 24/7 monitoring of indicators
- Over a 1-day period:
  - QoS >= Warning: 23.5 hours (OK)
  - Critical < QoS < Warning: 10 minutes (Warning)
  - QoS <= Critical: 20 minutes (Critical)
In this example, the SLA is not 80% or 10% but:
- % Uptime and optimum performance ~ 97.916% (OK)
- % Uptime and degraded performance ~ 0.694% (Warning)
- % Downtime ~ 1.388% (Critical)
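The percentages in this example are simply the share of the 24-hour period (1,440 minutes) spent in each state. The sketch below reproduces the arithmetic; the `state_percentages` function is a hypothetical helper, not part of Centreon BAM. Rounding to three decimals gives 97.917% and 1.389%, which match the figures above up to truncation.

```python
def state_percentages(minutes_by_state):
    """Return the percentage of the total period spent in each state."""
    total = sum(minutes_by_state.values())
    return {
        state: 100.0 * minutes / total
        for state, minutes in minutes_by_state.items()
    }


# Durations from the example: 23.5 hours OK, 10 min Warning, 20 min Critical.
one_day = {"OK": 23.5 * 60, "Warning": 10, "Critical": 20}

for state, pct in state_percentages(one_day).items():
    print(f"{state}: {pct:.3f}%")
```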