Quick start

Foreword

The Centreon Business Activity Monitoring module (Centreon-BAM) aims to define a new “business-centric” monitoring indicator aggregated from unitary indicators (KPI) collected by the monitoring system. This new object is called a Business Activity (BA).

The evolution of a BA object will determine a quality of service (QoS), thus reflecting the level of quality delivered by the application to its users. Based on this QoS rating, we can define the BA’s operating levels and thus the service level agreement (SLA).

If the BA fails, one can analyse the malfunctions that led to the fall in the QoS and, by extension, the reduction in the SLA.

Initial configuration

A BA and its related KPIs must be built simply and in steps. Ideally, one should start by including the most obvious KPIs (those directly related to the general functioning of the BA), then progressively add those with a potential impact on the overall functioning.

All KPIs added to the BA must initially be monitored one by one by the monitoring system so that their operational status is known. They can then be added to the BAs and weighted to reflect the general state of the BA. These weightings can have a “blocking”, “major”, “minor” or “null” impact on the BA’s QoS.

For instance, if a server ping fails, the related weighting is “blocking”, whereas a 98% full partition will only have a “minor” impact.

Thanks to this computational logic, the resulting QoS will reflect actual availability / unavailability much more accurately.
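This weighting logic can be sketched in a few lines. The following is an illustrative model only, not Centreon-BAM’s internal implementation: it assumes each KPI in a non-OK state subtracts a fixed number of points (per its weighting) from a starting QoS of 100%, with the point values chosen here for illustration.

```python
# Assumed impact values, in QoS points, for illustration only.
IMPACTS = {"blocking": 100, "major": 50, "minor": 10, "null": 0}

def qos(kpi_states):
    """Compute an illustrative QoS from a list of (state, weighting) pairs.

    Each KPI in a non-OK state subtracts its weighting's impact
    from a starting score of 100; the result is floored at 0.
    """
    score = 100
    for state, weighting in kpi_states:
        if state != "ok":
            score -= IMPACTS[weighting]
    return max(score, 0)

# A failed server ping ("blocking") drops the QoS straight to 0%,
# while a 98% full partition ("minor") only costs a few points.
print(qos([("critical", "blocking")]))                      # 0
print(qos([("critical", "minor"), ("ok", "blocking")]))     # 90
```

With this model, a single blocking failure makes the BA unavailable, while accumulated minor failures degrade the QoS gradually, which matches the ping/partition example above.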

If QoS > 0%, SLA = available. Conversely, if QoS = 0%, then SLA = unavailable. The warning levels must therefore, by default, be as follows:

  • Warning threshold: 99.99%;

  • Critical threshold: 0%.

Some cases are naturally different, and the QoS thresholds will differ accordingly. Generally speaking, however, these defaults should be applied when creating our first BA.

Implementation method

The first step is to list the indicators making up the BA, then sort them into several categories:

  • The key indicators known to have a blocking impact;

  • The key indicators whose impact we do not yet really know how to measure.

Then deal only with the “Ok” and “Critical” states of the KPIs, with the “Null” or “Blocking” impacts.

**Notice:** Other intermediate states or impacts can be used later, once the user has a good grasp of how the BA functions through its key indicators.

Calibrating KPIs and calculating the SLA

Subsequently, the real changes in QoS, and the real QoS threshold below which the application is no longer operational, will only become apparent through daily use of the product over time.

When major situations arise in which the application is in a critical state or even unavailable, it is time to add new KPIs, or to increase the impact of those weighted “Null” until now, based on this new knowledge.

Regarding the adjustment of the BA’s warning and critical thresholds, consider the example of a QoS curve oscillating between 80% and 100%, with an ultimately quite acceptable level of availability. Some components may cause drops without these being truly representative of a malfunction. The administrator can then adjust the BA’s warning level downwards from 99.99% to 80%.

This will make the real-time BA monitoring screens less cluttered, and the warning alert will then carry much more meaning. Likewise, one may later realize that the application is no longer fit for use even though the critical alert has not triggered: the observed QoS may show that the state is actually critical below 10%, in which case the BA’s critical threshold should be raised from 0% to 10%.

This computational method refines the availability measurement and makes it increasingly relevant, by using its QoS value to good effect.

**Notice:** It is important not to confuse the warning and critical thresholds of a BA with the SLA values.

The final value of the SLA is linked to the time spent in OK, warning / critical conditions (downtime/uptime), which are visible in the “reporting” screens.

Examples:

  • Warning BA setting: 80%;

  • Critical BA setting: 10%;

  • 24/7 monitoring of indicators;

  • Over a 1-day period:

      • Time spent with QoS above the warning threshold = 23.5 hours (OK);

      • Time spent with QoS between the critical and warning thresholds = 10 minutes (Warning);

      • Time spent with QoS below the critical threshold = 20 minutes (Critical).

In this example, the SLA is neither 80% nor 10%, but:

  • Uptime (OK) = 97.916%;

  • Time degraded (Warning) = 0.694%;

  • Time in critical state (Critical) = 1.388%.
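The arithmetic behind these percentages can be sketched as follows. This is simply the example’s figures divided by the length of the period; the figures themselves come from the “reporting” screens, not from this code:

```python
# The SLA derives from time spent in each state over the period,
# not from the BA's warning/critical thresholds.
MINUTES_PER_DAY = 24 * 60  # 1440 minutes in a 1-day period

# Time spent in each state, in minutes, from the example above.
time_spent = {"OK": 23.5 * 60, "Warning": 10, "Critical": 20}

for state, minutes in time_spent.items():
    share = minutes / MINUTES_PER_DAY * 100
    print(f"{state}: {share:.3f}%")  # ~97.9% OK, ~0.69% Warning, ~1.39% Critical
```

Note that the three shares sum to 100% of the period: the SLA is a breakdown of time, which is why it cannot be read off the 80% and 10% thresholds directly.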