Revision as of 17:02, 11 February 2016

Monitoring

Any complex system needs to be monitored. The LMA (Logging, Monitoring, Alerting) Fuel plugins provide a comprehensive logging, monitoring and alerting system for Mirantis OpenStack. LMA uses open-source products and can also be integrated with existing monitoring systems.
This manual describes a full configuration that includes 4 plugins:

ElasticSearch/Kibana – Log Search, Filtration and Analysis
LMA Collector – LMA Toolchain Data Aggregation Client
InfluxDB/Grafana – Time-Series Event Recording and Analysis
LMA Nagios – Alerting for LMA Infrastructure

(More details about MOS Fuel plugins: MOS Plugins overview)
The LMA-related plugins can be used separately, but they are designed to work together. All examples in this document come from a cloud where all 4 plugins are installed and configured.

LMA DataFlow

LMA is a complex system that contains the following parts:

Collectd - data collecting
Heka - data collecting and aggregation
InfluxDB - non-SQL database for time-series data; in LMA it is used for charts in Grafana
ElasticSearch (https://www.elastic.co/) - non-SQL database; in LMA it is used to store logs
Grafana - Graph and dashboard builder
Kibana - Analytics and visualization platform
Nagios - Monitoring system

+--------------------------------+                                                                                   Grafana dashboard                      Kibana Dashboard
|Node N (compute or controller)  |                                                                                  -------------------                 ----------------------
|                                |                                                                                           ^                                       ^
|* collectd --->---+             |                                                                                           |                                       |          
|                  |             |                     +-<- Some data generated locally is looped to-<+                      |                                       |                
|                  |             |                     |    to be aggregated                          |                      |                                       |                
|*  hekad  <---<---+             |                     |                                              ^                      |                                       |                
|    |                           |       +-------------------------------------------+                |                      ^                                       |                
|    +------------->To Aggregator|------>| Heka on aggregator (Controller with VIP)  |---+------------+                      |                                       |                
|    |                           |       +-------------------------------------------+                                       |                                       |                     
|    |                           |               |to         |to            |to                                              |                                       |                   
|    |                           |               |Influx     |ElasticSearch |Nagios                                          |                                       |           
|    +---------------------------|->-------->--- +-------->--|------->------|--->---------->---------->--------------[ InfluxDB      ]                               ^
|    |                           |                           |              |                                                                                        |
|    |                           |                           |              |                                                                                        |
|    +---------------------------|------------------------>--+------->------|--->---------->---------->--------------[ ElasticSearch ]-------------------------------+
|                                |                                          |                                                                    
+--------------------------------+                                          |
                                                                            +----------------------------------------[ Nagios ]---> Alerts (e.g. email notifications)

AFD and GSE

Aggregator

Overview

The process of running alarms in LMA is not centralized (as is often the case in conventional monitoring systems) but distributed across all the Collectors. Each Collector is individually responsible for monitoring the resources and services deployed on its node and for reporting any anomaly or fault it detects to the Aggregator.
The anomaly and fault detection logic in LMA is designed more like an “Expert System”, in that the Collector and the Aggregator use facts and rules that are executed within Heka’s stream-processing pipeline.

The facts are the messages ingested by the Collector into the Heka pipeline. The rules are either threshold monitoring alarms or aggregation and correlation rules.

Both are declaratively defined in YAML files that you can modify. Those rules are executed by a collection of Heka filter plugins written in Lua that are organised according to a configurable processing workflow.
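A threshold monitoring alarm of the kind mentioned above can be pictured with a short Python sketch. It is only an illustration of the idea (average a metric over a time window, then compare it with a threshold); the real rules run as Lua filter plugins inside Heka, and the function and parameter names below are hypothetical rather than taken from the LMA code.

```python
import operator
from statistics import mean

# Hypothetical illustration only: not the LMA Lua plugin code.
RELATIONAL_OPERATORS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def rule_fires(samples, threshold, window, now, relational_operator="<="):
    """Evaluate one threshold rule.

    samples: list of (timestamp, value) pairs for a single metric.
    window:  how many seconds of history to average over.
    Returns True/False, or None when the window contains no data.
    """
    recent = [value for (timestamp, value) in samples
              if now - window <= timestamp <= now]
    if not recent:
        return None  # nothing to evaluate, status is unknown
    compare = RELATIONAL_OPERATORS[relational_operator]
    return compare(mean(recent), threshold)
```

For example, with `cpu_idle` samples of 4.0 and 2.0 inside the window and a threshold of 5, `rule_fires(..., relational_operator="<=")` returns `True`, which a Collector would then report as an anomaly.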

We call these plugins the AFD plugins for Anomaly and Fault Detection plugins and the GSE plugins for Global Status Evaluation plugins. Both the AFD and GSE plugins in turn create metrics called the AFD metrics and the GSE metrics respectively.
The AFD metrics contain information about the health status, at the node level, of a resource like a device, a system component like a filesystem, or a service like an API endpoint. Those AFD metrics are then sent on a regular basis by each Collector to the Aggregator, where they can be aggregated and correlated, hence the name Aggregator.
E.g.
The GSE metrics contain information about the health status of a service cluster, like the Nova API endpoints cluster, or the RabbitMQ cluster as well as the clusters of nodes, like the Compute cluster or Controller cluster.
The health status of a cluster is inferred by the GSE plugins using aggregation and correlation rules and facts contained in the AFD metrics it receives from the Collectors.
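As a rough sketch of how a cluster status could be inferred from per-node AFD statuses, the Python below applies a simple "worst member wins" rule. This is an assumption made for illustration: the actual GSE policies are configurable and may be more tolerant, and the status names and their ordering here are hypothetical.

```python
# Hypothetical status names, ordered from healthiest to worst.
STATUS_RANK = {"okay": 0, "warning": 1, "critical": 2}


def cluster_status(afd_statuses):
    """Infer a cluster status from a {node_name: status} mapping.

    Uses a simplistic policy: the cluster is as unhealthy as its
    worst member. Returns "unknown" when no AFD metric was received.
    """
    if not afd_statuses:
        return "unknown"
    return max(afd_statuses.values(), key=lambda status: STATUS_RANK[status])
```

With `{"node-6": "okay", "node-7": "warning"}` the function returns `"warning"`; a real policy might instead keep the cluster at "okay" until more than one member degrades.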

Modifying CPU alarm

Modifying an existing alarm is explained in detail in the LMA Collector plugin documentation (http://plugins.mirantis.com/docs/l/m/lma_collector/lma_collector-0.8-0.8.0-1.pdf).
Here is an example of what it looks like.

Modify alarm

Data flow

collectd.Values(type='cpu',type_instance='idle',plugin='cpu',plugin_instance='0',host='node-6',time=1455198002.296594,interval=10.0,values=[29482997])
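The raw value above (29482997) is a counter, while the Heka message for the same metric carries a value of about 62.3 with data-source type "derive". One plausible reading is that the counter is converted to a per-second rate between two consecutive samples. The sketch below shows that calculation in Python; the previous sample used in the usage note is invented for illustration and does not come from the capture.

```python
def derive_rate(prev_value, prev_time, value, time):
    """Per-second rate of change between two counter samples.

    This mirrors the usual meaning of a "derive" data source:
    (current - previous) / elapsed seconds.
    """
    elapsed = time - prev_time
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    return (value - prev_value) / elapsed
```

Assuming a hypothetical previous reading of 29482374 taken 10 seconds earlier (the interval shown above), `derive_rate(29482374, 0.0, 29482997, 10.0)` gives 62.3, which is the order of magnitude seen in the Heka message.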

:Timestamp: 2016-02-11 12:17:12.296999936 +0000 UTC
:Type: metric
:Hostname: node-6
:Pid: 22518
:Uuid: d40bce11-ccb5-4d52-a7d0-7927424b2709
:Logger: collectd
:Payload: {"type":"cpu","values":[62.2994],"type_instance":"idle","dsnames":["value"],"plugin":"cpu","time":1455193032.297,"interval":10,"host":"node-6","dstypes":["derive"],"plugin_instance":"0"}
:EnvVersion:
:Severity: 6
:Fields:
    | name:"type" type:string value:"derive"
    | name:"source" type:string value:"cpu"
    | name:"deployment_mode" type:string value:"ha_compact"
    | name:"deployment_id" type:string value:"3"
    | name:"openstack_roles" type:string value:"primary-controller"
    | name:"openstack_release" type:string value:"2015.1.0-7.0"
    | name:"tag_fields" type:string value:"cpu_number"
    | name:"openstack_region" type:string value:"RegionOne"
    | name:"name" type:string value:"cpu_idle"
    | name:"hostname" type:string value:"node-6"
    | name:"value" type:double value:62.2994
    | name:"environment_label" type:string value:"test2"
    | name:"interval" type:double value:10
    | name:"cpu_number" type:string value:"0"
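The relationship between the collectd payload and the Heka fields in the dump above can be sketched in Python. The mapping is inferred from this single message (for example, "cpu" plus "idle" becoming "cpu_idle"); it is an assumption about what the decoder does, not the actual Lua decoder, and the function name is hypothetical. Deployment-related fields such as "deployment_id" are left out because they do not come from the payload.

```python
import json

# Payload copied from the Heka message shown above.
PAYLOAD = (
    '{"type":"cpu","values":[62.2994],"type_instance":"idle",'
    '"dsnames":["value"],"plugin":"cpu","time":1455193032.297,'
    '"interval":10,"host":"node-6","dstypes":["derive"],'
    '"plugin_instance":"0"}'
)


def decode_cpu_sample(payload):
    """Map a collectd CPU sample (JSON text) to Heka-style fields."""
    sample = json.loads(payload)
    return {
        # Metric name is assumed to be "<plugin>_<type_instance>".
        "name": "{}_{}".format(sample["plugin"], sample["type_instance"]),
        "value": sample["values"][0],
        "hostname": sample["host"],
        "interval": sample["interval"],
        "source": sample["plugin"],
        "type": sample["dstypes"][0],
        # The CPU core number travels in plugin_instance.
        "cpu_number": sample["plugin_instance"],
    }
```

Calling `decode_cpu_sample(PAYLOAD)` yields a dictionary whose "name", "value", "hostname" and "cpu_number" entries match the corresponding Fields lines in the message.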

Result

Collectd

Collectd collects the data; all details about collectd are in a separate document.

Collectd in LMA detailed review

Heka

Heka is a complex tool, so the data flow in Heka is described in separate documents and divided into parts.

Kibana and Grafana

TBD

Nagios

Passive checks overview: ToBeDone!
