Pacemaker Corosync

From noname.com.ua

Revision as of 11:01, 7 February 2024

Recovering Galera running under PCS

This is a note on how to restore a collapsed cluster, kept here so it is not lost if it is needed again.

From any controller in the cluster, disable the MySQL resource in Pacemaker by running the following command:

pcs resource disable clone_p_mysqld

Wait for MySQL to shut down, then verify that the clone set clone_p_mysqld is stopped on all controllers:

pcs status resources

On every controller in the cluster, remove the contents of the MySQL data directory (or move them to a different place):

mkdir -p /tmp/mysql
mv /var/lib/mysql/* /tmp/mysql/

Choose the controller that you are going to restore first; this example uses a controller node named controller-x. Copy the database backup to the MySQL data directory on controller-x:

cp -R /ext-volume/mysql-backup/* /var/lib/mysql/

Change the owner of MySQL data directory on controller-x:

chown -R mysql:mysql /var/lib/mysql

Export variables for mysql-wss and start mysqld on controller-x:

export OCF_RESOURCE_INSTANCE=p_mysqld
export OCF_ROOT=/usr/lib/ocf
export OCF_RESKEY_socket=/var/run/mysqld/mysqld.sock
export OCF_RESKEY_master_timeout=10
export OCF_RESKEY_test_passwd=`crm_resource -r p_mysqld -g test_passwd`
export OCF_RESKEY_test_user=`crm_resource -r p_mysqld -g test_user`
export OCF_RESKEY_additional_parameters="--wsrep-new-cluster"
/usr/lib/ocf/resource.d/fuel/mysql-wss start

Execute monitor operation on controller-x to update Galera GTID in Pacemaker cluster configuration:

/usr/lib/ocf/resource.d/fuel/mysql-wss monitor

Export variables for mysql-wss and start mysqld on all other controllers:

export OCF_RESOURCE_INSTANCE=p_mysqld
export OCF_ROOT=/usr/lib/ocf
export OCF_RESKEY_socket=/var/run/mysqld/mysqld.sock
export OCF_RESKEY_master_timeout=10
export OCF_RESKEY_test_passwd=`crm_resource -r p_mysqld -g test_passwd`
export OCF_RESKEY_test_user=`crm_resource -r p_mysqld -g test_user`
/usr/lib/ocf/resource.d/fuel/mysql-wss start

From any controller in the cluster, enable MySQL resource in Pacemaker by running the following command:

pcs resource enable clone_p_mysqld

Verify that clone set clone_p_mysqld is running on all controllers:

pcs status resources
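The two export blocks above differ only in the `--wsrep-new-cluster` parameter, so they can be folded into one helper. This is a minimal sketch, not part of the original procedure: the function name `mysql_wss` and the `MYSQL_WSS_AGENT` override are hypothetical, while the OCF variables, the `crm_resource` calls and the agent path are the ones used in the steps above.

```shell
# Hypothetical wrapper: run one mysql-wss action with the OCF environment
# from the steps above. Usage: mysql_wss ACTION [ADDITIONAL_PARAMETERS]
mysql_wss() {
    action=$1
    extra=${2:-}   # e.g. "--wsrep-new-cluster" on the first node only
    (
        # The subshell keeps the exports from leaking into the caller's shell.
        export OCF_RESOURCE_INSTANCE=p_mysqld
        export OCF_ROOT=/usr/lib/ocf
        export OCF_RESKEY_socket=/var/run/mysqld/mysqld.sock
        export OCF_RESKEY_master_timeout=10
        OCF_RESKEY_test_passwd=$(crm_resource -r p_mysqld -g test_passwd)
        OCF_RESKEY_test_user=$(crm_resource -r p_mysqld -g test_user)
        export OCF_RESKEY_test_passwd OCF_RESKEY_test_user
        if [ -n "$extra" ]; then
            export OCF_RESKEY_additional_parameters="$extra"
        else
            # Avoid inheriting the parameter from an earlier manual export.
            unset OCF_RESKEY_additional_parameters
        fi
        # MYSQL_WSS_AGENT is an illustrative override, useful for dry runs.
        "${MYSQL_WSS_AGENT:-/usr/lib/ocf/resource.d/fuel/mysql-wss}" "$action"
    )
}
```

With this in place, the first node would run `mysql_wss start --wsrep-new-cluster` followed by `mysql_wss monitor`, and every other controller would run plain `mysql_wss start`.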


https://clusterlabs.org/pacemaker/doc/2.1/Pacemaker_Explained/html/nodes.html#tracking-node-health


2

All Controller services go down because a partition is full:

Horizon Web Interface is down
Cluster Public IP is unreachable
Floating IPs are unreachable

A "health_disk" change to red, caused by the full partition, can be traced in /var/log/pacemaker.log on the affected Controller:

# grep 'health_disk.*value="red"' /var/log/pacemaker.log
May 10 02:53:21 [9160] node-X.default.ltd cib: info: cib_perform_op: ++ /cib/status/node_state[@id='X']/transient_attributes[@id='X']/instance_attributes[@id='status-X']: <nvpair id="status-X-#health_disk" name="#health_disk" value="red"/>

After the health_disk change, the log shows the health strategy "migrate-on-red" being applied:

May 10 02:53:21 [9164] node-X.default.ltd pengine: info: apply_system_health: Applying automated node health strategy: migrate-on-red

On a cluster with only one Controller, all services leave the node and cannot find another suitable node:

May 10 02:53:51 [9164] node-X.default.ltd pengine: info: determine_online_status: Node node-X.default.ltd is online
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: determine_op_status: Operation monitor found resource p_vrouter:0 active on node-X.default.ltd
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: apply_system_health: Applying automated node health strategy: migrate-on-red
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: apply_system_health: Node node-X.default.ltd has an combined system health of -1000000
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: native_print: sysinfo_node-X.default.ltd (ocf::pacemaker:SysInfo): Stopped
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: clone_print: Clone Set: clone_p_vrouter [p_vrouter]
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: short_print: Stopped: [ node-X.default.ltd ]
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: native_print: vip__management (ocf::fuel:ns_IPaddr2): Stopped
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: native_print: vip__vrouter_pub (ocf::fuel:ns_IPaddr2): Stopped
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: native_print: vip__vrouter (ocf::fuel:ns_IPaddr2): Stopped
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: native_print: vip__public (ocf::fuel:ns_IPaddr2): Stopped
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: clone_print: Clone Set: clone_p_haproxy [p_haproxy]
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: short_print: Stopped: [ node-X.default.ltd ]
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: clone_print: Clone Set: clone_p_mysql [p_mysql]
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: short_print: Stopped: [ node-X.default.ltd ]
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: clone_print: Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: short_print: Stopped: [ node-X.default.ltd ]
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: clone_print: Clone Set: clone_p_dns [p_dns]
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: short_print: Stopped: [ node-X.default.ltd ]
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: clone_print: Master/Slave Set: master_p_conntrackd [p_conntrackd]
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: short_print: Stopped: [ node-X.default.ltd ]
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: clone_print: Clone Set: clone_p_heat-engine [p_heat-engine]
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: short_print: Stopped: [ node-X.default.ltd ]
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: clone_print: Clone Set: clone_p_neutron-plugin-openvswitch-agent [p_neutron-plugin-openvswitch-agent]
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: short_print: Stopped: [ node-X.default.ltd ]
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: clone_print: Clone Set: clone_p_neutron-l3-agent [p_neutron-l3-agent]
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: short_print: Stopped: [ node-X.default.ltd ]
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: clone_print: Clone Set: clone_p_neutron-dhcp-agent [p_neutron-dhcp-agent]
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: short_print: Stopped: [ node-X.default.ltd ]
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: clone_print: Clone Set: clone_p_neutron-metadata-agent [p_neutron-metadata-agent]
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: short_print: Stopped: [ node-X.default.ltd ]
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: clone_print: Clone Set: clone_ping_vip__public [ping_vip__public]
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: short_print: Stopped: [ node-X.default.ltd ]
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: clone_print: Clone Set: clone_p_ntp [p_ntp]
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: short_print: Stopped: [ node-X.default.ltd ]
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: native_color: Resource sysinfo_node-X.default.ltd cannot run anywhere
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: rsc_merge_weights: clone_p_vrouter: Rolling back scores from clone_p_dns
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: native_color: Resource p_vrouter:0 cannot run anywhere
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: rsc_merge_weights: clone_p_haproxy: Rolling back scores from vip__management
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: rsc_merge_weights: clone_p_haproxy: Rolling back scores from vip__public
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: native_color: Resource p_haproxy:0 cannot run anywhere
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: native_color: Resource vip__management cannot run anywhere
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: rsc_merge_weights: vip__vrouter_pub: Rolling back scores from master_p_conntrackd
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: rsc_merge_weights: vip__vrouter_pub: Rolling back scores from vip__vrouter
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: native_color: Resource vip__vrouter_pub cannot run anywhere
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: native_color: Resource vip__vrouter cannot run anywhere
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: native_color: Resource vip__public cannot run anywhere
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: native_color: Resource p_mysql:0 cannot run anywhere
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: native_color: Resource p_rabbitmq-server:0 cannot run anywhere
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: master_color: master_p_rabbitmq-server: Promoted 0 instances of a possible 1 to master
May 10 02:53:51 [9164] node-X.default.ltd pengine: notice: clone_rsc_colocation_rh: Cannot pair p_dns:0 with instance of clone_p_vrouter
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: native_color: Resource p_dns:0 cannot run anywhere
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: native_color: p_conntrackd:0: Rolling back scores from vip__vrouter_pub
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: native_color: Resource p_conntrackd:0 cannot run anywhere
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: master_color: master_p_conntrackd: Promoted 0 instances of a possible 1 to master
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: native_color: Resource p_heat-engine:0 cannot run anywhere
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: native_color: Resource p_neutron-plugin-openvswitch-agent:0 cannot run anywhere
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: native_color: Resource p_neutron-l3-agent:0 cannot run anywhere
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: native_color: Resource p_neutron-dhcp-agent:0 cannot run anywhere
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: native_color: Resource p_neutron-metadata-agent:0 cannot run anywhere
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: native_color: Resource ping_vip__public:0 cannot run anywhere
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: native_color: Resource p_ntp:0 cannot run anywhere
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: LogActions: Leave sysinfo_node-X.default.ltd (Stopped)
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: LogActions: Leave p_vrouter:0 (Stopped)
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: LogActions: Leave vip__management (Stopped)
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: LogActions: Leave vip__vrouter_pub (Stopped)
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: LogActions: Leave vip__vrouter (Stopped)
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: LogActions: Leave vip__public (Stopped)
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: LogActions: Leave p_haproxy:0 (Stopped)
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: LogActions: Leave p_mysql:0 (Stopped)
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: LogActions: Leave p_rabbitmq-server:0 (Stopped)
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: LogActions: Leave p_dns:0 (Stopped)
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: LogActions: Leave p_conntrackd:0 (Stopped)
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: LogActions: Leave p_heat-engine:0 (Stopped)
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: LogActions: Leave p_neutron-plugin-openvswitch-agent:0 (Stopped)
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: LogActions: Leave p_neutron-l3-agent:0 (Stopped)
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: LogActions: Leave p_neutron-dhcp-agent:0 (Stopped)
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: LogActions: Leave p_neutron-metadata-agent:0 (Stopped)
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: LogActions: Leave ping_vip__public:0 (Stopped)
May 10 02:53:51 [9164] node-X.default.ltd pengine: info: LogActions: Leave p_ntp:0 (Stopped)

Environment

Mirantis OpenStack 8.0, 9.2
Pacemaker
HA Mode cluster with only 1 Controller

Resolution

1. Determine which partition is full:

# df -h / /var/log /var/lib/mysql

2. Clean up the partition that is full. Partitions can fill up for different reasons, but the ones most likely to fill up are /var/log and /var/lib/mysql:

/var/log: see "How to Clean up a full /var/log/ partition in a Controller on Mirantis OpenStack 8.0"
/var/lib/mysql: see "How to purge MySQL binary logs to recover space on MySQL partition"

3. Stop and start Pacemaker:

# service pacemaker stop
# service pacemaker start
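To see when a node turned red before cleaning up, the grep on /var/log/pacemaker.log shown earlier can be reduced to just the timestamp and the health attribute. The following is a sketch only: the function name `health_red_events` is illustrative, and it assumes log lines shaped like the `cib_perform_op` example above (syslog-style timestamp first, the `nvpair` element at the end).

```shell
# Print "<timestamp> <attribute>" for every #health_* attribute set to red.
# Usage: health_red_events [LOGFILE]   (defaults to /var/log/pacemaker.log)
health_red_events() {
    grep -E 'name="#health_[a-z_]+" value="red"' "${1:-/var/log/pacemaker.log}" |
        sed -E 's/^([A-Z][a-z]{2} +[0-9]+ [0-9:]{8}).*name="(#health_[a-z_]+)".*/\1 \2/'
}
```

Run without arguments on the affected Controller, it lists one line per red transition, which helps match the outage window to the moment the partition filled up.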

Cause

Mirantis OpenStack 8.0 uses a Pacemaker/Corosync feature not previously used in Mirantis OpenStack: node-health-strategy=migrate-on-red is now set by default.

# crm configure show cib-bootstrap-options
property cib-bootstrap-options: \
    dc-version=1.1.12-561c4cf \
    cluster-infrastructure=corosync \
    no-quorum-policy=stop \
    cluster-recheck-interval=190s \
    stonith-enabled=false \
    start-failure-is-fatal=false \
    symmetric-cluster=false \
    last-lrm-refresh=1461091450 \
    node-health-strategy=migrate-on-red

New Pacemaker/Corosync primitives monitor the node health; each Controller node should have a sysinfo_node-x primitive.

# crm configure show sysinfo_*
primitive sysinfo_node-1.default.tld ocf:pacemaker:SysInfo \
    op monitor interval=15s \
    params disks="/ /var/log /var/lib/mysql" min_disk_free=512M disk_unit=M

Three partitions are monitored:

/
/var/log
/var/lib/mysql

The node health changes to red when free space on any of these partitions drops to the 512 MB minimum, which triggers evacuation of the node's services to another node. In a one-node cluster, the problem is that once the services have left the node, there is no other node to evacuate them to.
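A rough manual check against the same 512 MB threshold can flag a partition before the node goes red. This sketch assumes a POSIX `df -Pk`; the function name `low_disk_mounts` is illustrative, and it only approximates the SysInfo check rather than reproducing how the agent measures free space.

```shell
# Print every given mount point whose available space is below THRESHOLD_MB.
# Usage: low_disk_mounts THRESHOLD_MB PATH...
low_disk_mounts() {
    threshold_mb=$1
    shift
    for mount in "$@"; do
        # With -P the data row is always line 2; column 4 is available 1K blocks.
        avail_kb=$(df -Pk "$mount" 2>/dev/null | awk 'NR==2 {print $4}')
        [ -n "$avail_kb" ] || continue   # skip paths df cannot resolve
        if [ $((avail_kb / 1024)) -lt "$threshold_mb" ]; then
            echo "$mount"
        fi
    done
}
```

For the partitions monitored above, the call would be `low_disk_mounts 512 / /var/log /var/lib/mysql`; empty output means all of them are still above the threshold.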


SHOW GLOBAL STATUS LIKE 'wsrep_%';

3

Problem

MySQL is not running on one of the controller nodes:

# pcs status
Clone Set: clone_p_mysql [p_mysql]
   Started: [ b05-39-controller.domain.tld b06-39-controller.domain.tld ]
   Stopped: [ b05-38-controller.domain.tld ]

Environment

Mirantis OpenStack 7.0 and higher
MySQL
Pacemaker
Galera

Resolution

1. Make sure MySQL is stopped on the problematic node:

# ps -ef | grep mysql
root 14878 5566 0 14:33 pts/0 00:00:00 grep --color=auto mysql

2. Edit the default timeout for the start operation:

# crm configure edit p_mysql

Note: for Mirantis OpenStack 9.0 and higher, run the following command instead:

# crm configure edit p_mysqld

3. Temporarily set the timeout value for p_mysql-start-0 to 1200:

# pcs resource show p_mysql
 Resource: p_mysql (class=ocf provider=fuel type=mysql-wss)
  Attributes: test_user=wsrep_sst test_passwd=??? socket=/var/run/mysqld/mysqld.sock
  Operations: monitor interval=60 timeout=55 (p_mysql-monitor-60)
              start interval=0 timeout=1200 (p_mysql-start-0)
              stop interval=0 timeout=120 (p_mysql-stop-0)

4. Clean up the p_mysql resource:

# crm resource cleanup p_mysql

5. Wait 15-20 minutes for synchronization to complete.

6. Run pcs status again to check that MySQL is back up and running on the problematic controller node:

# pcs status
Clone Set: clone_p_mysql [p_mysql]
   Started: [ b05-38-controller.domain.tld b05-39-controller.domain.tld b06-39-controller.domain.tld ]

7. Ensure that the cluster is synced again (wsrep_local_state_comment should be Synced):

| wsrep_local_state_comment | Synced |
| wsrep_cert_index_size | 899 |
| wsrep_causal_reads | 0 |
| wsrep_incoming_addresses | 10.128.0.133:3307,10.128.0.132:3307,10.128.0.134:3307 |

8. Reset the timeout value for p_mysql-start-0 to its default value.
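The "cluster is synced" check above can be scripted by filtering the wsrep status output. This is a minimal sketch: the function name `wsrep_is_synced` is illustrative, and it assumes the status rows arrive on standard input either in the table layout shown above or as plain whitespace-separated name/value pairs.

```shell
# Exit 0 if the wsrep status rows on stdin report wsrep_local_state_comment
# as Synced, and non-zero otherwise (including when the row is missing).
wsrep_is_synced() {
    awk '
        { gsub(/[|]/, " ") }                                # drop table borders
        $1 == "wsrep_local_state_comment" { state = $2 }
        END { exit (state == "Synced" ? 0 : 1) }
    '
}
```

One way to feed it is `mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_%';" | wsrep_is_synced && echo synced`, run on the controller that was just recovered.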