Hi Yatin,

Thank you for the confirmation! I re-enabled the pacemaker and haproxy roles and have since been digging into why HA is failing. Here is what I am seeing:

1. pacemaker.service won't start because Corosync is not running.
2. Corosync seems to fail to start because the /etc/corosync/corosync.conf file does not exist.
3. The pcsd log file shows the following errors:
---
Config files sync started
Config files sync skipped, this host does not seem to be in a cluster of at least 2 nodes
---
This is what originally led me to believe that it wouldn't work without a proper HA environment with 3 nodes.
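In case it helps anyone reproduce the findings above, here is a minimal Python sketch of the checks I ran by hand. The helper names `unit_is_active` and `diagnose` are my own and hypothetical; the only things taken from this thread are the corosync.conf path and the three unit names.

```python
import os
import subprocess

# Path taken from item 2 above.
COROSYNC_CONF = "/etc/corosync/corosync.conf"


def unit_is_active(unit):
    """Return True if `systemctl is-active` reports the unit as active."""
    result = subprocess.run(["systemctl", "is-active", "--quiet", unit])
    return result.returncode == 0


def diagnose(conf_path=COROSYNC_CONF, is_active=unit_is_active):
    """Hypothetical helper: collect the problems described in items 1 and 2."""
    findings = []
    # Corosync cannot start without its configuration file.
    if not os.path.exists(conf_path):
        findings.append(f"{conf_path} is missing")
    # Pacemaker depends on corosync, so check the units in that order.
    for unit in ("corosync", "pacemaker", "pcsd"):
        if not is_active(unit):
            findings.append(f"{unit} is not active")
    return findings
```

Calling `diagnose()` on the controller returns a list of strings, which is empty when the file exists and all three units are active.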

The overcloud deployment itself simply times out at "Wait for puppet host configuration to finish". I saw that step_1 seems to be where things are failing (due to pacemaker), and when running it manually, I am seeing the following messages:

---
Debug: Executing: '/bin/systemctl is-enabled -- corosync'
Debug: Executing: '/bin/systemctl is-enabled -- pacemaker'
Debug: Executing: '/bin/systemctl is-active -- pcsd'
Debug: Executing: '/bin/systemctl is-enabled -- pcsd'
Debug: Exec[check-for-local-authentication](provider=posix): Executing check '/sbin/pcs status pcsd controller 2>&1 | grep 'Unable to authenticate''
Debug: Executing: '/sbin/pcs status pcsd controller 2>&1 | grep 'Unable to authenticate''
Debug: /Stage[main]/Pacemaker::Corosync/Exec[check-for-local-authentication]: '/bin/echo 'local pcsd auth failed, triggering a reauthentication'' won't be executed because of failed check 'onlyif'
Debug: /Stage[main]/Pacemaker::Corosync/Exec[reauthenticate-across-all-nodes]: '/sbin/pcs host auth controller.cloud.hirstgroup.net -u hacluster -p oaJOCgGDxRfJ1dLK' won't be executed because of failed check 'refreshonly'
Debug: /Stage[main]/Pacemaker::Corosync/Exec[auth-successful-across-all-nodes]: '/sbin/pcs host auth controller.cloud.hirstgroup.net -u hacluster -p oaJOCgGDxRfJ1dLK' won't be executed because of failed check 'refreshonly'
Debug: Exec[wait-for-settle](provider=posix): Executing check '/sbin/pcs status | grep -q 'partition with quorum' > /dev/null 2>&1'
Debug: Executing: '/sbin/pcs status | grep -q 'partition with quorum' > /dev/null 2>&1'
Debug: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/unless: Error: error running crm_mon, is pacemaker running?
Debug: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/unless:   Could not connect to the CIB: Transport endpoint is not connected
Debug: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/unless:   crm_mon: Error: cluster is not available on this node
Debug: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/returns: Exec try 1/360
Debug: Exec[wait-for-settle](provider=posix): Executing '/sbin/pcs status | grep -q 'partition with quorum' > /dev/null 2>&1'
Debug: Executing: '/sbin/pcs status | grep -q 'partition with quorum' > /dev/null 2>&1'
Debug: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/returns: Sleeping for 10.0 seconds between tries
Debug: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/returns: Exec try 2/360
Debug: Exec[wait-for-settle](provider=posix): Executing '/sbin/pcs status | grep -q 'partition with quorum' > /dev/null 2>&1'
Debug: Executing: '/sbin/pcs status | grep -q 'partition with quorum' > /dev/null 2>&1'
Debug: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/returns: Sleeping for 10.0 seconds between tries
---
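As far as I can tell from the output above, the wait-for-settle exec behaves roughly like the sketch below. This is only a Python illustration, not the actual puppet-pacemaker implementation: the names `cluster_has_quorum` and `wait_for_settle` and their parameters are hypothetical, and only the check command, the 360 tries, and the 10-second sleep come from the log. With 360 tries and 10 seconds between them, the loop can wait for roughly an hour, which would be consistent with the deployment step timing out.

```python
import subprocess
import time

# Check command copied from the Puppet debug output above.
CHECK_CMD = "/sbin/pcs status | grep -q 'partition with quorum' > /dev/null 2>&1"


def cluster_has_quorum():
    """Return True when the check command exits with status 0."""
    return subprocess.run(CHECK_CMD, shell=True).returncode == 0


def wait_for_settle(check=cluster_has_quorum, tries=360, try_sleep=10.0,
                    sleep=time.sleep):
    """Hypothetical helper: poll `check` until it succeeds or tries run out.

    Returns the number of the try that succeeded, or None if none did.
    """
    for attempt in range(1, tries + 1):
        if check():
            return attempt
        if attempt < tries:
            # Mirrors "Sleeping for 10.0 seconds between tries" in the log.
            sleep(try_sleep)
    return None
```

On my controller the check never succeeds because pacemaker is not running, so a loop like this would only end once all tries are used up.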

How does the corosync.conf file get created? Is it related to the pcsd error saying that config sync can't proceed due to the cluster not having a minimum of two members?

Thanks,
James H

On Mon, 28 Dec 2020 at 11:26, YATIN KAREL <yatinkarel@gmail.com> wrote:
Hi James,

On Sun, Dec 27, 2020 at 4:04 PM James Hirst <jdhirst12@gmail.com> wrote:
Hi All,

I am attempting to set up a single-controller overcloud with TripleO Victoria. I keep running into issues where starting pcsd fails in puppet step 1 on the controller. I attempted to solve this by simply removing the pacemaker service from my roles_data.yaml file, but then I ran into other errors requiring that the pacemaker service be enabled.

HA deployment has been enabled by default since the Ussuri release[1] with [2], so pacemaker is deployed by default whether you set up one or more controller nodes. Deploying without pacemaker is possible but needs more changes (apart from removing pacemaker from the roles_data.yaml file), such as adjusting resource_registry to use non-pacemaker resources. HA with 1 controller works fine, as we have green jobs[3][4] running with both 1 controller and 3 controllers, so I would recommend looking into why pcsd is failing for you and proceeding with HA. If you still want to go without pacemaker, you can try adjusting resource_registry to enable/disable pacemaker resources.
 
I have ControllerCount set to 1, which according to the docs is all I need to do to tell tripleo that I'm not using HA.

The docs might be outdated if they say that just setting ControllerCount to 1 is enough to deploy without pacemaker. You can report a bug or send a patch to fix that, including the link to the docs you are using.


Thanks,
James H

Thanks and regards
Yatin Karel