[rdo-users] Single Controller Environment in Victoria
James Hirst
jdhirst12 at gmail.com
Mon Dec 28 13:01:08 UTC 2020
Hi Yatin,
I just deleted the overcloud and re-ran the deployment, and it got stuck at
the same place: applying the puppet host configuration to the controller.
I have compared my pcsd.log file to the one you linked to and mine has far
less activity; the only requests it receives are auth requests like this:
200 POST /remote/auth (10.27.0.4) 45.73ms
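For reference, I was skimming the log with something like this (assuming the
default pcsd log location; adjust the path if yours differs):
---
# show everything except the /remote/auth requests
grep -v '/remote/auth' /var/log/pcsd/pcsd.log
---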
I have attached the logs here:
ansible.log:
<https://drive.google.com/file/d/1eN15ZJzn_4GesrqTT2OcYpCigAK8rvnE/view?usp=sharing>
controller log bundle:
<https://drive.google.com/file/d/1cTC3lPRf3wjYB5SW1dNqlp--_038GInj/view?usp=sharing>
(Note: ansible.log ends without an error because I did not wait for the
deployment to retry 1100 times; I interrupted it with CTRL+C.)
Are there any other logs I should gather?
Thanks,
James H
On Mon, 28 Dec 2020 at 13:29, YATIN KAREL <yatinkarel at gmail.com> wrote:
> Hi James,
>
> On Mon, Dec 28, 2020 at 4:28 PM James Hirst <jdhirst12 at gmail.com> wrote:
>
>> Hi Yatin,
>>
>> Thank you for the confirmation! I re-enabled the pacemaker and haproxy
>> roles and have since been digging into why HA is failing. I am seeing the
>> following (the commands I used to check this are sketched after the list):
>>
>> 1. pacemaker.service won't start because Corosync is not running.
>> 2. Corosync fails to start because /etc/corosync/corosync.conf does not
>> exist.
>> 3. The pcsd log file shows the following errors:
>> ---
>> Config files sync started
>> Config files sync skipped, this host does not seem to be in a cluster of
>> at least 2 nodes
>> ---
>> This is what originally led me to believe that it wouldn't work without a
>> proper HA environment with 3 nodes.
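>> For completeness, the checks behind the list above were roughly the
>> following (assuming the stock unit and file names):
>> ---
>> # items 1 and 2: service state and the missing corosync config
>> systemctl status corosync pacemaker pcsd
>> ls -l /etc/corosync/corosync.conf
>> # item 3: recent pcsd activity
>> tail -n 50 /var/log/pcsd/pcsd.log
>> ---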
>>
> That is not fatal; I see the same in the job logs:
> https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-1ctlr_1comp-featureset002-victoria/0bccbf6/logs/overcloud-controller-0/var/log/pcsd/pcsd.log.txt.gz
>
>> The overcloud deployment itself simply times out at "Wait for puppet host
>> configuration to finish". I saw that step_1 seems to be where things are
>> failing (due to pacemaker), and when running it manually, I am seeing the
>> following messages:
>>
>> ---
>> Debug: Executing: '/bin/systemctl is-enabled -- corosync'
>> Debug: Executing: '/bin/systemctl is-enabled -- pacemaker'
>> Debug: Executing: '/bin/systemctl is-active -- pcsd'
>> Debug: Executing: '/bin/systemctl is-enabled -- pcsd'
>> Debug: Exec[check-for-local-authentication](provider=posix): Executing
>> check '/sbin/pcs status pcsd controller 2>&1 | grep 'Unable to
>> authenticate''
>> Debug: Executing: '/sbin/pcs status pcsd controller 2>&1 | grep 'Unable
>> to authenticate''
>> Debug:
>> /Stage[main]/Pacemaker::Corosync/Exec[check-for-local-authentication]:
>> '/bin/echo 'local pcsd auth failed, triggering a reauthentication'' won't
>> be executed because of failed check 'onlyif'
>> Debug:
>> /Stage[main]/Pacemaker::Corosync/Exec[reauthenticate-across-all-nodes]:
>> '/sbin/pcs host auth controller.cloud.hirstgroup.net -u hacluster -p
>> oaJOCgGDxRfJ1dLK' won't be executed because of failed check 'refreshonly'
>> Debug:
>> /Stage[main]/Pacemaker::Corosync/Exec[auth-successful-across-all-nodes]:
>> '/sbin/pcs host auth controller.cloud.hirstgroup.net -u hacluster -p
>> oaJOCgGDxRfJ1dLK' won't be executed because of failed check 'refreshonly'
>> Debug: Exec[wait-for-settle](provider=posix): Executing check '/sbin/pcs
>> status | grep -q 'partition with quorum' > /dev/null 2>&1'
>> Debug: Executing: '/sbin/pcs status | grep -q 'partition with quorum' >
>> /dev/null 2>&1'
>> Debug: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/unless:
>> Error: error running crm_mon, is pacemaker running?
>> Debug: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/unless:
>> Could not connect to the CIB: Transport endpoint is not connected
>> Debug: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/unless:
>> crm_mon: Error: cluster is not available on this node
>> Debug: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/returns:
>> Exec try 1/360
>> Debug: Exec[wait-for-settle](provider=posix): Executing '/sbin/pcs status
>> | grep -q 'partition with quorum' > /dev/null 2>&1'
>> Debug: Executing: '/sbin/pcs status | grep -q 'partition with quorum' >
>> /dev/null 2>&1'
>> Debug: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/returns:
>> Sleeping for 10.0 seconds between tries
>> Debug: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/returns:
>> Exec try 2/360
>> Debug: Exec[wait-for-settle](provider=posix): Executing '/sbin/pcs status
>> | grep -q 'partition with quorum' > /dev/null 2>&1'
>> Debug: Executing: '/sbin/pcs status | grep -q 'partition with quorum' >
>> /dev/null 2>&1'
>> Debug: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/returns:
>> Sleeping for 10.0 seconds between tries
>> ---
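>> For what it's worth, the settle check from the exec above can be run by
>> hand, and it fails the same way since pacemaker never comes up:
>> ---
>> /sbin/pcs status | grep -q 'partition with quorum'; echo $?
>> # prints 1 here, matching the crm_mon/CIB errors above
>> ---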
>>
>> How does the corosync.conf file get created? Is it related to the pcsd
>> error saying that config sync can't proceed due to the cluster not having a
>> minimum of two members?
>>
> No, that's not related, as per the pcsd.log shared above. AFAIK corosync.conf
> is created by the pcs daemon itself by default when pcsd is used.
> Did you try this on an already deployed overcloud? If it is just a test
> setup, try an overcloud delete and a fresh install, as I am not sure how well
> a redeployment with HA enabled/disabled works. Also share the full logs from
> that run, as they will give some hints, and share which docs/steps you are
> following so we can see whether some customization is being done. For your
> current failure, /var/log/pcsd/pcsd.log on the controller node should also
> have some details about the failure.
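> Regarding how corosync.conf gets created, here is a rough sketch of the pcs
> flow (the node name, cluster name and password below are placeholders, not
> the exact values the puppet module uses):
> ---
> # authenticate pcsd on the node(s)
> pcs host auth controller.example.com -u hacluster -p <password>
> # "pcs cluster setup" is the step that writes /etc/corosync/corosync.conf
> pcs cluster setup my_cluster controller.example.com --force
> pcs cluster start --all
> ---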
>
>> Thanks,
>> James H
>>
>> On Mon, 28 Dec 2020 at 11:26, YATIN KAREL <yatinkarel at gmail.com> wrote:
>>
>>> Hi James,
>>>
>>> On Sun, Dec 27, 2020 at 4:04 PM James Hirst <jdhirst12 at gmail.com> wrote:
>>>
>>>> HI All,
>>>>
>>>> I am attempting to set up a single controller overcloud with tripleo
>>>> Victoria. I keep running into issues where pcsd is attempting to be started
>>>> in puppet step 1 on the controller and it fails. I attempted to solve this
>>>> by simply removing the pacemaker service from my roles_data.yaml file, but
>>>> then I ran into other errors requiring that the pacemaker service be
>>>> enabled.
>>>>
>>> HA deployment is enabled by default since the Ussuri release[1], via [2],
>>> so pacemaker will be deployed by default whether you set up 1 or more
>>> controller nodes. Deployment without pacemaker is possible but would need
>>> more changes (apart from removing pacemaker from the roles_data.yaml
>>> file), like adjusting the resource_registry to use non-pacemaker
>>> resources. HA with 1 controller works fine; we have green jobs[3][4]
>>> running with both 1 controller and 3 controllers, so I would recommend
>>> looking into why pcsd is failing for you and proceeding with HA. But if
>>> you still want to go without pacemaker, you can try adjusting the
>>> resource_registry to enable/disable the pacemaker resources.
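>>> As a starting point (the path below is just the default package location
>>> of the templates), you can see which environment files touch those
>>> mappings with something like:
>>> ---
>>> grep -rl "OS::TripleO::Services::Pacemaker" \
>>>     /usr/share/openstack-tripleo-heat-templates/environments/
>>> ---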
>>>
>>>
>>>> I have ControllerCount set to 1, which according to the docs is all I
>>>> need to do to tell tripleo that I'm not using HA.
>>>>
>>> The docs might be outdated if they say that setting ControllerCount to 1
>>> is enough to deploy without pacemaker; you can report a bug or send a
>>> patch to fix that, citing the docs link you are using.
>>>
>>>
>>>> Thanks,
>>>> James H
>>>
>>>
>>> [1]
>>> https://docs.openstack.org/releasenotes/tripleo-heat-templates/ussuri.html#relnotes-12-3-0-stable-ussuri-other-notes
>>> [2]
>>> https://review.opendev.org/c/openstack/tripleo-heat-templates/+/359060
>>> [3]
>>> https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-1ctlr_1comp-featureset002-victoria/0bccbf6/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz
>>> [4]
>>> https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-victoria/a5dd4bc/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz
>>>
>>>
>>> Thanks and regards
>>> Yatin Karel
>>>
>>
>
> Thanks and Regards
> Yatin Karel
>