[rdo-users] Single Controller Environment in Victoria

Mon Dec 28 20:01:39 UTC 2020

Hi Yatin,

I figured out the cause of the issue: I had the hostnames configured wrong.
I was using FQDNs instead of short hostnames, so tripleo was appending the
domain name to them incorrectly. I'm not sure why this broke pacemaker, but
it works correctly now.
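
For anyone hitting the same problem, here is a minimal sketch of a pre-deployment
check for this mistake. It assumes only that a configured hostname containing a
dot is an FQDN; the function names are hypothetical and not part of tripleo, and
the hostname list has to be collected by hand from wherever the hostnames are
configured.

```python
def short_hostname(name: str) -> str:
    """Return the part of a hostname before the first dot."""
    # Drop surrounding whitespace and a trailing root dot, then cut the domain.
    return name.strip().rstrip(".").split(".", 1)[0]


def find_fqdns(names):
    """Return the entries that still carry a domain part."""
    # Assumption: any dot left after removing a trailing root dot marks an FQDN.
    return [n for n in names if "." in n.strip().rstrip(".")]


if __name__ == "__main__":
    # Hypothetical input: hostnames copied out of the deployment configuration.
    configured = ["controller.cloud.hirstgroup.net", "compute"]
    for name in find_fqdns(configured):
        print(f"{name}: looks like an FQDN, short form is {short_hostname(name)}")
```

Running it against the hostnames before deploying flags any entry that would get
the domain appended a second time.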

Thanks for your help!
-James H

On Mon, 28 Dec 2020 at 16:15, YATIN KAREL <yatinkarel at gmail.com> wrote:

> Hi James,
>
> On Mon, Dec 28, 2020 at 6:31 PM James Hirst <jdhirst12 at gmail.com> wrote:
>
>> Hi Yatin,
>>
>> I just deleted the overcloud and re-ran the deployment and it got stuck
>> at the same place, when applying puppet host configuration to the
>> controller.
>>
>> I have compared my pcsd.log file to the one you linked to and it seems
>> that mine has far less activity; the only requests I'm seeing received by
>> it are auth requests like this:
>> 200 POST /remote/auth (10.27.0.4) 45.73ms
>>
> I see pcsd[74686]: WARNING:pcs.daemon:Caught signal: 15, shutting down,
> which looks suspicious.
> It also seems you shared logs from previous runs. Can you share the latest
> sos report for the controller and overcloud-deploy.log? It might also be
> due to you using deployed servers; can you clean those as well before
> retrying?
>
>> I have attached the logs here: ansible.log
>> <https://drive.google.com/file/d/1eN15ZJzn_4GesrqTT2OcYpCigAK8rvnE/view?usp=sharing>,
>> controller log bundle
>> <https://drive.google.com/file/d/1cTC3lPRf3wjYB5SW1dNqlp--_038GInj/view?usp=sharing>
>> (note: ansible.log just ends without an error during the deployment, as I
>> didn't wait for it to retry 1100 times and CTRL+C'd it.)
>>
>> Are there any other logs I should gather?
>>
>> Thanks,
>> James H
>>
>> On Mon, 28 Dec 2020 at 13:29, YATIN KAREL <yatinkarel at gmail.com> wrote:
>>
>>> Hi James,
>>>
>>> On Mon, Dec 28, 2020 at 4:28 PM James Hirst <jdhirst12 at gmail.com> wrote:
>>>
>>>> Hi Yatin,
>>>>
>>>> Thank you for the confirmation! I re-enabled the pacemaker and haproxy
>>>> roles and have since been digging into why HA has been failing. I am
>>>> seeing the following:
>>>>
>>>> 1. pacemaker.service won't start due to Corosync not running.
>>>> 2. Corosync seems to be failing to start due to not having the
>>>> /etc/corosync/corosync.conf file as it does not exist.
>>>> 3. The pcsd log file shows the following errors:
>>>> ---
>>>> Config files sync started
>>>> Config files sync skipped, this host does not seem to be in a cluster
>>>> of at least 2 nodes
>>>> ---
>>>> This is what originally led me to believe that it wouldn't work without
>>>> a proper HA environment with 3 nodes.
>>>>
>>> That is not fatal; I see the same in the job logs:
>>> https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-1ctlr_1comp-featureset002-victoria/0bccbf6/logs/overcloud-controller-0/var/log/pcsd/pcsd.log.txt.gz
>>>
>>>> The overcloud deployment itself simply times out at "Wait for puppet
>>>> host configuration to finish". I saw that step_1 seems to be where things
>>>> are failing (due to pacemaker), and when running it manually, I am seeing
>>>> the following messages:
>>>>
>>>> ---
>>>> Debug: Executing: '/bin/systemctl is-enabled -- corosync'
>>>> Debug: Executing: '/bin/systemctl is-enabled -- pacemaker'
>>>> Debug: Executing: '/bin/systemctl is-active -- pcsd'
>>>> Debug: Executing: '/bin/systemctl is-enabled -- pcsd'
>>>> Debug: Exec[check-for-local-authentication](provider=posix): Executing
>>>> check '/sbin/pcs status pcsd controller 2>&1 | grep 'Unable to
>>>> authenticate''
>>>> Debug: Executing: '/sbin/pcs status pcsd controller 2>&1 | grep 'Unable
>>>> to authenticate''
>>>> Debug:
>>>> /Stage[main]/Pacemaker::Corosync/Exec[check-for-local-authentication]:
>>>> '/bin/echo 'local pcsd auth failed, triggering a reauthentication'' won't
>>>> be executed because of failed check 'onlyif'
>>>> Debug:
>>>> /Stage[main]/Pacemaker::Corosync/Exec[reauthenticate-across-all-nodes]:
>>>> '/sbin/pcs host auth controller.cloud.hirstgroup.net -u hacluster -p
>>>> oaJOCgGDxRfJ1dLK' won't be executed because of failed check 'refreshonly'
>>>> Debug:
>>>> /Stage[main]/Pacemaker::Corosync/Exec[auth-successful-across-all-nodes]:
>>>> '/sbin/pcs host auth controller.cloud.hirstgroup.net -u hacluster -p
>>>> oaJOCgGDxRfJ1dLK' won't be executed because of failed check 'refreshonly'
>>>> Debug: Exec[wait-for-settle](provider=posix): Executing check
>>>> '/sbin/pcs status | grep -q 'partition with quorum' > /dev/null 2>&1'
>>>> Debug: Executing: '/sbin/pcs status | grep -q 'partition with quorum' >
>>>> /dev/null 2>&1'
>>>> Debug: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/unless:
>>>> Error: error running crm_mon, is pacemaker running?
>>>> Debug: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/unless:
>>>> Could not connect to the CIB: Transport endpoint is not connected
>>>> Debug: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/unless:
>>>> crm_mon: Error: cluster is not available on this node
>>>> Debug: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/returns:
>>>> Exec try 1/360
>>>> Debug: Exec[wait-for-settle](provider=posix): Executing '/sbin/pcs
>>>> status | grep -q 'partition with quorum' > /dev/null 2>&1'
>>>> Debug: Executing: '/sbin/pcs status | grep -q 'partition with quorum' >
>>>> /dev/null 2>&1'
>>>> Debug: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/returns:
>>>> Sleeping for 10.0 seconds between tries
>>>> Debug: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/returns:
>>>> Exec try 2/360
>>>> Debug: Exec[wait-for-settle](provider=posix): Executing '/sbin/pcs
>>>> status | grep -q 'partition with quorum' > /dev/null 2>&1'
>>>> Debug: Executing: '/sbin/pcs status | grep -q 'partition with quorum' >
>>>> /dev/null 2>&1'
>>>> Debug: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/returns:
>>>> Sleeping for 10.0 seconds between tries
>>>> ---
>>>>
>>>> How does the corosync.conf file get created? Is it related to the pcsd
>>>> error saying that config sync can't proceed due to the cluster not having a
>>>> minimum of two members?
>>>>
>>> No, that's not related, as per the pcsd.log shared above. AFAIK
>>> corosync.conf is created by the pcs daemon itself by default when pcsd is
>>> used.
>>> Did you try that on an already deployed overcloud? If that's just a test
>>> setup, try an overcloud delete and a fresh install, as I am not sure how
>>> well a redeployment with HA enabled/disabled works. Also share the full
>>> logs, as they will give some hints, and share which docs/steps you are
>>> using so we can see whether some customization is being done. Also, for
>>> your current failure, /var/log/pcsd/pcsd.log.txt.gz on the controller node
>>> should have some details about the failure.
>>>
>>>> Thanks,
>>>> James H
>>>>
>>>> On Mon, 28 Dec 2020 at 11:26, YATIN KAREL <yatinkarel at gmail.com> wrote:
>>>>
>>>>> Hi James,
>>>>>
>>>>> On Sun, Dec 27, 2020 at 4:04 PM James Hirst <jdhirst12 at gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> I am attempting to set up a single controller overcloud with tripleo
>>>>>> Victoria. I keep running into issues where pcsd is attempting to be started
>>>>>> in puppet step 1 on the controller and it fails. I attempted to solve this
>>>>>> by simply removing the pacemaker service from my roles_data.yaml file, but
>>>>>> then I ran into other errors requiring that the pacemaker service be
>>>>>> enabled.
>>>>>>
>>>>> HA deployment is enabled by default since the Ussuri release[1] with [2],
>>>>> so pacemaker will be deployed by default whether you set up 1 or more
>>>>> controller nodes. Deploying without pacemaker is possible but would need
>>>>> more changes (apart from removing pacemaker from the roles_data.yaml
>>>>> file), like adjusting resource_registry to use non-pacemaker resources.
>>>>> HA with 1 controller works fine, as we have green jobs[3][4] running with
>>>>> both 1 controller and 3 controllers, so I would recommend looking into
>>>>> why pcsd is failing for you and proceeding with HA. But if you still want
>>>>> to go without pacemaker, you can try adjusting resource_registry to
>>>>> enable/disable pacemaker resources.
>>>>>
>>>>>
>>>>>> I have ControllerCount set to 1, which according to the docs is all I
>>>>>> need to do to tell tripleo that I'm not using HA.
>>>>>>
>>>>> The docs might be outdated if they specify that just setting
>>>>> ControllerCount to 1 is enough to deploy without pacemaker. You can
>>>>> report a bug or send a patch to fix that, with the link to the docs you
>>>>> are using.
>>>>>
>>>>>
>>>>>> Thanks,
>>>>>> James H
>>>>>>
>>>>>
>>>>>
>>>>> [1]
>>>>> https://docs.openstack.org/releasenotes/tripleo-heat-templates/ussuri.html#relnotes-12-3-0-stable-ussuri-other-notes
>>>>> [2]
>>>>> https://review.opendev.org/c/openstack/tripleo-heat-templates/+/359060
>>>>> [3]
>>>>> https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-1ctlr_1comp-featureset002-victoria/0bccbf6/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz
>>>>> [4]
>>>>> https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-victoria/a5dd4bc/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz
>>>>>
>>>>>
>>>>> Thanks and regards
>>>>> Yatin Karel
>>>>>
>>>>
>>>
>>> Thanks and Regards
>>> Yatin Karel
>>>
>>
>
> --
> Yatin Karel
>