[rdo-users] Single Controller Environment in Victoria

YATIN KAREL yatinkarel at gmail.com
Mon Dec 28 15:14:58 UTC 2020


Hi James,

On Mon, Dec 28, 2020 at 6:31 PM James Hirst <jdhirst12 at gmail.com> wrote:

> Hi Yatin,
>
> I just deleted the overcloud and re-ran the deployment and it got stuck at
> the same place, when applying puppet host configuration to the controller.
>
> I have compared my pcsd.log file to the one you linked to and it seems
> that mine has far less activity; the only requests I'm seeing received by
> it are auth requests like this:
> 200 POST /remote/auth (10.27.0.4) 45.73ms
>
I see pcsd[74686]: WARNING:pcs.daemon:Caught signal: 15, shutting down,
which looks suspicious.
It also seems you shared logs from previous runs; can you share the latest
sos report for the controller along with overcloud-deploy.log? It might also
be related to your use of pre-deployed servers, so please clean those up as
well before retrying.
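For reference, a quick way to bundle just the cluster-related controller logs (a hypothetical helper, not a TripleO tool; the paths are the usual pcsd/corosync/system log locations on CentOS 8, and the root is parameterised so the same function can be pointed at an extracted sos report directory):

```shell
# collect_logs: tar up the cluster-related logs from a controller.
# The root defaults to / but can point at an extracted sos report too.
collect_logs() {
    root="${1:-/}"
    out="${2:-controller-logs.tar.gz}"
    files=""
    # Usual log locations; skip any that do not exist on this host.
    for f in var/log/pcsd/pcsd.log var/log/cluster/corosync.log var/log/messages; do
        [ -f "$root/$f" ] && files="$files $f"
    done
    if [ -z "$files" ]; then
        echo "no matching logs under $root" >&2
        return 1
    fi
    tar -C "$root" -czf "$out" $files
    echo "wrote $out:$files"
}
```

Something like `collect_logs / controller-logs.tar.gz` run on the controller would produce an archive small enough to attach alongside the full sos report.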

> I have attached the logs here: ansible.log
> <https://drive.google.com/file/d/1eN15ZJzn_4GesrqTT2OcYpCigAK8rvnE/view?usp=sharing> controller
> log bundle
> <https://drive.google.com/file/d/1cTC3lPRf3wjYB5SW1dNqlp--_038GInj/view?usp=sharing> (note:
> ansible.log just ends without an error during the deployment, as I
> didn't wait for it to retry 1100 times and CTRL+C'd it instead.)
>
> Are there any other logs I should gather?
>
> Thanks,
> James H
>
> On Mon, 28 Dec 2020 at 13:29, YATIN KAREL <yatinkarel at gmail.com> wrote:
>
>> Hi James,
>>
>> On Mon, Dec 28, 2020 at 4:28 PM James Hirst <jdhirst12 at gmail.com> wrote:
>>
>>> Hi Yatin,
>>>
>>> Thank you for the confirmation! I re-enabled the pacemaker and haproxy
>>> roles and have since been digging into why HA has been failing. I am
>>> seeing the following:
>>>
>>> 1. pacemaker.service won't start due to Corosync not running.
>>> 2. Corosync seems to be failing to start due to not having the
>>> /etc/corosync/corosync.conf file as it does not exist.
>>> 3. The pcsd log file shows the following errors:
>>> ---
>>> Config files sync started
>>> Config files sync skipped, this host does not seem to be in a cluster of
>>> at least 2 nodes
>>> ---
>>> This is what originally led me to believe that it wouldn't work without
>>> a proper HA environment with 3 nodes.
>>>
>> That is not fatal; I see the same in job logs:
>> https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-1ctlr_1comp-featureset002-victoria/0bccbf6/logs/overcloud-controller-0/var/log/pcsd/pcsd.log.txt.gz
>>
>>> The overcloud deployment itself simply times out at "Wait for puppet host
>>> configuration to finish". I saw that step_1 seems to be where things are
>>> failing (due to pacemaker), and when running it manually, I am seeing the
>>> following messages:
>>>
>>> ---
>>> Debug: Executing: '/bin/systemctl is-enabled -- corosync'
>>> Debug: Executing: '/bin/systemctl is-enabled -- pacemaker'
>>> Debug: Executing: '/bin/systemctl is-active -- pcsd'
>>> Debug: Executing: '/bin/systemctl is-enabled -- pcsd'
>>> Debug: Exec[check-for-local-authentication](provider=posix): Executing
>>> check '/sbin/pcs status pcsd controller 2>&1 | grep 'Unable to
>>> authenticate''
>>> Debug: Executing: '/sbin/pcs status pcsd controller 2>&1 | grep 'Unable
>>> to authenticate''
>>> Debug:
>>> /Stage[main]/Pacemaker::Corosync/Exec[check-for-local-authentication]:
>>> '/bin/echo 'local pcsd auth failed, triggering a reauthentication'' won't
>>> be executed because of failed check 'onlyif'
>>> Debug:
>>> /Stage[main]/Pacemaker::Corosync/Exec[reauthenticate-across-all-nodes]:
>>> '/sbin/pcs host auth controller.cloud.hirstgroup.net -u hacluster -p
>>> oaJOCgGDxRfJ1dLK' won't be executed because of failed check 'refreshonly'
>>> Debug:
>>> /Stage[main]/Pacemaker::Corosync/Exec[auth-successful-across-all-nodes]:
>>> '/sbin/pcs host auth controller.cloud.hirstgroup.net -u hacluster -p
>>> oaJOCgGDxRfJ1dLK' won't be executed because of failed check 'refreshonly'
>>> Debug: Exec[wait-for-settle](provider=posix): Executing check '/sbin/pcs
>>> status | grep -q 'partition with quorum' > /dev/null 2>&1'
>>> Debug: Executing: '/sbin/pcs status | grep -q 'partition with quorum' >
>>> /dev/null 2>&1'
>>> Debug: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/unless:
>>> Error: error running crm_mon, is pacemaker running?
>>> Debug: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/unless:
>>> Could not connect to the CIB: Transport endpoint is not connected
>>> Debug: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/unless:
>>> crm_mon: Error: cluster is not available on this node
>>> Debug: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/returns:
>>> Exec try 1/360
>>> Debug: Exec[wait-for-settle](provider=posix): Executing '/sbin/pcs
>>> status | grep -q 'partition with quorum' > /dev/null 2>&1'
>>> Debug: Executing: '/sbin/pcs status | grep -q 'partition with quorum' >
>>> /dev/null 2>&1'
>>> Debug: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/returns:
>>> Sleeping for 10.0 seconds between tries
>>> Debug: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/returns:
>>> Exec try 2/360
>>> Debug: Exec[wait-for-settle](provider=posix): Executing '/sbin/pcs
>>> status | grep -q 'partition with quorum' > /dev/null 2>&1'
>>> Debug: Executing: '/sbin/pcs status | grep -q 'partition with quorum' >
>>> /dev/null 2>&1'
>>> Debug: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/returns:
>>> Sleeping for 10.0 seconds between tries
>>> ---
>>>
>>> How does the corosync.conf file get created? Is it related to the pcsd
>>> error saying that config sync can't proceed due to the cluster not having a
>>> minimum of two members?
>>>
>> No, that's not related, as per the pcsd.log shared above. AFAIK corosync.conf
>> is created by the pcs daemon itself by default when pcsd is used.
>> Did you try this on an already deployed overcloud? If it's just a test
>> setup, try an overcloud delete and a fresh install, as I am not sure how well
>> a redeployment with HA enabled/disabled works. Also share the full logs from
>> that run, as they will give some hints, and share which docs/steps you are
>> following so we can see if some customization is being done. For your current
>> failure, /var/log/pcsd/pcsd.log on the controller node should also have
>> some details about the failure.
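For reference, the corosync.conf that pcs generates during cluster setup looks roughly like this for a single node (a hand-written sketch; the cluster name, transport, and node values are illustrative, not necessarily what TripleO writes):

```shell
# Sketch of a corosync.conf as generated by `pcs cluster setup` for a
# one-node cluster; all values are illustrative. Written to a scratch
# file for inspection rather than /etc/corosync.
cat > /tmp/corosync.conf.example <<'EOF'
totem {
    version: 2
    cluster_name: tripleo_cluster
    transport: knet
}

nodelist {
    node {
        ring0_addr: controller.cloud.hirstgroup.net
        name: controller
        nodeid: 1
    }
}

quorum {
    provider: corosync_votequorum
}

logging {
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
}
EOF
echo "example written to /tmp/corosync.conf.example"
```

If the real file never appears under /etc/corosync, that usually means the cluster setup step was never reached or failed early, which would match pacemaker refusing to start.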
>>
>> Thanks,
>>> James H
>>>
>>> On Mon, 28 Dec 2020 at 11:26, YATIN KAREL <yatinkarel at gmail.com> wrote:
>>>
>>>> Hi James,
>>>>
>>>> On Sun, Dec 27, 2020 at 4:04 PM James Hirst <jdhirst12 at gmail.com>
>>>> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> I am attempting to set up a single-controller overcloud with TripleO
>>>>> Victoria. I keep running into an issue where pcsd fails to start during
>>>>> puppet step 1 on the controller. I attempted to solve this by simply
>>>>> removing the pacemaker service from my roles_data.yaml file, but then I
>>>>> ran into other errors requiring that the pacemaker service be enabled.
>>>>>
>>>> HA deployment has been enabled by default since the Ussuri release[1],
>>>> via [2], so pacemaker is deployed by default whether you set up 1 or
>>>> more controller nodes. Deploying without pacemaker is possible but
>>>> requires more changes (beyond removing pacemaker from the
>>>> roles_data.yaml file), such as adjusting the resource_registry to use
>>>> non-pacemaker resources. HA with 1 controller works fine, as we have
>>>> green jobs[3][4] running with both 1 controller and 3 controllers, so I
>>>> would recommend investigating why pcsd is failing for you and proceeding
>>>> with HA. But if you still want to go without pacemaker, you can try
>>>> adjusting the resource_registry to disable the pacemaker resources.
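A minimal sketch of such an environment file (the two service names below do exist in tripleo-heat-templates, but the full set of mappings you would need varies by release, so check the environments/ directory of your tripleo-heat-templates checkout for the authoritative list):

```shell
# Sketch of an environment file that unregisters the pacemaker services
# by mapping them to OS::Heat::None; treat the entries as examples, since
# a real non-HA deployment needs more remappings (HAProxy, MySQL, etc.).
cat > /tmp/no-pacemaker.yaml <<'EOF'
resource_registry:
  OS::TripleO::Services::Pacemaker: OS::Heat::None
  OS::TripleO::Services::PacemakerRemote: OS::Heat::None
EOF
echo "pass it with: openstack overcloud deploy ... -e /tmp/no-pacemaker.yaml"
```

The file would then be passed as an extra `-e` environment on the deploy command line.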
>>>>
>>>>
>>>>> I have ControllerCount set to 1, which according to the docs is all I
>>>>> need to do to tell tripleo that I'm not using HA.
>>>>>
>>>> The docs might be outdated if they say that just setting ControllerCount
>>>> to 1 is enough to deploy without pacemaker; you can report a bug or send
>>>> a patch to fix that, referencing the docs link you are using.
>>>>
>>>>
>>>> Thanks,
>>>>> James H
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> users at lists.rdoproject.org
>>>>> http://lists.rdoproject.org/mailman/listinfo/users
>>>>>
>>>>> To unsubscribe: users-unsubscribe at lists.rdoproject.org
>>>>>
>>>>
>>>>
>>>> [1]
>>>> https://docs.openstack.org/releasenotes/tripleo-heat-templates/ussuri.html#relnotes-12-3-0-stable-ussuri-other-notes
>>>> [2]
>>>> https://review.opendev.org/c/openstack/tripleo-heat-templates/+/359060
>>>> [3]
>>>> https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-1ctlr_1comp-featureset002-victoria/0bccbf6/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz
>>>> [4]
>>>> https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-victoria/a5dd4bc/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz
>>>>
>>>>
>>>> Thanks and regards
>>>> Yatin Karel
>>>>
>>>
>>
>> Thanks and Regards
>> Yatin Karel
>>
>

-- 
Yatin Karel