[rdo-users] Single Controller Environment in Victoria
YATIN KAREL
yatinkarel at gmail.com
Tue Dec 29 04:52:11 UTC 2020
Hi James,
On Tue, Dec 29, 2020 at 1:31 AM James Hirst <jdhirst12 at gmail.com> wrote:
> Hi Yatin,
>
> I figured out the cause of the issue: I had the hostnames configured
> wrong. I was using FQDNs instead of short hostnames, and therefore tripleo
> was appending the domain name to them incorrectly. I'm not sure why this
> broke pacemaker, but it works correctly now.
>
Thanks for the update. I don't know why this happened either; I will try to
sync with people who have more insight into this topic once they are back
after the holidays. If short hostnames are a requirement, then I feel some
pre-validation step could avoid such issues.
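A pre-validation step of this kind could be sketched roughly as below. This is
only an illustration, not an existing TripleO validation: the function names
find_fqdn_hostnames and check_short_hostnames are hypothetical, and the
"contains a dot" test is an assumed heuristic for spotting an FQDN.

```python
def find_fqdn_hostnames(hostnames):
    """Return the hostnames that contain a dot and so look like FQDNs.

    Assumption: a short hostname never contains a dot. A single trailing
    dot is ignored so that "controller." is still treated as short.
    """
    return [name for name in hostnames if "." in name.strip(".")]


def check_short_hostnames(hostnames):
    """Raise ValueError if any configured hostname looks like an FQDN."""
    offenders = find_fqdn_hostnames(hostnames)
    if offenders:
        raise ValueError(
            "Expected short hostnames, got FQDN-like values: "
            + ", ".join(offenders)
        )
```

Calling check_short_hostnames(["controller.cloud.hirstgroup.net"]) raises
ValueError before any deployment starts, while
check_short_hostnames(["controller"]) passes silently.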
> Thanks for your help!
> -James H
>
> On Mon, 28 Dec 2020 at 16:15, YATIN KAREL <yatinkarel at gmail.com> wrote:
>
>> Hi James,
>>
>> On Mon, Dec 28, 2020 at 6:31 PM James Hirst <jdhirst12 at gmail.com> wrote:
>>
>>> Hi Yatin,
>>>
>>> I just deleted the overcloud and re-ran the deployment and it got stuck
>>> at the same place, when applying puppet host configuration to the
>>> controller.
>>>
>>> I have compared my pcsd.log file to the one you linked to and it seems
>>> that mine has far less activity; the only requests I'm seeing received by
>>> it are auth requests like this:
>>> 200 POST /remote/auth (10.27.0.4) 45.73ms
>>>
>> I see "pcsd[74686]: WARNING:pcs.daemon:Caught signal: 15, shutting down",
>> which looks suspicious.
>> Also, it seems you shared logs from previous runs. Can you share the latest
>> sos report for the controller and overcloud-deploy.log? It might also be due
>> to you using deployed servers; can you clean those as well before retrying?
>>
>>> I have attached the logs here: ansible.log
>>> <https://drive.google.com/file/d/1eN15ZJzn_4GesrqTT2OcYpCigAK8rvnE/view?usp=sharing> controller
>>> log bundle
>>> <https://drive.google.com/file/d/1cTC3lPRf3wjYB5SW1dNqlp--_038GInj/view?usp=sharing> (note:
>>> ansible.log does just end without an error during the deployment as I
>>> didn't wait for it to retry 1100 times so I CTRL+C'd it.)
>>>
>>> Are there any other logs I should gather?
>>>
>>> Thanks,
>>> James H
>>>
>>> On Mon, 28 Dec 2020 at 13:29, YATIN KAREL <yatinkarel at gmail.com> wrote:
>>>
>>>> Hi James,
>>>>
>>>> On Mon, Dec 28, 2020 at 4:28 PM James Hirst <jdhirst12 at gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Yatin,
>>>>>
>>>>> Thank you for the confirmation! I re-enabled the pacemaker and haproxy
>>>>> roles and I have since been digging into why HA has been failing. I am
>>>>> seeing the following:
>>>>>
>>>>> 1. pacemaker.service won't start due to Corosync not running.
>>>>> 2. Corosync seems to be failing to start because the
>>>>> /etc/corosync/corosync.conf file does not exist.
>>>>> 3. The pcsd log file shows the following errors:
>>>>> ---
>>>>> Config files sync started
>>>>> Config files sync skipped, this host does not seem to be in a cluster
>>>>> of at least 2 nodes
>>>>> ---
>>>>> This is what originally led me to believe that it wouldn't work
>>>>> without a proper HA environment with 3 nodes.
>>>>>
>>>> That is not fatal; I see the same in the job logs:
>>>> https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-1ctlr_1comp-featureset002-victoria/0bccbf6/logs/overcloud-controller-0/var/log/pcsd/pcsd.log.txt.gz
>>>>
>>>>> The overcloud deployment itself simply times out at "Wait for puppet
>>>>> host configuration to finish". I saw that step_1 seems to be where things
>>>>> are failing (due to pacemaker), and when running it manually, I am seeing
>>>>> the following messages:
>>>>>
>>>>> ---
>>>>> Debug: Executing: '/bin/systemctl is-enabled -- corosync'
>>>>> Debug: Executing: '/bin/systemctl is-enabled -- pacemaker'
>>>>> Debug: Executing: '/bin/systemctl is-active -- pcsd'
>>>>> Debug: Executing: '/bin/systemctl is-enabled -- pcsd'
>>>>> Debug: Exec[check-for-local-authentication](provider=posix): Executing
>>>>> check '/sbin/pcs status pcsd controller 2>&1 | grep 'Unable to
>>>>> authenticate''
>>>>> Debug: Executing: '/sbin/pcs status pcsd controller 2>&1 | grep
>>>>> 'Unable to authenticate''
>>>>> Debug:
>>>>> /Stage[main]/Pacemaker::Corosync/Exec[check-for-local-authentication]:
>>>>> '/bin/echo 'local pcsd auth failed, triggering a reauthentication'' won't
>>>>> be executed because of failed check 'onlyif'
>>>>> Debug:
>>>>> /Stage[main]/Pacemaker::Corosync/Exec[reauthenticate-across-all-nodes]:
>>>>> '/sbin/pcs host auth controller.cloud.hirstgroup.net -u hacluster -p
>>>>> oaJOCgGDxRfJ1dLK' won't be executed because of failed check 'refreshonly'
>>>>> Debug:
>>>>> /Stage[main]/Pacemaker::Corosync/Exec[auth-successful-across-all-nodes]:
>>>>> '/sbin/pcs host auth controller.cloud.hirstgroup.net -u hacluster -p
>>>>> oaJOCgGDxRfJ1dLK' won't be executed because of failed check 'refreshonly'
>>>>> Debug: Exec[wait-for-settle](provider=posix): Executing check
>>>>> '/sbin/pcs status | grep -q 'partition with quorum' > /dev/null 2>&1'
>>>>> Debug: Executing: '/sbin/pcs status | grep -q 'partition with quorum'
>>>>> > /dev/null 2>&1'
>>>>> Debug: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/unless:
>>>>> Error: error running crm_mon, is pacemaker running?
>>>>> Debug: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/unless:
>>>>> Could not connect to the CIB: Transport endpoint is not connected
>>>>> Debug: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/unless:
>>>>> crm_mon: Error: cluster is not available on this node
>>>>> Debug: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/returns:
>>>>> Exec try 1/360
>>>>> Debug: Exec[wait-for-settle](provider=posix): Executing '/sbin/pcs
>>>>> status | grep -q 'partition with quorum' > /dev/null 2>&1'
>>>>> Debug: Executing: '/sbin/pcs status | grep -q 'partition with quorum'
>>>>> > /dev/null 2>&1'
>>>>> Debug: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/returns:
>>>>> Sleeping for 10.0 seconds between tries
>>>>> Debug: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/returns:
>>>>> Exec try 2/360
>>>>> Debug: Exec[wait-for-settle](provider=posix): Executing '/sbin/pcs
>>>>> status | grep -q 'partition with quorum' > /dev/null 2>&1'
>>>>> Debug: Executing: '/sbin/pcs status | grep -q 'partition with quorum'
>>>>> > /dev/null 2>&1'
>>>>> Debug: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/returns:
>>>>> Sleeping for 10.0 seconds between tries
>>>>> ---
>>>>>
>>>>> How does the corosync.conf file get created? Is it related to the pcsd
>>>>> error saying that config sync can't proceed due to the cluster not having a
>>>>> minimum of two members?
>>>>>
>>>> No, that's not related, as per the pcsd.log shared above. AFAIK
>>>> corosync.conf is created by the pcs daemon itself by default when pcsd is
>>>> used. Did you try that on an already deployed overcloud? If that's just a
>>>> test setup, try an overcloud delete and a fresh install, as I am not sure
>>>> how well a redeployment with HA enabled/disabled works. Also share the full
>>>> logs, as they will give some hints, and share which docs/steps you are
>>>> following so we can see if some customization is being done. Also, for your
>>>> current failure, /var/log/pcsd/pcsd.log.txt.gz on the controller node should
>>>> have some details about the failure.
>>>>
>>>>> Thanks,
>>>>> James H
>>>>>
>>>>> On Mon, 28 Dec 2020 at 11:26, YATIN KAREL <yatinkarel at gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi James,
>>>>>>
>>>>>> On Sun, Dec 27, 2020 at 4:04 PM James Hirst <jdhirst12 at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> I am attempting to set up a single controller overcloud with tripleo
>>>>>>> Victoria. I keep running into issues where pcsd is attempting to be started
>>>>>>> in puppet step 1 on the controller and it fails. I attempted to solve this
>>>>>>> by simply removing the pacemaker service from my roles_data.yaml file, but
>>>>>>> then I ran into other errors requiring that the pacemaker service be
>>>>>>> enabled.
>>>>>>>
>>>>>> HA deployment has been enabled by default since the Ussuri release[1]
>>>>>> with [2]. So pacemaker is deployed by default whether you set up one or
>>>>>> more controller nodes. Deploying without pacemaker is possible but would
>>>>>> need more changes (apart from removing pacemaker from the roles_data.yaml
>>>>>> file), like adjusting resource_registry to use non-pacemaker resources.
>>>>>> HA with 1 controller works fine, as we have green jobs[3][4] running with
>>>>>> both 1 controller and 3 controllers, so I would recommend looking into why
>>>>>> pcsd is failing for you and proceeding with HA. But if you still want to
>>>>>> go without pacemaker, you can try adjusting resource_registry to
>>>>>> enable/disable pacemaker resources.
>>>>>>
>>>>>>
>>>>>>> I have ControllerCount set to 1, which according to the docs is all
>>>>>>> I need to do to tell tripleo that I'm not using HA.
>>>>>>>
>>>>>> The docs might be outdated if they say that just setting ControllerCount
>>>>>> to 1 is enough to deploy without pacemaker. You can report a bug or send
>>>>>> a patch to fix that, with a link to the docs you are using.
>>>>>>
>>>>>>
>>>>>>> Thanks,
>>>>>>> James H
>>>>>>>
>>>>>>
>>>>>>
>>>>>> [1]
>>>>>> https://docs.openstack.org/releasenotes/tripleo-heat-templates/ussuri.html#relnotes-12-3-0-stable-ussuri-other-notes
>>>>>> [2]
>>>>>> https://review.opendev.org/c/openstack/tripleo-heat-templates/+/359060
>>>>>> [3]
>>>>>> https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-1ctlr_1comp-featureset002-victoria/0bccbf6/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz
>>>>>> [4]
>>>>>> https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-victoria/a5dd4bc/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz
>>>>>>
>>>>>>
>>>>>> Thanks and regards
>>>>>> Yatin Karel
>>>>>>
>>>>>
>>>>
>>>> Thanks and Regards
>>>> Yatin Karel
>>>>
>>>
>>
>> --
>> Yatin Karel
>>
>
Thanks and Regards
Yatin Karel