Hi James,
On Mon, Dec 28, 2020 at 6:31 PM James Hirst <jdhirst12(a)gmail.com> wrote:
Hi Yatin,
I just deleted the overcloud and re-ran the deployment, and it got stuck at
the same place: applying the puppet host configuration on the controller.
I have compared my pcsd.log file to the one you linked to, and it seems
that mine has far less activity; the only requests it receives are auth
requests like this:
200 POST /remote/auth (10.27.0.4) 45.73ms
I see "pcsd[74686]: WARNING:pcs.daemon:Caught signal: 15, shutting down",
which looks suspicious.
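
Signal 15 is a SIGTERM, so something asked pcsd to stop. To see what was
going on around that moment, something like this on the controller might
help (assuming the systemd journal is intact):

---
sudo journalctl -u pcsd --no-pager | tail -n 50
sudo systemctl status pcsd corosync pacemaker
---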
Also, it seems you shared logs from a previous run; can you share the
latest sos report for the controller and overcloud-deploy.log? It might
also be that you are using pre-deployed servers; can you clean those as
well before retrying?
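
In case it helps, generating one is roughly this (assuming CentOS/RHEL 8
on the controller, where the command is still sosreport):

---
sudo dnf install -y sos
sudo sosreport --batch    # the tarball ends up under /var/tmp/
---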
I have attached the logs here:
ansible.log <https://drive.google.com/file/d/1eN15ZJzn_4GesrqTT2OcYpCigAK8rvnE/view?us...>
controller log bundle <https://drive.google.com/file/d/1cTC3lPRf3wjYB5SW1dNqlp--_038GInj/view?us...>
(Note: ansible.log just ends without an error during the deployment; I
didn't wait for it to retry 1100 times, so I Ctrl+C'd it.)
Are there any other logs I should gather?
Thanks,
James H
On Mon, 28 Dec 2020 at 13:29, YATIN KAREL <yatinkarel(a)gmail.com> wrote:
> Hi James,
>
> On Mon, Dec 28, 2020 at 4:28 PM James Hirst <jdhirst12(a)gmail.com> wrote:
>
>> Hi Yatin,
>>
>> Thank you for the confirmation! I re-enabled the pacemaker and haproxy
>> roles and have since been digging into why HA has been failing; I am
>> seeing the following:
>>
>> 1. pacemaker.service won't start due to Corosync not running.
>> 2. Corosync fails to start because the /etc/corosync/corosync.conf file
>> does not exist.
>> 3. The pcsd log file shows the following errors:
>> ---
>> Config files sync started
>> Config files sync skipped, this host does not seem to be in a cluster of
>> at least 2 nodes
>> ---
>> This is what originally led me to believe that it wouldn't work without
>> a proper HA environment with 3 nodes.
>>
> That is not fatal; I see the same in job logs:
> https://logserver.rdoproject.org/openstack-periodic-integration-stable1/o...
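>
> If you want to rule out a local pcsd auth problem while it is stuck, a
> rough manual check on the controller (hostname taken from your puppet
> output; pcs will prompt for the hacluster password):
>
> ---
> sudo pcs status pcsd
> sudo pcs host auth controller.cloud.hirstgroup.net -u hacluster
> ls -l /etc/corosync/corosync.conf /var/lib/pcsd/
> ---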
>
>> The overcloud deployment itself simply times out at "Wait for puppet host
>> configuration to finish". I saw that step_1 seems to be where things are
>> failing (due to pacemaker), and when running it manually, I am seeing the
>> following messages:
>>
>> ---
>> Debug: Executing: '/bin/systemctl is-enabled -- corosync'
>> Debug: Executing: '/bin/systemctl is-enabled -- pacemaker'
>> Debug: Executing: '/bin/systemctl is-active -- pcsd'
>> Debug: Executing: '/bin/systemctl is-enabled -- pcsd'
>> Debug: Exec[check-for-local-authentication](provider=posix): Executing
>> check '/sbin/pcs status pcsd controller 2>&1 | grep 'Unable to
>> authenticate''
>> Debug: Executing: '/sbin/pcs status pcsd controller 2>&1 | grep 'Unable
>> to authenticate''
>> Debug:
>> /Stage[main]/Pacemaker::Corosync/Exec[check-for-local-authentication]:
>> '/bin/echo 'local pcsd auth failed, triggering a reauthentication'' won't
>> be executed because of failed check 'onlyif'
>> Debug:
>> /Stage[main]/Pacemaker::Corosync/Exec[reauthenticate-across-all-nodes]:
>> '/sbin/pcs host auth controller.cloud.hirstgroup.net -u hacluster -p
>> oaJOCgGDxRfJ1dLK' won't be executed because of failed check 'refreshonly'
>> Debug:
>> /Stage[main]/Pacemaker::Corosync/Exec[auth-successful-across-all-nodes]:
>> '/sbin/pcs host auth controller.cloud.hirstgroup.net -u hacluster -p
>> oaJOCgGDxRfJ1dLK' won't be executed because of failed check 'refreshonly'
>> Debug: Exec[wait-for-settle](provider=posix): Executing check '/sbin/pcs
>> status | grep -q 'partition with quorum' > /dev/null 2>&1'
>> Debug: Executing: '/sbin/pcs status | grep -q 'partition with quorum' > /dev/null 2>&1'
>> Debug: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/unless:
>> Error: error running crm_mon, is pacemaker running?
>> Debug: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/unless:
>> Could not connect to the CIB: Transport endpoint is not connected
>> Debug: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/unless:
>> crm_mon: Error: cluster is not available on this node
>> Debug: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/returns:
>> Exec try 1/360
>> Debug: Exec[wait-for-settle](provider=posix): Executing '/sbin/pcs
>> status | grep -q 'partition with quorum' > /dev/null 2>&1'
>> Debug: Executing: '/sbin/pcs status | grep -q 'partition with quorum' > /dev/null 2>&1'
>> Debug: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/returns:
>> Sleeping for 10.0 seconds between tries
>> Debug: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/returns:
>> Exec try 2/360
>> Debug: Exec[wait-for-settle](provider=posix): Executing '/sbin/pcs
>> status | grep -q 'partition with quorum' > /dev/null 2>&1'
>> Debug: Executing: '/sbin/pcs status | grep -q 'partition with quorum' > /dev/null 2>&1'
>> Debug: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/returns:
>> Sleeping for 10.0 seconds between tries
>> ---
>>
>> How does the corosync.conf file get created? Is it related to the pcsd
>> error saying that config sync can't proceed due to the cluster not having a
>> minimum of two members?
>>
> No, that's not related, as per the pcsd.log shared above. AFAIK
> corosync.conf is created by the pcs daemon itself by default when pcsd is
> used.
>
> Did you try that on an already deployed overcloud? If it's just a test
> setup, try an overcloud delete and a fresh install, as I am not sure how
> well a redeployment with HA enabled/disabled works. Also share the full
> logs with this, as that will give some hints, and share what docs/steps
> you are following so we can see if any customization is being done. On
> your current failure, /var/log/pcsd/pcsd.log.txt.gz on the controller
> node should also have some details about the failure.
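>
> For reference, the bootstrap that the puppet module drives pcs through
> boils down to roughly this (a sketch only, not something to run by hand
> while a deployment is still converging; hostname and password are taken
> from your debug output, the cluster name is illustrative):
>
> ---
> # authenticate pcsd on the node, then create the cluster;
> # the setup step is what generates /etc/corosync/corosync.conf
> sudo pcs host auth controller.cloud.hirstgroup.net -u hacluster -p oaJOCgGDxRfJ1dLK
> sudo pcs cluster setup tripleo_cluster controller.cloud.hirstgroup.net
> sudo pcs cluster start --all
>
> # and for the clean retry (assuming the default stack name):
> openstack overcloud delete overcloud --yes
> ---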
>
>> Thanks,
>> James H
>>
>> On Mon, 28 Dec 2020 at 11:26, YATIN KAREL <yatinkarel(a)gmail.com> wrote:
>>
>>> Hi James,
>>>
>>> On Sun, Dec 27, 2020 at 4:04 PM James Hirst <jdhirst12(a)gmail.com>
>>> wrote:
>>>
>>>> Hi All,
>>>>
>>>> I am attempting to set up a single-controller overcloud with TripleO
>>>> Victoria. I keep running into issues where pcsd is started during
>>>> puppet step 1 on the controller and fails to come up. I attempted to
>>>> solve this by simply removing the pacemaker service from my
>>>> roles_data.yaml file, but then I ran into other errors requiring that
>>>> the pacemaker service be enabled.
>>>>
>>> HA deployment has been enabled by default since the Ussuri release[1]
>>> with [2], so pacemaker will be deployed by default whether you set up 1
>>> or more controller nodes. Deployment without pacemaker is possible, but
>>> it needs more changes (apart from removing pacemaker from the
>>> roles_data.yaml file), like adjusting the resource_registry to use
>>> non-pacemaker resources. HA with 1 controller works fine; we have green
>>> jobs[3][4] running with both 1 and 3 controllers, so I would recommend
>>> looking into why pcsd is failing for you and proceeding with HA. But if
>>> you still want to go without pacemaker, you can try adjusting the
>>> resource_registry to disable the pacemaker resources; a rough sketch
>>> follows.
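>>>
>>> A rough sketch of such an environment file (untested; the two Pacemaker
>>> service names below exist in tripleo-heat-templates, but the full set of
>>> HA services that need remapping depends on your deployment):
>>>
>>> ---
>>> # no-pacemaker.yaml -- illustrative only
>>> cat > no-pacemaker.yaml <<'EOF'
>>> resource_registry:
>>>   OS::TripleO::Services::Pacemaker: OS::Heat::None
>>>   OS::TripleO::Services::PacemakerRemote: OS::Heat::None
>>>   # mysql, haproxy, etc. would also need remapping from their
>>>   # pacemaker template variants to the plain container ones
>>> EOF
>>> ---
>>>
>>> and pass it to the deploy command with -e no-pacemaker.yaml.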
>>>
>>>
>>>> I have ControllerCount set to 1, which according to the docs is all I
>>>> need to do to tell tripleo that I'm not using HA.
>>>>
>>> The docs might be outdated if they say that just setting ControllerCount
>>> to 1 is enough to deploy without pacemaker; you can report a bug or send
>>> a patch to fix that, with the link to the docs you are using.
>>>
>>>
>>>> Thanks,
>>>> James H
>>>> _______________________________________________
>>>> users mailing list
>>>> users(a)lists.rdoproject.org
>>>> http://lists.rdoproject.org/mailman/listinfo/users
>>>>
>>>> To unsubscribe: users-unsubscribe(a)lists.rdoproject.org
>>>>
>>>
>>>
>>> [1] https://docs.openstack.org/releasenotes/tripleo-heat-templates/ussuri.htm...
>>> [2] https://review.opendev.org/c/openstack/tripleo-heat-templates/+/359060
>>> [3] https://logserver.rdoproject.org/openstack-periodic-integration-stable1/o...
>>> [4] https://logserver.rdoproject.org/openstack-periodic-integration-stable1/o...
>>>
>>>
>>> Thanks and regards
>>> Yatin Karel
>>>
>>
>
> Thanks and Regards
> Yatin Karel
>