On 01/28/2016 07:12 AM, Mohammed Arafa wrote:
Hi all,
I am attempting to build a basic 2-node overcloud. My previous emails
described the problems I encountered.
What I have:
- 1 VM called rdo with the undercloud AND overcloud. This one has not
been updated since November, and I keep restoring snapshots to that date.
- a 2nd VM called rdo2, fully updated; the overcloud fails to deploy to
a specific physical node.
Observations (unscientific!):
The 2 physical nodes are both good. I tested by redeploying on rdo again
and again. I even swapped their order in instackenv.json and redeployed
successfully from the instackenv.json step.
However, I have one particular machine that refuses to deploy. It
doesn't matter in what order: if it is the controller, it fails; if it
is the compute, it fails.
I am using the same flavor on both rdo VMs, but again, I believe I have
ruled out that variable.
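For reference, something like the following should confirm whether the
node properties Ironic registered actually satisfy the flavor (this
assumes the stock ironic and nova CLIs on the undercloud and "baremetal"
as the deployment flavor name; adjust to your setup):

    # list enrolled nodes and their provision state
    ironic node-list

    # properties Ironic registered for the suspect node (placeholder UUID)
    ironic node-show <node-uuid> | grep -A4 properties

    # cpu/ram/disk the deployment flavor demands
    nova flavor-show baremetal

If the node's cpus/memory_mb/local_gb come in under what the flavor
asks for, the scheduler will quietly reject that node every time.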
How far did I reach?
Over the past few days I have opened the console and watched this
particular machine PXE boot, get an IP, reboot, change its hostname to
reflect the IP, reboot to localhost.localdomain (?), and then power off.
I am not saying I sat down and watched it for the entire 209 minutes,
but I have observed it unscientifically.
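One more data point worth grabbing when it powers off is what Ironic
thinks the node's state is, and whether it recorded an error (the node
UUID below is a placeholder):

    # provision_state should be progressing toward "active";
    # last_error is often populated when a deploy is abandoned
    ironic node-show <node-uuid> | grep -E "provision_state|last_error"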
Last error:
Deploying templates in the directory
/usr/share/openstack-tripleo-heat-templates
Stack failed with status: Resource CREATE failed: resources.Controller:
ResourceInError: resources[0].resources.Controller: Went to status ERROR
due to "Message: No valid host was found. There are not enough hosts
available., Code: 500"
Heat Stack create failed.
real 209m19.252s
user 0m21.695s
sys 0m2.402s
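For context, that 500 comes from the Nova scheduler: no enrolled node
passed its filters. One quick sanity check is whether Nova can even see
both Ironic nodes, since each one shows up as a "hypervisor" in an
Ironic deployment (stock novaclient assumed):

    # both physical nodes should be listed here
    nova hypervisor-list

    # aggregate cpu/ram/disk Nova believes it can schedule onto
    nova hypervisor-stats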
What am I looking for?
What do I look for in the logs? My logs are huge; they don't get
rotated for some reason.
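On the rotation point: the services rely on logrotate, so a missing or
broken drop-in would explain it. A minimal sketch of one (the path and
thresholds are illustrative, not the stock TripleO config):

    # /etc/logrotate.d/nova-local -- illustrative only
    /var/log/nova/*.log {
        daily
        rotate 7
        compress
        missingok
        notifempty
    }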
I would like to know the reason this particular physical machine refuses
to deploy, so I can fix it. I believe I have eliminated all the
variables except the machine itself, and it has me puzzled and
frustrated, as I need to move on to the next stage, network isolation.
Any ideas?
The point at which it is failing seems to be before the node is fully
deployed; that is, before we start doing Puppet applies on it to
configure it.
This is a helpful distinction, because we can limit the search space for
possible issues. This is almost certainly a Nova/Ironic issue. The best
log to look at for Nova in this case would be the scheduler log at
/var/log/nova/nova-scheduler.log, while the best log to look at for
Ironic would be the conductor log at /var/log/ironic/ironic-conductor.log.
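A reasonable first pass over those two logs (the grep patterns below are
just common markers, not exhaustive):

    # scheduler: find the filter run that rejected every host
    sudo grep -B2 "returned 0 hosts" /var/log/nova/nova-scheduler.log | tail -n 20

    # conductor: deploy errors against the failing node
    sudo grep -iE "error|failed" /var/log/ironic/ironic-conductor.log | tail -n 20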
If your logs are very large, it may be better to delete them and
reproduce the issue in order to further limit the search space. Note
that the issue is most likely reproduced within the first 30 minutes of
that test, so you won't need to wait for the full 200+ minutes, which I
am guessing just hits the deploy timeout.
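If you do clear the logs before reproducing, truncating rather than
deleting keeps the daemons' open file handles valid, so nothing ends up
written to a deleted inode:

    # empty the logs in place
    sudo truncate -s 0 /var/log/nova/nova-scheduler.log
    sudo truncate -s 0 /var/log/ironic/ironic-conductor.log

    # then watch both while the deploy runs
    sudo tail -f /var/log/nova/nova-scheduler.log \
                 /var/log/ironic/ironic-conductor.log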
Thanks
_______________________________________________
Rdo-list mailing list
rdo-list@redhat.com
https://www.redhat.com/mailman/listinfo/rdo-list
To unsubscribe: rdo-list-unsubscribe@redhat.com