Hi Ruslanas / Donny,

We did try the suggested steps below on the affected VMs, but they don't help the VMs boot properly; we still get XFS corruption errors on the VM console.

VMs fail on random compute nodes. Suppose a VM fails on Compute-8: if we then try to create a new VM using the same image and flavor, every subsequent creation attempt on that compute with the same image fails until we delete the problematic VM and clean the image cache on that compute node.
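
For reference, the workaround looks roughly like this (a sketch; <vm-uuid> and <image-uuid> are placeholders, and Nova derives the _base filename from the SHA-1 hash of the Glance image UUID):

    openstack server delete <vm-uuid>
    # on the affected compute node, drop the cached base file for that image
    sudo rm -f /var/lib/nova/instances/_base/$(echo -n <image-uuid> | sha1sum | awk '{print $1}')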

We consistently notice that a VM provisions successfully on some compute nodes while failing on other, seemingly random ones. If the issue were with the image itself, we would expect XFS errors on every VM, but that is not the case.

We also tried provisioning around 6 VMs using stock CentOS and RHEL cloud images; with those images we don't see XFS-related errors on the VM console.

1.    Qemu-img info output

KO1A3D02O131006CM03:/var/lib/nova/instances/77e9d09f-ef12-4e65-aa40-256084597719$ LANG=C qemu-img info /var/lib/nova/instances/77e9d09f-ef12-4e65-aa40-256084597719/disk

image: /var/lib/nova/instances/77e9d09f-ef12-4e65-aa40-256084597719/disk

file format: qcow2

virtual size: 60G (64424509440 bytes)

disk size: 26M

cluster_size: 65536

backing file: /var/lib/nova/instances/_base/86692cd1e738b8df7cf1f951967c61e92222fc4c

Format specific information:

    compat: 1.1

    lazy refcounts: false

    refcount bits: 16

    corrupt: false

     

KO1A3D02O131006CM03:/var/lib/nova/instances/_base$ LANG=C qemu-img info /var/lib/nova/instances/_base/86692cd1e738b8df7cf1f951967c61e92222fc4c

image: /var/lib/nova/instances/_base/86692cd1e738b8df7cf1f951967c61e92222fc4c

file format: raw

virtual size: 20G (21474836480 bytes)

disk size: 5.1G

ericsson@KO1A3D02O131006CM03:/var/lib/nova/instances/_base$
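
Note that "corrupt: false" above only reflects the qcow2 container's own metadata flag; it says nothing about the guest filesystem inside the image. A deeper container-level check would be (a sketch; best run against a copy or with the instance stopped):

    LANG=C qemu-img check /var/lib/nova/instances/77e9d09f-ef12-4e65-aa40-256084597719/disk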

 

2.    Fsck output

 

root@KO1A3D02O131006CM03:~# fsck /var/lib/nova/instances/_base/86692cd1e738b8df7cf1f951967c61e92222fc4c

fsck from util-linux 2.27.1

e2fsck 1.42.13 (17-May-2015)

ext2fs_open2: Bad magic number in super-block

fsck.ext2: Superblock invalid, trying backup blocks...

fsck.ext2: Bad magic number in super-block while trying to open /var/lib/nova/instances/_base/86692cd1e738b8df7cf1f951967c61e92222fc4c

 

The superblock could not be read or does not describe a valid ext2/ext3/ext4

filesystem.  If the device is valid and it really contains an ext2/ext3/ext4

filesystem (and not swap or ufs or something else), then the superblock

is corrupt, and you might try running e2fsck with an alternate superblock:

    e2fsck -b 8193 <device>

or

    e2fsck -b 32768 <device>

 

root@KO1A3D02O131006CM03:~# fsck /var/lib/nova/instances/77e9d09f-ef12-4e65-aa40-256084597719/disk

fsck from util-linux 2.27.1

e2fsck 1.42.13 (17-May-2015)

ext2fs_open2: Bad magic number in super-block

fsck.ext2: Superblock invalid, trying backup blocks...

fsck.ext2: Bad magic number in super-block while trying to open /var/lib/nova/instances/77e9d09f-ef12-4e65-aa40-256084597719/disk

 

The superblock could not be read or does not describe a valid ext2/ext3/ext4

filesystem.  If the device is valid and it really contains an ext2/ext3/ext4

filesystem (and not swap or ufs or something else), then the superblock

is corrupt, and you might try running e2fsck with an alternate superblock:

    e2fsck -b 8193 <device>

or

    e2fsck -b 32768 <device>
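
For what it's worth, the "Bad magic number in super-block" errors above are expected rather than proof of corruption: both files are whole-disk images containing a partition table (and the instance disk is additionally qcow2), so there is no ext2/3/4 filesystem at offset 0 for e2fsck to find, and the guest filesystem is XFS anyway. To inspect the filesystem inside the image, one would first have to expose its partitions, e.g. via qemu-nbd (a sketch, run against a copy on the same host; assumes qemu-utils and the nbd kernel module are available and /dev/nbd0 is free):

    # copy first; note the copy still references the original backing file in _base
    cp /var/lib/nova/instances/77e9d09f-ef12-4e65-aa40-256084597719/disk /tmp/disk-copy.qcow2
    sudo modprobe nbd max_part=8
    sudo qemu-nbd --connect=/dev/nbd0 /tmp/disk-copy.qcow2
    sudo xfs_repair -n /dev/nbd0p1    # -n = no-modify dry run on the guest's vda1
    sudo qemu-nbd --disconnect=/dev/nbd0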

 

These are the only subcommands available in virsh; there is no domblkshow subcommand (see the screenshots below).

[inline screenshots: virsh help output]
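
(For reference, the libvirt subcommand that lists a domain's block devices is domblklist; "domblkshow" was presumably a typo. A sketch, with a hypothetical instance name:)

    virsh domblklist instance-00000a4f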





On Tue, Apr 21, 2020 at 3:35 PM Ruslanas Gžibovskis <ruslanas@lpic.lt> wrote:
Does it always fail on the same computes?
Once you have a failed instance, have you tried:
openstack server show <uuid>
find the instance name,
then log in to the compute node,
virsh domblkshow instance-######
then you will find that your instance disk is at /var/lib/nova/instances/<instance_uuid>/disk...
For example, from your log:
LANG=C qemu-img info /var/lib/nova/instances/dfa80e78-ee02-46e5-ba7a-0874fa37da56/disk
literally this command
to see which base file it uses, but it should be:
LANG=C qemu-img info /var/lib/nova/instances/_base/image_uuid

And do an fsck on those two :)) copy them first :)) and run fsck on the copy :)))

I would check whether you have space in that dir: df -h /var/lib/nova/instances/

Also, if it has space, check the base image which is used by that qcow2.

On Tue, Apr 21, 2020 at 4:44 PM Donny Davis <donny@fortnebula.com> wrote:


On Tue, Apr 21, 2020 at 4:26 AM Pradeep Antil <pradeepantil@gmail.com> wrote:
Hi Ruslanas / Openstack Gurus,

Please find the response inline below:

Is it the same image all the time? -- Yes, we are using the same image, but the image size is around 6 GB, and recently we have observed that VMs spawn successfully on some compute nodes while randomly failing on certain other compute hosts. We have also observed in the nova-compute logs that the image is resized during spawn; please refer to the snippet below:

2020-04-20 19:03:27.067 150243 DEBUG oslo_concurrency.processutils [req-1caea4a2-7cf0-4ba5-9dda-2bb90bb746d8 cbabd9368dc24fea84fd2e43935fddfa 975a7d3840a141b0a20a9dc60e3da6cd - default default] Running cmd (subprocess): qemu-img resize /var/lib/nova/instances/616b1a27-8b8c-486b-b8db-57c7b91a7402/disk 64424509440 execute /openstack/venvs/nova-17.1.12/lib/python2.7/site-packages/oslo_concurrency/processutils.py:372
2020-04-20 19:03:27.124 150243 DEBUG oslo_concurrency.processutils [req-1caea4a2-7cf0-4ba5-9dda-2bb90bb746d8 cbabd9368dc24fea84fd2e43935fddfa 975a7d3840a141b0a20a9dc60e3da6cd - default default] CMD "qemu-img resize /var/lib/nova/instances/616b1a27-8b8c-486b-b8db-57c7b91a7402/disk 64424509440" returned: 0 in 0.056s execute /openstack/venvs/nova-17.1.12/lib/python2.7/site-packages/oslo_concurrency/processutils.py:409
2020-04-20 19:03:27.160 150243 DEBUG nova.virt.disk.api [req-1caea4a2-7cf0-4ba5-9dda-2bb90bb746d8 cbabd9368dc24fea84fd2e43935fddfa 975a7d3840a141b0a20a9dc60e3da6cd - default default] Checking if we can resize image /var/lib/nova/instances/616b1a27-8b8c-486b-b8db-57c7b91a7402/disk. size=64424509440 can_resize_image /openstack/venvs/nova-17.1.12/lib/python2.7/site-packages/nova/virt/disk/api.py:216
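
For context, this resize appears to be normal Nova behavior rather than a fault: the qcow2 overlay is grown to the flavor's root disk size (64424509440 bytes = 60 GiB here, over the 20 GiB base), and the guest filesystem is expected to grow into it on first boot. A quick way to confirm the numbers line up (a sketch; <flavor-name> is a placeholder):

    LANG=C qemu-img info /var/lib/nova/instances/616b1a27-8b8c-486b-b8db-57c7b91a7402/disk
    openstack flavor show <flavor-name> -c disk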

Try to create that instance using Horizon or the CLI, whichever you favor more. Does it boot fine? -- Yes, we did try to create an instance using the same image, and sometimes the VMs spawn properly without any errors. But if we specify a VM count of, say, 6, on some compute nodes the VMs fail to spawn properly and we get XFS metadata corruption errors on the console.

I would also do a cleanup of instances (remove all) and remove all dependent base files from here: rm -rf /var/lib/nova/instances/_base/ -- We clear the image cache from all compute nodes before initiating stack creation; yes, we use that same rm command to clear the cache.
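
A further check that could narrow this down would be to compare the cached base file between a compute where the VM boots and one where it fails, and to verify that the image in Glance still matches the checksum recorded at upload time (a sketch; <image-uuid> is a placeholder):

    # on a working and on a failing compute node: the hashes should be identical
    sha256sum /var/lib/nova/instances/_base/86692cd1e738b8df7cf1f951967c61e92222fc4c

    # re-download the image and compare with the md5 checksum Glance recorded
    openstack image save --file /tmp/img.check <image-uuid>
    md5sum /tmp/img.check
    openstack image show <image-uuid> -c checksum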

I just want to let you know one more thing about my setup: the Glance filesystems on the controllers are mounted on an external NFS share with the following parameters (see the screenshots below):

[inline screenshots: NFS mount parameters for the Glance store]

Any pointers or suggestions to resolve this issue would be appreciated.



On Tue, Apr 21, 2020 at 11:37 AM Ruslanas Gžibovskis <ruslanas@lpic.lt> wrote:
Is it the same image all the time?

Try to create that instance using Horizon or the CLI, whichever you favor more. Does it boot fine?

I would also do a cleanup of instances (remove all) and remove all dependent base files from here: rm -rf /var/lib/nova/instances/_base/




On Thu, 16 Apr 2020 at 19:08, Pradeep Antil <pradeepantil@gmail.com> wrote:
Hi Techies,

I have the following RDO setup:
  • RDO 13
  • Base OS for controllers & computes is Ubuntu
  • Neutron with VXLAN + VLAN (for provider networks)
  • Cinder backend is Ceph
  • HugePages and CPU pinning for the VNF VMs
I am trying to deploy a stack that is supposed to create 18 VMs across the local disks of 11 compute nodes, but every time 3 to 4 of the 18 VMs do not spawn properly. On the console of these VMs I am getting the errors below.

Any ideas or suggestions on how to troubleshoot and resolve this issue?

[  100.681552] ffff8b37f8f86020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[  100.681553] ffff8b37f8f86030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[  100.681560] XFS (vda1): Metadata corruption detected at xfs_inode_buf_verify+0x79/0x100 [xfs], xfs_inode block 0x179b800
[  100.681561] XFS (vda1): Unmount and run xfs_repair
[  100.681561] XFS (vda1): First 64 bytes of corrupted metadata buffer:
[  100.681562] ffff8b37f8f86000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[  100.681562] ffff8b37f8f86010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[  100.681563] ffff8b37f8f86020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[  100.681564] ffff8b37f8f86030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[  100.681596] XFS (vda1): metadata I/O error: block 0x179b800 ("xfs_trans_read_buf_map") error 117 numblks 32
[  100.681599] XFS (vda1): xfs_imap_to_bp: xfs_trans_read_buf() returned error -117.
[   99.585766] cloud-init[2530]: Cloud-init v. 18.2 running 'init-local' at Thu, 16 Apr 2020 10:44:21 +0000. Up 99.55 seconds.
[  OK  ] Started oVirt Guest Agent.

[  101.086566] XFS (vda1): Metadata corruption detected at xfs_inode_buf_verify+0x79/0x100 [xfs], xfs_inode block 0x179b800
[  101.092093] XFS (vda1): Unmount and run xfs_repair
[  101.094660] XFS (vda1): First 64 bytes of corrupted metadata buffer:
[  101.097787] ffff8b37fef07000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[  101.105959] ffff8b37fef07010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................

--
Best Regards
Pradeep Kumar


--
Ruslanas Gžibovskis
+370 6030 7030


--
Best Regards
Pradeep Kumar
_______________________________________________
dev mailing list
dev@lists.rdoproject.org
http://lists.rdoproject.org/mailman/listinfo/dev

To unsubscribe: dev-unsubscribe@lists.rdoproject.org


Have you tried this with any other images? Maybe pick up a fresh image from https://cloud.centos.org/centos/7/images/ and run the same test again where you launch 6 instances.
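
For example (a sketch; the exact filename depends on what is current on that page):

    wget https://cloud.centos.org/centos/7/images/CentOS-7-x86_64-GenericCloud.qcow2
    openstack image create --disk-format qcow2 --container-format bare \
      --file CentOS-7-x86_64-GenericCloud.qcow2 centos7-fresh-test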

--
~/DonnyD
C: 805 814 6800
"No mission too difficult. No sacrifice too great. Duty First"


--
Best Regards
Pradeep Kumar