[rdo-dev] RDO | OpenStack VMs | XFS Metadata Corruption

Pradeep Antil pradeepantil at gmail.com
Tue Apr 21 16:09:02 UTC 2020


Hi Ruslanas / Donny,

We tried the suggested steps below on the affected VMs, but they do not help
the VMs boot properly; we still get XFS corruption errors on the VMs' consoles.

VMs fail on random compute nodes. Suppose a VM fails on Compute-8: if we then
try to recreate a VM with the same image and flavor, every subsequent creation
attempt on that compute node with that image also fails, until we delete the
problematic VM and clear the image cache on that compute node.

We consistently see VMs provision successfully on some compute nodes while
failing on other, seemingly random ones. If the issue were with the image
itself, we would expect XFS errors on every VM, but that is not the case.

We also tried provisioning around 6 VMs using the CentOS and RHEL cloud
images; with those images we do not see XFS-related errors on the VMs' consoles.
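
For reference, that comparison test was run roughly like this (a sketch; the
image, flavor and network names are placeholders for whatever is registered in
our glance/nova):

openstack server create --image CentOS-7-x86_64-GenericCloud \
  --flavor <flavor> --network <network> --min 6 --max 6 centos-xfs-test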

1.    Qemu-img info output

KO1A3D02O131006CM03:/var/lib/nova/instances/77e9d09f-ef12-4e65-aa40-256084597719$
LANG=C qemu-img info /var/lib/nova/instances/77e9d09f-ef12-4e65-aa40-256084597719/disk

image: /var/lib/nova/instances/77e9d09f-ef12-4e65-aa40-256084597719/disk
file format: qcow2
virtual size: 60G (64424509440 bytes)
disk size: 26M
cluster_size: 65536
backing file: /var/lib/nova/instances/_base/86692cd1e738b8df7cf1f951967c61e92222fc4c
Format specific information:
    compat: 1.1
    lazy refcounts: false
    refcount bits: 16
    corrupt: false

KO1A3D02O131006CM03:/var/lib/nova/instances/_base$ LANG=C qemu-img info
/var/lib/nova/instances/_base/86692cd1e738b8df7cf1f951967c61e92222fc4c

image: /var/lib/nova/instances/_base/86692cd1e738b8df7cf1f951967c61e92222fc4c
file format: raw
virtual size: 20G (21474836480 bytes)
disk size: 5.1G

ericsson at KO1A3D02O131006CM03:/var/lib/nova/instances/_base$
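
One way to test the "image vs. compute node" hypothesis (a sketch, not
something we have run yet) is to checksum the cached base file on a node where
the VM boots and on one where it fails, and to sanity-check the qcow2 overlay:

# run on both a "good" and a "bad" compute node;
# differing sums would point at the cache, not the image
sha256sum /var/lib/nova/instances/_base/86692cd1e738b8df7cf1f951967c61e92222fc4c

# read-only consistency check of the per-instance qcow2 overlay
qemu-img check /var/lib/nova/instances/77e9d09f-ef12-4e65-aa40-256084597719/disk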



2.    Fsck output



root at KO1A3D02O131006CM03:~# fsck /var/lib/nova/instances/_base/86692cd1e738b8df7cf1f951967c61e92222fc4c
fsck from util-linux 2.27.1
e2fsck 1.42.13 (17-May-2015)
ext2fs_open2: Bad magic number in super-block
fsck.ext2: Superblock invalid, trying backup blocks...
fsck.ext2: Bad magic number in super-block while trying to open
/var/lib/nova/instances/_base/86692cd1e738b8df7cf1f951967c61e92222fc4c

The superblock could not be read or does not describe a valid ext2/ext3/ext4
filesystem.  If the device is valid and it really contains an ext2/ext3/ext4
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
    e2fsck -b 8193 <device>
or
    e2fsck -b 32768 <device>

root at KO1A3D02O131006CM03:~# fsck /var/lib/nova/instances/77e9d09f-ef12-4e65-aa40-256084597719/disk
fsck from util-linux 2.27.1
e2fsck 1.42.13 (17-May-2015)
ext2fs_open2: Bad magic number in super-block
fsck.ext2: Superblock invalid, trying backup blocks...
fsck.ext2: Bad magic number in super-block while trying to open
/var/lib/nova/instances/77e9d09f-ef12-4e65-aa40-256084597719/disk

The superblock could not be read or does not describe a valid ext2/ext3/ext4
filesystem.  If the device is valid and it really contains an ext2/ext3/ext4
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
    e2fsck -b 8193 <device>
or
    e2fsck -b 32768 <device>
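
As a side note, the "Bad magic number in super-block" output above is expected
even on a healthy image: fsck is being pointed at whole disk image files (a
qcow2 overlay and a raw image containing a partition table), not at a
filesystem, and the guest filesystem is XFS rather than ext anyway. To check
the guest filesystem itself we would need to expose the partitions first,
roughly like this (a sketch, working on a copy as Ruslanas suggested; device
and partition names are assumptions):

cp /var/lib/nova/instances/77e9d09f-ef12-4e65-aa40-256084597719/disk /tmp/disk-copy.qcow2
modprobe nbd max_part=8                            # load the network block device module
qemu-nbd --connect=/dev/nbd0 /tmp/disk-copy.qcow2  # expose the qcow2 copy as /dev/nbd0
lsblk /dev/nbd0                                    # locate the guest partition, e.g. /dev/nbd0p1
xfs_repair -n /dev/nbd0p1                          # -n: check only, do not modify
qemu-nbd --disconnect /dev/nbd0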



These are the only block-related options available in our virsh; there is no
domblkshow command.
[image: image.png]

[image: image.png]
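
For what it is worth, the virsh subcommand that lists a domain's block devices
seems to be domblklist rather than domblkshow; a minimal example (the instance
name below is hypothetical):

virsh domblklist instance-0000abcd      # maps target device (vda) to the source file
virsh domblkinfo instance-0000abcd vda  # capacity/allocation for one block device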





On Tue, Apr 21, 2020 at 3:35 PM Ruslanas Gžibovskis <ruslanas at lpic.lt>
wrote:
Does it always fail on the same computes?
Once you have a failed instance, have you tried:
openstack server show <uuid>
to find the instance name, then log in to the compute node and run
virsh domblkshow instance-######
Then you will find that your instance disk is in
/var/lib/nova/instances/(instance_uuid)/disk...
for example, from your log:
LANG=C qemu-img info /var/lib/nova/instances/dfa80e78-ee02-46e5-ba7a-
0874fa37da56/disk
literally this command
to see which base file it uses, but it should be:
LANG=C qemu-img info /var/lib/nova/instances/_base/image_uuid

and do fsck on those two :)) copy them first :)) and run fsck on the
copy :)))

I would check whether you have space on that dir: df -h /var/lib/nova/instances/

Also, if it has space, check the base image which is used by that qcow2.

On Tue, Apr 21, 2020 at 4:44 PM Donny Davis <donny at fortnebula.com> wrote:

>
>
> On Tue, Apr 21, 2020 at 4:26 AM Pradeep Antil <pradeepantil at gmail.com>
> wrote:
>
>> Hi Ruslanas / Openstack Gurus,
>>
>> Please find the response inline below:
>>
>> *is it the same image all the time?* -- Yes, we are using the same image;
>> the image size is around 6GB. Recently we have observed that VMs are
>> spawned successfully on some compute nodes but randomly fail on certain
>> compute hosts. We also see in the nova-compute logs that the image is
>> resized; please refer to the snippet below,
>>
>> 2020-04-20 19:03:27.067 150243 DEBUG oslo_concurrency.processutils
>> [req-1caea4a2-7cf0-4ba5-9dda-2bb90bb746d8 cbabd9368dc24fea84fd2e43935fddfa
>> 975a7d3840a141b0a20a9dc60e3da6cd - default default] Running cmd
>> (subprocess): qemu-img resize
>> /var/lib/nova/instances/616b1a27-8b8c-486b-b8db-57c7b91a7402/disk
>> 64424509440 execute
>> /openstack/venvs/nova-17.1.12/lib/python2.7/site-packages/oslo_concurrency/processutils.py:372
>> 2020-04-20 19:03:27.124 150243 DEBUG oslo_concurrency.processutils
>> [req-1caea4a2-7cf0-4ba5-9dda-2bb90bb746d8 cbabd9368dc24fea84fd2e43935fddfa
>> 975a7d3840a141b0a20a9dc60e3da6cd - default default] CMD "qemu-img resize
>> /var/lib/nova/instances/616b1a27-8b8c-486b-b8db-57c7b91a7402/disk
>> 64424509440" returned: 0 in 0.056s execute
>> /openstack/venvs/nova-17.1.12/lib/python2.7/site-packages/oslo_concurrency/processutils.py:409
>> 2020-04-20 19:03:27.160 150243 DEBUG nova.virt.disk.api
>> [req-1caea4a2-7cf0-4ba5-9dda-2bb90bb746d8 cbabd9368dc24fea84fd2e43935fddfa
>> 975a7d3840a141b0a20a9dc60e3da6cd - default default] Checking if we can
>> resize image
>> /var/lib/nova/instances/616b1a27-8b8c-486b-b8db-57c7b91a7402/disk.
>> size=64424509440 can_resize_image
>> /openstack/venvs/nova-17.1.12/lib/python2.7/site-packages/nova/virt/disk/api.py:216
>>
>> *try to create that instance using horizon or cli, whichever you favor
>> more.  does it boot good?* - Yes, we did try to create an instance using
>> the same image, and sometimes the VMs spawn properly without any errors.
>> If we specify a VM count of, say, 6, then on some compute nodes the VMs
>> fail to spawn properly and we get XFS metadata corruption errors on the
>> console.
>>
>> *I would also, do cleanup of instances (remove all), and remove all
>> dependent base files from here.  rm -rf /var/lib/nova/instances/_base/ *--
>> We clear the image cache from all compute nodes before initiating stack
>> creation; yes, we used that same rm command to clear the cache.
>>
>> One more thing about my setup: the glance filesystem on the controllers
>> is mounted on an external NFS share with the following parameters,
>>
>> [image: image.png]
>> [image: image.png]
>>
>> Any pointers or suggestions to resolve this issue would be appreciated.
>>
>>
>>
>> On Tue, Apr 21, 2020 at 11:37 AM Ruslanas Gžibovskis <ruslanas at lpic.lt>
>> wrote:
>>
>>> is it the same image all the time?
>>>
>>> try to create that instance using horizon or cli, whichever you favor
>>> more.  does it boot good?
>>>
>>> I would also, do cleanup of instances (remove all), and remove all
>>> dependent base files from here.  rm -rf /var/lib/nova/instances/_base/
>>>
>>>
>>>
>>>
>>> On Thu, 16 Apr 2020 at 19:08, Pradeep Antil <pradeepantil at gmail.com>
>>> wrote:
>>>
>>>> Hi Techies,
>>>>
>>>> I have below RDO setup,
>>>>
>>>>    - RDO 13
>>>>    - Base OS for Controllers & Compute is Ubuntu
>>>>    - Neutron with vxlan + VLAN (for provider N/W)
>>>>    - Cinder backend is Ceph
>>>>    - HugePages and CPU Pinning for VNF's VMs
>>>>
>>>> I am trying to deploy a stack which is supposed to create 18 VMs across
>>>> 11 compute nodes' internal disks, but every time 3 to 4 VMs out of the 18
>>>> do not spawn properly. On the consoles of these VMs I am getting the
>>>> errors below,
>>>>
>>>> Any ideas or suggestions on how to troubleshoot and resolve this
>>>> issue?
>>>>
>>>> [  100.681552] ffff8b37f8f86020: 00 00 00 00 00 00 00 00 00 00 00 00 00
>>>> 00 00 00  ................
>>>> [  100.681553] ffff8b37f8f86030: 00 00 00 00 00 00 00 00 00 00 00 00 00
>>>> 00 00 00  ................
>>>> [  100.681560] XFS (vda1): Metadata corruption detected at
>>>> xfs_inode_buf_verify+0x79/0x100 [xfs], xfs_inode block 0x179b800
>>>> [  100.681561] XFS (vda1): Unmount and run xfs_repair
>>>> [  100.681561] XFS (vda1): First 64 bytes of corrupted metadata buffer:
>>>> [  100.681562] ffff8b37f8f86000: 00 00 00 00 00 00 00 00 00 00 00 00 00
>>>> 00 00 00  ................
>>>> [  100.681562] ffff8b37f8f86010: 00 00 00 00 00 00 00 00 00 00 00 00 00
>>>> 00 00 00  ................
>>>> [  100.681563] ffff8b37f8f86020: 00 00 00 00 00 00 00 00 00 00 00 00 00
>>>> 00 00 00  ................
>>>> [  100.681564] ffff8b37f8f86030: 00 00 00 00 00 00 00 00 00 00 00 00 00
>>>> 00 00 00  ................
>>>> [  100.681596] XFS (vda1): metadata I/O error: block 0x179b800
>>>> ("xfs_trans_read_buf_map") error 117 numblks 32
>>>> [  100.681599] XFS (vda1): xfs_imap_to_bp: xfs_trans_read_buf()
>>>> returned error -117.
>>>> [   99.585766] cloud-init[2530]: Cloud-init v. 18.2 running
>>>> 'init-local' at Thu, 16 Apr 2020 10:44:21 +0000. Up 99.55 seconds.
>>>> [  OK  ] Started oVirt Guest Agent.
>>>>
>>>> [  101.086566] XFS (vda1): Metadata corruption detected at
>>>> xfs_inode_buf_verify+0x79/0x100 [xfs], xfs_inode block 0x179b800
>>>> [  101.092093] XFS (vda1): Unmount and run xfs_repair
>>>> [  101.094660] XFS (vda1): First 64 bytes of corrupted metadata buffer:
>>>> [  101.097787] ffff8b37fef07000: 00 00 00 00 00 00 00 00 00 00 00 00 00
>>>> 00 00 00  ................
>>>> [  101.105959] ffff8b37fef07010: 00 00 00 00 00 00 00 00 00 00 00 00 00
>>>> 00 00 00  ................
>>>> --
>>>> Best Regards
>>>> Pradeep Kumar
>>>>
>>>
>>>
>>> --
>>> Ruslanas Gžibovskis
>>> +370 6030 7030
>>>
>>
>>
>> --
>> Best Regards
>> Pradeep Kumar
>> _______________________________________________
>> dev mailing list
>> dev at lists.rdoproject.org
>> http://lists.rdoproject.org/mailman/listinfo/dev
>>
>> To unsubscribe: dev-unsubscribe at lists.rdoproject.org
>>
>
>
> Have you tried this with any other images? Maybe pick up a fresh image from
> https://cloud.centos.org/centos/7/images/ and run the same test again
> where you launch 6 instances.
>
> --
> ~/DonnyD
> C: 805 814 6800
> "No mission too difficult. No sacrifice too great. Duty First"
>


-- 
Best Regards
Pradeep Kumar
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 8053 bytes
Desc: not available
URL: <http://lists.rdoproject.org/pipermail/dev/attachments/20200421/65c3944a/attachment-0004.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 12691 bytes
Desc: not available
URL: <http://lists.rdoproject.org/pipermail/dev/attachments/20200421/65c3944a/attachment-0005.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 9834 bytes
Desc: not available
URL: <http://lists.rdoproject.org/pipermail/dev/attachments/20200421/65c3944a/attachment-0006.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 931296 bytes
Desc: not available
URL: <http://lists.rdoproject.org/pipermail/dev/attachments/20200421/65c3944a/attachment-0007.png>

