Friday, January 25, 2013

How to verify VMFS Heartbeat Region Corruption

Recently I was discussing an issue where the logs show messages that report a heartbeat (HB) offset rather than the actual location of the bytes on disk.

After some research, I found that the heartbeat region on the VMFS volume had become corrupted, due to a power outage or some other cause.

You will see the following messages in the logs:


cpu2:2816)WARNING: HBX: 533: Volume 4f7217c1-3589352d-625d-001b22248a7a ("1TB-LAB") may be damaged on disk. Corrupt heartbeat detected at offset 3751936: [HB state 0 offset 0 gen 0 stampUS 0 uuid 00000000-00000000-00

cpu5:2980)WARNING: HBX: 533: Volume 4f7217c1-3589352d-625d-001b22248a7a ("1TB-LAB") may be damaged on disk. Corrupt heartbeat detected at offset 3751936: [HB state 0 offset 0 gen 0 stampUS 0 uuid 00000000-00000000-00
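If you need to pick these warnings out of a large log, a minimal Python sketch like the one below can extract the volume UUID, the label, and the reported offset. The pattern is my own and only covers the fields visible in the (truncated) lines above; the function name is hypothetical, and other ESXi builds may word the message differently.

```python
import re

# Pattern for the "Corrupt heartbeat detected" warning shown above.
# Assumption: it only matches the fields visible in the truncated log lines.
HB_WARNING = re.compile(
    r'Volume (?P<uuid>[0-9a-f-]+) \("(?P<label>[^"]*)"\) may be damaged on disk\. '
    r'Corrupt heartbeat detected at offset (?P<offset>\d+)'
)

def parse_hb_warning(line):
    """Return (uuid, label, offset) from a warning line, or None if no match."""
    m = HB_WARNING.search(line)
    if m is None:
        return None
    return m.group("uuid"), m.group("label"), int(m.group("offset"))

line = ('cpu2:2816)WARNING: HBX: 533: Volume 4f7217c1-3589352d-625d-001b22248a7a '
        '("1TB-LAB") may be damaged on disk. Corrupt heartbeat detected at offset 3751936: '
        '[HB state 0 offset 0 gen 0 stampUS 0 uuid 00000000-00000000-00')

print(parse_hb_warning(line))
# ('4f7217c1-3589352d-625d-001b22248a7a', '1TB-LAB', 3751936)
```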


Now run the following command on your ESXi 5.x host, from the console via SSH or the DCUI:

hexdump -s 22626304 -n 2048 -C /vmfs/devices/disks/

The output will be similar to this:

hexdump -s 22626304 -n 2048 -C /vmfs/devices/disks/

01594000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
01594200  01 ef cd ab 00 42 39 00  00 00 00 00 de 00 00 00  |.....B9.........|
01594210  00 00 00 00 cd 31 82 11  00 00 00 00 00 00 00 00  |.....1..........|
01594220  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
01594230  0e 00 00 00 36 00 00 00  00 00 00 00 00 00 00 00  |....6...........|
01594240  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
01594400  01 ef cd ab 00 44 39 00  00 00 00 00 00 00 00 00  |.....D9.........|
01594410  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
01594600  01 ef cd ab 00 46 39 00  00 00 00 00 00 00 00 00  |.....F9.........|
01594610  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
01594800
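The numbers in the command map directly onto the address column of the dump. The short Python sketch below only does that arithmetic: the -s value is the first address in hex, -n covers four 512-byte sections, and the last address is where the dump ends. It also shows that the -s value and the offset in the log warning differ by exactly 18 MiB. That last point is purely an arithmetic observation; how the two offsets are actually related is not covered here, so treat any interpretation of it as a guess. The constant names are mine.

```python
START = 22626304       # value passed to hexdump -s
LENGTH = 2048          # value passed to hexdump -n
SLOT = 512             # spacing between the sections in the dump (0x200)
LOG_OFFSET = 3751936   # offset reported in the HBX warning
MIB = 1024 * 1024

print(f"{START:08x}")             # 01594000, first address in the dump
print(f"{START + LENGTH:08x}")    # 01594800, last address in the dump
print(LENGTH // SLOT)             # 4 sections

# Difference between the dump offset and the logged offset, in whole MiB
# and leftover bytes. Observation only; the cause is not stated in the post.
diff = START - LOG_OFFSET
print(diff // MIB, diff % MIB)    # 18 0
```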

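Now look at the start of each of the 4 sections in the output above (highlighted in bold red in the original post). Three of them begin with 01 ef cd ab, but the 1st section, at 01594000, contains only zeros: its value is missing.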

Once the heartbeat region has been fixed, running the same command again shows the values as follows:


hexdump -s 22626304 -n 2048 -C /vmfs/devices/disks/

01594000  01 ef cd ab 00 40 39 00  00 00 00 00 00 00 00 00  |.....@9.........|
01594010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
01594200  01 ef cd ab 00 42 39 00  00 00 00 00 de 00 00 00  |.....B9.........|
01594210  00 00 00 00 cd 31 82 11  00 00 00 00 00 00 00 00  |.....1..........|
01594220  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
01594230  0e 00 00 00 36 00 00 00  00 00 00 00 00 00 00 00  |....6...........|
01594240  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
01594400  01 ef cd ab 00 44 39 00  00 00 00 00 00 00 00 00  |.....D9.........|
01594410  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
01594600  01 ef cd ab 00 46 39 00  00 00 00 00 00 00 00 00  |.....F9.........|
01594610  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
01594800

As you can see in the above output, all 4 sections now contain hex values (each begins with 01 ef cd ab), which is a sign of a healthy VMFS heartbeat region. If you previously got an error when viewing the files of a VM, try again; you should now be able to view the files on the affected datastore.
To fix the issue itself, I suggest contacting VMware Support if this is not a lab/test environment or if you are not sure how to fix the corruption. This information is only meant to give you an idea of how to verify whether there is indeed corruption on the VMFS volume.
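The verification step (not the repair) can be sketched in Python. The code below walks a block of bytes in 512-byte steps and reports every section that does not start with 01 ef cd ab. Both the 512-byte spacing and the use of those four bytes as a marker are assumptions taken from the dumps above, not from VMware documentation, and the function names are hypothetical. The sample data is rebuilt from the first 8 bytes of each section only, so it is a simplification of the real region.

```python
SIGNATURE = bytes.fromhex("01efcdab")  # first four bytes of each populated section
SLOT_SIZE = 512                        # spacing between sections (assumption)

def missing_signature_offsets(region, base_offset=0):
    """Return absolute offsets of sections that do not start with SIGNATURE."""
    bad = []
    for pos in range(0, len(region), SLOT_SIZE):
        if region[pos:pos + len(SIGNATURE)] != SIGNATURE:
            bad.append(base_offset + pos)
    return bad

def make_slot(header_hex):
    """Build one zero-padded 512-byte section from its leading bytes (hex)."""
    header = bytes.fromhex(header_hex)
    return header + bytes(SLOT_SIZE - len(header))

# Simplified sample data: only the first 8 bytes of each section are copied
# from the dumps; everything else is zero-filled.
healthy = b"".join(make_slot(h) for h in (
    "01efcdab00403900", "01efcdab00423900",
    "01efcdab00443900", "01efcdab00463900"))
corrupt = bytes(SLOT_SIZE) + healthy[SLOT_SIZE:]  # first section zeroed out

BASE = 22626304  # same value as hexdump -s
print([f"{off:08x}" for off in missing_signature_offsets(corrupt, BASE)])
# ['01594000']
print(missing_signature_offsets(healthy, BASE))
# []
```

A section that passes this check is not proven healthy; the sketch only automates the comparison made by eye above. To run it on real data, read the 2048 bytes from a saved copy of the region with open(path, "rb") and pass them in as region.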

For more information, refer to KB 1012036 and to the blog post by @VMwareStorage on VOMA, which covers metadata corruption.


I hope you find this information useful. If so, please share it.

2 comments:

  1. Is there any chance that you can let me know how to proceed with this repair?

    I have the same issue on ESXi 4.0 with no support and want to try to restore one machine. Multiple files are locked after a server crash.

    Thx for help.

  2. Hi,

    I went through this recently after an MSA P2000 G3 storage controller crash. Just about every .vmx and .vmdk was locked on all five filesystems. VMware provided me with a series of 512-byte patch files to fix the affected regions (anywhere from 1 to 3 of them per filesystem).

    They seem to guard these secrets closely. Based on the files I was sent I'm pretty sure they have an internal tool to scan the VMFS metadata and generate the patch files. I would like to understand how this works because waiting 3 days for them to fix simple heartbeat corruption is just balls when on any other filesystem it would be a simple fsck.

    How did you make the leap from "Corrupt heartbeat detected at offset 3751936" to "hexdump -s 22626304"? 3751936 doesn't divide into 22626304 evenly by 4 or 8 or anything.

    How did you correct the region at 0x01594000? I understand 01 ef cd ab is a signature but what are the other bytes?

    Thanks,
    Mike
