VMFS Recovery
Recovers data from VMware ESXi servers

What Causes VMFS Volume Corruption?

The Most Common Real-World Scenarios and How to Prevent Them

1. ESXi Host Crashes or Kernel Panics

A sudden ESXi crash during metadata updates is one of the most frequent causes of datastore corruption.

Typical triggers


  • PSOD (Purple Screen of Death)

  • Out-of-memory conditions

  • Faulty drivers

  • Hardware compatibility issues after upgrades

Prevention


  • Make sure your ESXi version is on the VMware HCL for your hardware.

  • Avoid unsupported NIC/storage drivers.

  • Check /var/log/vmkernel.log regularly for repeated errors.
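
For example, a quick way to spot recurring storage errors from the ESXi shell (the grep pattern below is only an illustrative starting point, not an exhaustive filter):

grep -iE "scsi|h:0x" /var/log/vmkernel.log | tail -n 20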

Recovery Tips


If only the host crashed, other hosts may still see the datastore. Try:

esxcli storage filesystem list
esxcli storage core adapter rescan --all

If the metadata is damaged, the volume may not mount. At that point you may need to scan the volume with a recovery tool.

2. SAN / RAID Controller Failures

A failing SAN controller, degraded cache module, or bad firmware can cause silent metadata corruption.

Symptoms


  • ESXi logs SCSI reservation conflicts

  • Datastore flickering between “normal” and “inaccessible”

  • Sudden read-only mode

Prevention


  • Enable battery-backed write cache.

  • Keep SAN firmware updated only with vendor-approved versions.

  • Replace suspicious controllers immediately (don’t wait for a full failure).

Recovery Tips


Check if individual LUNs appear correctly:

esxcfg-scsidevs -l

If the SAN corrupted block-level structures (VBLOCKs, SF nodes), automatic mounting will not work. In that situation, VMFS Recovery may be the last resort.
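
Before reaching for external tools, ESXi's built-in VOMA utility can run a read-only check of the VMFS metadata. A minimal sketch (the datastore must be unmounted first, and the device name is a placeholder):

# Map the datastore to its backing device and partition
esxcli storage vmfs extent list

# Read-only metadata check on that device/partition
voma -m vmfs -f check -d /vmfs/devices/disks/<naa.id>:1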

3. Power Loss During Metadata Updates

VMFS uses journaling, but a datastore update interrupted at the wrong moment can still result in structural damage.

At-Risk Operations


  • Creating/deleting snapshots

  • Expanding a VMDK

  • Creating a new datastore/partition

  • Extending an existing datastore

Prevention


  • Use UPSes with graceful shutdown.

  • Avoid running datastore modifications on unstable power grids.
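
If power is lost anyway, confirm the datastore remounted cleanly before restarting VMs; a minimal check from the ESXi shell:

esxcli storage filesystem list

The affected datastore should appear as mounted with its expected VMFS type; if it does not, move on to the recovery steps later in this article.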

4. Overuse or Mismanagement of VM Snapshots

Snapshots are useful, but mismanaged snapshots are one of the fastest ways to break VMFS volumes and VMDK chains.

Problems Snapshots Can Cause


  • Delta files growing beyond 2 TB

  • Excessive consolidation load corrupting metadata

  • Dead VMDK chains after failed snapshot delete

Prevention


  • Never keep snapshots longer than 3–7 days.

  • Check for hidden auto-snapshots from backup software.

  • Monitor the growth of delta files (xxx-000001.vmdk); see the example below.
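
One quick way to spot oversized delta files from the ESXi shell (the datastore path is a placeholder):

find /vmfs/volumes/<datastore>/ -name '*-00000*' -exec ls -lh {} \;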

Recovery Tips


Sometimes the VMFS volume is fine, but snapshots are corrupted. You can manually rebuild VMDK chains for free using descriptor files, unless the underlying VMFS allocation map is damaged.
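
A VMDK descriptor is plain text, so the chain can be inspected directly. A minimal sketch (all names and values are illustrative):

cat /vmfs/volumes/<datastore>/<vm>/<vm>-000001.vmdk

# Fields that define the chain in the output:
#   CID=fffffffe
#   parentCID=a1b2c3d4        <- must match the parent descriptor's CID
#   parentFileNameHint="<vm>.vmdk"

A broken chain usually shows up as a parentCID that no longer matches the parent's CID. On recent ESXi builds, vmkfstools -e <vm>-000001.vmdk reports whether the whole chain is consistent.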

5. Disk Firmware Bugs / SSD Write Hole

Consumer-grade SSDs and HDDs can corrupt data when cache flushes fail, power is lost suddenly, or controller firmware misbehaves.

Prevention


  • Use enterprise disks on the VMware HCL.

  • Enable array-level end-to-end checksums.

6. Failed RAID Rebuilds or UREs

RAID rebuilds are notorious for triggering corruption, especially RAID-5 on large disks.

Symptoms


  • VMFS mounts but some files are unreadable

  • VMDK contains bad sectors

  • The datastore disappears after rebuild

Prevention


  • Prefer RAID-10 for virtualization workloads.

  • Replace failing drives promptly.

  • Avoid mixing different drive models.

7. LUN Misconfiguration or Duplicate UUIDs

If two hosts present the same LUN with different access modes, metadata can be overwritten.

Prevention


  • Use consistent zoning and LUN masking.

  • After cloning a LUN, always run

    esxcli storage vmfs snapshot list

    and resignature it.
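
    For example (the volume label is a placeholder):

    esxcli storage vmfs snapshot resignature -l <datastore_label>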

8. Faulty or Misconfigured Multipathing (MPIO)

Incorrect path failover may cause incomplete writes, path thrashing, or delayed metadata commits.

Prevention


  • Use VMware-recommended PSP (Path Selection Policies).

  • Ensure all ESXi hosts use the same multipath config.
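
To verify and align the policy, for example (the device ID and policy below are placeholders; follow your array vendor's recommendation):

# Show the current path selection policy for each device
esxcli storage nmp device list

# Example: switch one device to Round Robin
esxcli storage nmp device set --device <naa.id> --psp VMW_PSP_RR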

9. VMFS Upgrades or ESXi Patching Failures

Interruptions during an in-place upgrade from VMFS-3 to VMFS-5 can break the volume entirely. (Moving from VMFS-5 to VMFS-6 is not an in-place upgrade: it requires creating a new VMFS-6 datastore and migrating VMs, which carries its own risks if interrupted.)

Prevention


  • Never upgrade large datastores while running on a low-capacity UPS.

  • Ensure storage paths are stable during upgrade.
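
Before and after an upgrade, record the datastore's filesystem version and layout, for example:

vmkfstools -Ph /vmfs/volumes/<datastore>

The first line of the output names the VMFS version; anything unexpected there is a red flag before you proceed.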

10. Human Errors

  • Accidentally overwriting a datastore with a new VMFS signature

  • Deleting a datastore from the wrong host

  • Expanding a datastore onto the wrong device

Prevention


  • Label physical volumes clearly.

  • Require 2-step confirmation before destructive actions.
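
A simple pre-flight check before any destructive action is to confirm which physical device backs which datastore:

esxcli storage vmfs extent list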

11. Bad Sectors on LUN or Physical Disk

Even with RAID, uncorrectable read errors can corrupt metadata structures stored on a single block.

Prevention


  • Monitor SMART and RAID controller logs.

  • Use disks with power-loss protection.
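
On hosts with local disks, SMART data can also be read directly from the ESXi shell (the device ID is a placeholder):

esxcli storage core device smart get -d <naa.id>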

How to Recover a VMFS Volume for Free (Before Using Any Tools)

  1. Force rescan

    esxcli storage core adapter rescan --all


  2. List VMFS snapshots

    esxcli storage vmfs snapshot list


  3. Attempt a VMFS mount

    esxcfg-volume -M <UUID>


  4. Try opening VMDK files directly (download via SCP and inspect descriptor headers manually).

  5. Mount as a secondary datastore on another host.
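
As an illustration, steps 2 and 3 often go together when a host detects the volume as a snapshot copy (the UUID is a placeholder):

esxcli storage vmfs snapshot list

# Mount without resignaturing, non-persistently, to inspect the contents first
esxcli storage vmfs snapshot mount -u <UUID> --no-persist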

When Manual Methods Are Not Enough

If the VMFS allocation map is damaged, metadata nodes are missing, VMDK chains are lost, or the datastore shows “uninitialized,” automated tools become the only practical option.

At this stage, VMFS Recovery is designed to reconstruct damaged VMFS 6/5/3 metadata, read VMDK fragments, and rebuild virtual disk chains even when ESXi cannot.

Final Thoughts

VMFS corruption almost always has a root cause: SAN instability, ESXi crashes, RAID issues, or simple snapshot misuse. With proactive monitoring and careful hardware selection, most cases can be avoided entirely.

But when the datastore is already damaged, recovering your VMs is the priority. Start with free methods, avoid writing anything to the affected LUN, and only then use specialized recovery tools if needed.
