VMFS Recovery
Recovers data from VMware ESXi servers

What Causes VMFS Volume Corruption?

The Most Common Real-World Scenarios and How to Prevent Them

1. ESXi Host Crashes or Kernel Panics

A sudden ESXi crash during metadata updates is one of the most frequent causes of datastore corruption.

Typical triggers


  • PSOD (Purple Screen of Death)

  • Out-of-memory conditions

  • Faulty drivers

  • Hardware compatibility issues after upgrades

Prevention


  • Make sure your ESXi version is on the VMware HCL for your hardware.

  • Avoid unsupported NIC/storage drivers.

  • Check /var/log/vmkernel.log regularly for repeated errors.
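
For example, a quick way to spot recurring storage errors from the ESXi shell (the grep pattern below is only an illustrative starting point, not an exhaustive filter):

grep -iE "scsi|h:0x" /var/log/vmkernel.log | tail -n 20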

Recovery Tips


If only the host crashed, other hosts may still see the datastore. Try:

esxcli storage filesystem list
esxcli storage core adapter rescan --all

If the metadata is damaged, the volume may not mount. At that point you may need to scan the volume with a recovery tool.

2. SAN / RAID Controller Failures

A failing SAN controller, degraded cache module, or bad firmware can cause silent metadata corruption.

Symptoms


  • ESXi logs SCSI reservation conflicts

  • Datastore flickering between “normal” and “inaccessible”

  • Sudden read-only mode

Prevention


  • Enable battery-backed write cache.

  • Keep SAN firmware updated only with vendor-approved versions.

  • Replace suspicious controllers immediately (don’t wait for a full failure).

Recovery Tips


Check if individual LUNs appear correctly:

esxcfg-scsidevs -l

If the SAN corrupted block-level structures (VBLOCKs, SF nodes), automatic mounting will not work. In that situation, VMFS Recovery may be the last resort.
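
Before reaching for external tools, ESXi's built-in VOMA utility can run a read-only check of the VMFS metadata. A minimal sketch (the datastore must be unmounted first, and the device name is a placeholder):

# Map the datastore to its backing device and partition
esxcli storage vmfs extent list

# Read-only metadata check on that device/partition
voma -m vmfs -f check -d /vmfs/devices/disks/<naa.id>:1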

3. Power Loss During Metadata Updates

VMFS uses journaling, but a datastore update interrupted at the wrong moment can still result in structural damage.

At-Risk Operations


  • Creating/deleting snapshots

  • Expanding a VMDK

  • Creating a new datastore/partition

  • Extending an existing datastore

Prevention


  • Use UPSes with graceful shutdown.

  • Avoid running datastore modifications on unstable power grids.
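
If power is lost anyway, confirm the datastore remounted cleanly before restarting VMs; a minimal check from the ESXi shell:

esxcli storage filesystem list

The affected datastore should appear as mounted with its expected VMFS type; if it does not, move on to the recovery steps later in this article.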

4. Overuse or Mismanagement of VM Snapshots

Snapshots are useful, but mismanaged snapshots are one of the fastest ways to break VMFS volumes and VMDK chains.

Problems Snapshots Can Cause


  • Delta files growing beyond 2 TB

  • Excessive consolidation load corrupting metadata

  • Dead VMDK chains after failed snapshot delete

Prevention


  • Never keep snapshots longer than 3–7 days.

  • Check for hidden auto-snapshots from backup software.

  • Monitor the growth of delta files (xxx-000001.vmdk); see the example below.
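
One quick way to spot oversized delta files from the ESXi shell (the datastore path is a placeholder):

find /vmfs/volumes/<datastore>/ -name '*-00000*' -exec ls -lh {} \;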

Recovery Tips


Sometimes the VMFS volume is fine, but snapshots are corrupted. You can manually rebuild VMDK chains for free using descriptor files, unless the underlying VMFS allocation map is damaged.
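
A VMDK descriptor is plain text, so the chain can be inspected directly. A minimal sketch (all names and values are illustrative):

cat /vmfs/volumes/<datastore>/<vm>/<vm>-000001.vmdk

# Fields that define the chain in the output:
#   CID=fffffffe
#   parentCID=a1b2c3d4        <- must match the parent descriptor's CID
#   parentFileNameHint="<vm>.vmdk"

A broken chain usually shows up as a parentCID that no longer matches the parent's CID. On recent ESXi builds, vmkfstools -e <vm>-000001.vmdk reports whether the whole chain is consistent.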

5. Disk Firmware Bugs / SSD Write Hole

Consumer-grade SSDs and HDDs can corrupt data when cache flushes fail, power is lost suddenly, or controller firmware misbehaves.

Prevention


  • Use enterprise disks on the VMware HCL.

  • Enable array-level end-to-end checksums.

6. Failed RAID Rebuilds or UREs

RAID rebuilds are notorious for triggering corruption, especially RAID-5 on large disks.

Symptoms


  • VMFS mounts but some files are unreadable

  • VMDK contains bad sectors

  • The datastore disappears after rebuild

Prevention


  • Prefer RAID-10 for virtualization workloads.

  • Replace failing drives promptly.

  • Avoid mixing different drive models.

7. LUN Misconfiguration or Duplicate UUIDs

If two hosts present the same LUN with different access modes, metadata can be overwritten.

Prevention


  • Use consistent zoning and LUN masking.

  • After cloning a LUN, always run

    esxcli storage vmfs snapshot list

    and resignature it.
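
    For example (the volume label is a placeholder):

    esxcli storage vmfs snapshot resignature -l <datastore_label>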

8. Faulty or Misconfigured Multipathing (MPIO)

Incorrect path failover may cause incomplete writes, path thrashing, or delayed metadata commits.

Prevention


  • Use VMware-recommended PSP (Path Selection Policies).

  • Ensure all ESXi hosts use the same multipath config.
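
To verify and align the policy, for example (the device ID and policy below are placeholders; follow your array vendor's recommendation):

# Show the current path selection policy for each device
esxcli storage nmp device list

# Example: switch one device to Round Robin
esxcli storage nmp device set --device <naa.id> --psp VMW_PSP_RR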

9. VMFS Upgrades or ESXi Patching Failures

Interruptions during an in-place upgrade from VMFS-3 to VMFS-5 can break the volume entirely. (Moving from VMFS-5 to VMFS-6 is not an in-place upgrade: it requires creating a new VMFS-6 datastore and migrating VMs, which carries its own risks if interrupted.)

Prevention


  • Never upgrade large datastores while running on a low-capacity UPS.

  • Ensure storage paths are stable during upgrade.
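
Before and after an upgrade, record the datastore's filesystem version and layout, for example:

vmkfstools -Ph /vmfs/volumes/<datastore>

The first line of the output names the VMFS version; anything unexpected there is a red flag before you proceed.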

10. Human Errors

  • Accidentally overwriting a datastore with a new VMFS signature

  • Deleting a datastore from the wrong host

  • Expanding a datastore onto the wrong device

Prevention


  • Label physical volumes clearly.

  • Require 2-step confirmation before destructive actions.
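
A simple pre-flight check before any destructive action is to confirm which physical device backs which datastore:

esxcli storage vmfs extent list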

11. Bad Sectors on LUN or Physical Disk

Even with RAID, uncorrectable read errors can corrupt metadata structures stored on a single block.

Prevention


  • Monitor SMART and RAID controller logs.

  • Use disks with power-loss protection.
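
On hosts with local disks, SMART data can also be read directly from the ESXi shell (the device ID is a placeholder):

esxcli storage core device smart get -d <naa.id>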

How to Recover a VMFS Volume for Free (Before Using Any Tools)

  1. Force rescan

    esxcli storage core adapter rescan --all


  2. List VMFS snapshots

    esxcli storage vmfs snapshot list


  3. Attempt a VMFS mount

    esxcfg-volume -M <UUID>


  4. Try opening VMDK files directly (download via SCP and inspect descriptor headers manually).

  5. Mount as a secondary datastore on another host.
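
As an illustration, steps 2 and 3 often go together when a host detects the volume as a snapshot copy (the UUID is a placeholder):

esxcli storage vmfs snapshot list

# Mount without resignaturing, non-persistently, to inspect the contents first
esxcli storage vmfs snapshot mount -u <UUID> --no-persist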

When Manual Methods Are Not Enough

If the VMFS allocation map is damaged, metadata nodes are missing, VMDK chains are lost, or the datastore shows “uninitialized,” automated tools become the only practical option.

At this stage, VMFS Recovery is designed to reconstruct damaged VMFS 6/5/3 metadata, read VMDK fragments, and rebuild virtual disk chains even when ESXi cannot.

Final Thoughts

VMFS corruption almost always has a root cause: SAN instability, ESXi crashes, RAID issues, or simple snapshot misuse. With proactive monitoring and careful hardware selection, most cases can be avoided entirely.

But when the datastore is already damaged, recovering your VMs is the priority. Start with free methods, avoid writing anything to the affected LUN, and only then use specialized recovery tools if needed.
