# Proxmox VE Shared iSCSI/LVM Orphan Disk Audit and Cleanup Procedure
## Purpose
This procedure describes how to identify and safely remove orphaned Proxmox VE virtual machine disks from shared iSCSI-backed LVM storage.
It is intended for environments where:
* Proxmox VE is clustered.
* Multiple Proxmox nodes access the same shared iSCSI LUN.
* The shared storage is exposed to Proxmox as LVM storage.
* VM disks are stored as LVM logical volumes.
* Some volumes may remain after VM disk deletion, failed migrations, failed resizes, storage UI inconsistencies, or manual recovery work.
The goal is to reclaim storage space without accidentally deleting disks that are still attached to running or stopped VMs.
---
## Scope
This document focuses on the following storage type:
```text
Proxmox storage type: lvm
Backing storage: shared iSCSI
Volume group: vg_proxmox_iscsi
Storage ID example: iscsi-cluster-lvm
```
Adjust the storage ID and volume group names as needed for your environment.
---
## Safety Requirements
!!! danger "Never delete based on the Storage UI alone"
    The Proxmox storage UI may show a volume as belonging to a VM because its name follows the pattern `vm-<vmid>-disk-<n>`. That does not prove the disk is currently attached to the VM.

    Always verify against VM configuration files and active QEMU processes before deleting.
!!! warning "Run the audit before running any cleanup commands"
    The audit scripts in this document are read-only. The cleanup commands are destructive. Do not run cleanup commands until the audit output has been reviewed.
!!! warning "Snapshot volumes require extra caution"
    Volumes named like the following may be part of a snapshot chain:

    ```text
    snap_vm-<vmid>-disk-<n>_<snapshot-name>
    ```

    Do not remove snapshot volumes manually unless you have verified that the VM and snapshot are no longer known to Proxmox, no backing chain references them, and no QEMU process has them open.
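!!! note "Shared storage does not mean shared config"
    `/etc/pve/qemu-server` is a symlink to the local node's `/etc/pve/nodes/<node>/qemu-server` directory, so it only contains config files for VMs assigned to that node. An audit that reads only `/etc/pve/qemu-server` on one node can therefore falsely report disks from VMs on other nodes as orphaned. Every node's configs are visible under `/etc/pve/nodes/*/qemu-server/`, but whether a volume is open can only be seen on the node that has it open. If the storage also holds container volumes, their configs live under `/etc/pve/nodes/*/lxc/` and their LVs use the same `vm-<vmid>-disk-<n>` naming.

    Run the confirmation script on every node in the cluster.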
---
## Terms
| Term | Meaning |
| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
| Attached disk | A disk volume referenced in a VM config, such as `scsi0`, `sata0`, `virtio0`, `efidisk0`, or `tpmstate0`. |
| Orphan disk | A storage volume that exists on shared storage but is not referenced by any VM config on any node and is not opened by any active process. |
| Volume ID | Proxmox storage identifier, such as `iscsi-cluster-lvm:vm-107-disk-1.qcow2`. |
| LV | LVM logical volume, such as `/dev/vg_proxmox_iscsi/vm-107-disk-1.qcow2`. |
| Snapshot chain | A chain of qcow2 backing files or Proxmox snapshot volumes. |
---
# Phase 1: Identify Storage Names
Run this on any Proxmox node:
```bash
pvesm status
cat /etc/pve/storage.cfg
vgs
```
Identify the shared iSCSI/LVM storage.
Example:
```text
Storage ID: iscsi-cluster-lvm
VG name: vg_proxmox_iscsi
```
For the rest of this document, replace these values if your environment differs:
```bash
STORAGE="iscsi-cluster-lvm"
VG="vg_proxmox_iscsi"
```
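If you would rather read the VG name from the storage definition than copy it by hand, the following is a minimal sketch that parses `storage.cfg`. It assumes the usual layout, a non-indented `lvm: <storage-id>` line followed by indented `key value` lines such as `vgname vg_proxmox_iscsi`; the function name `vg_for_storage` is hypothetical.

```bash
# Hypothetical helper: vg_for_storage <storage-id> [storage.cfg path]
# Assumes storage.cfg sections look like:
#   lvm: iscsi-cluster-lvm
#           vgname vg_proxmox_iscsi
vg_for_storage() {
    storage_id="$1"
    cfg="${2:-/etc/pve/storage.cfg}"
    awk -v id="$storage_id" '
        # A non-indented "<type>: <id>" line starts a new storage section.
        /^[^ \t#]/ { in_section = ($1 == "lvm:" && $2 == id); next }
        # Inside the matching lvm section, print the vgname value and stop.
        in_section && $1 == "vgname" { print $2; exit }
    ' "$cfg"
}

# Usage on a node:
# VG="$(vg_for_storage "$STORAGE")"
```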
---
# Phase 2: Run the Storage Orphan Audit
Run the following script on one node that can see the shared LVM storage.
This script does not delete anything.
```bash
cat > /root/pve-iscsi-orphan-audit.sh <<'EOF'
#!/usr/bin/env bash
set -u
STORAGE="iscsi-cluster-lvm"
VG="vg_proxmox_iscsi"
OUT="/root/pve-iscsi-orphan-audit-$(hostname)-$(date +%Y%m%d-%H%M%S).txt"
{
echo "===== PVE ISCSI ORPHAN AUDIT ====="
echo "Host: $(hostname)"
echo "Date: $(date)"
echo "Storage: ${STORAGE}"
echo "VG: ${VG}"
echo
echo "===== STORAGE STATUS ====="
pvesm status 2>&1 | grep -E "^(Name|${STORAGE})" || true
vgs "${VG}" 2>&1 || true
echo
echo "===== ALL VOLUMES IN ${STORAGE} ====="
pvesm list "${STORAGE}" 2>&1 || true
echo
echo "===== ALL LVs IN ${VG} ====="
lvs -a -o lv_name,lv_path,lv_size,lv_attr,devices "${VG}" 2>&1 || true
echo
echo "===== LOCAL VM CONFIG FILES ====="
for conf in /etc/pve/qemu-server/*.conf; do
[ -e "$conf" ] || continue
echo
echo "----- $conf -----"
cat "$conf"
done
echo
echo "===== REFERENCE ANALYSIS - LOCAL CONFIG FILES ONLY ====="
printf '%-55s | %-8s | %-10s | %-8s | %-8s | %s\n' \
"volume" "vmid" "referenced" "open" "size" "path"
{
pvesm list "${STORAGE}" 2>/dev/null | awk 'NR>1 {print $1}' | sed "s#^${STORAGE}:##"
lvs --noheadings -o lv_name "${VG}" 2>/dev/null | awk '{print $1}'
} | sort -u | while read -r vol; do
[ -n "$vol" ] || continue
case "$vol" in
vm-*-disk-*|snap_vm-*-disk-*) ;;
*) continue ;;
esac
vmid="unknown"
if [[ "$vol" =~ ^vm-([0-9]+)-disk- ]]; then
vmid="${BASH_REMATCH[1]}"
elif [[ "$vol" =~ ^snap_vm-([0-9]+)-disk- ]]; then
vmid="${BASH_REMATCH[1]}"
fi
ref="no"
if grep -R -Fq "$vol" /etc/pve/qemu-server 2>/dev/null; then
ref="yes"
fi
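open="no"
# lsof reports LV block devices as /dev/dm-N, so grepping its output for the LV
# name misses open disks. The 6th lv_attr character is "o" when the LV is open
# on this node.
attr="$(lvs --noheadings -o lv_attr "${VG}/${vol}" 2>/dev/null | awk '{$1=$1;print}')"
if [ "${attr:5:1}" = "o" ]; then
open="yes"
fi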
size="$(lvs --noheadings -o lv_size "${VG}/${vol}" 2>/dev/null | awk '{$1=$1;print}')"
path="$(lvs --noheadings -o lv_path "${VG}/${vol}" 2>/dev/null | awk '{$1=$1;print}')"
printf '%-55s | %-8s | %-10s | %-8s | %-8s | %s\n' \
"$vol" "$vmid" "$ref" "$open" "$size" "$path"
done
echo
echo "===== DONE ====="
} | tee "$OUT"
echo
echo "Saved audit to: $OUT"
EOF
chmod +x /root/pve-iscsi-orphan-audit.sh
/root/pve-iscsi-orphan-audit.sh
```
The script writes a file similar to:
```text
/root/pve-iscsi-orphan-audit-<node>-<timestamp>.txt
```
---
## How to Read the First Audit
The most important section is:
```text
REFERENCE ANALYSIS - LOCAL CONFIG FILES ONLY
```
Example:
```text
volume | vmid | referenced | open | size
vm-107-disk-0.qcow2 | 107 | no | no | 4.00m
vm-107-disk-1.qcow2 | 107 | no | no | 256.04g
vm-107-disk-2.qcow2 | 107 | yes | no | 4.00m
vm-107-disk-3.qcow2 | 107 | yes | no | 256.04g
```
Interpretation:
| Field | Meaning |
| ---------------- | ------------------------------------------------------------------------------------------------------- |
| `referenced=yes` | The volume appears in a local VM config file. Do not delete. |
| `referenced=no` | The volume does not appear in local VM configs. It may be orphaned, but confirm across all nodes first. |
| `open=yes` | A process has the volume open. Do not delete. |
| `open=no` | No process on this node has the volume open. Still confirm across all nodes. |
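Once the audit file has been saved, the rows worth carrying into Phase 3 can be pulled out mechanically. The following is a sketch, assuming the pipe-separated column layout printed by `/root/pve-iscsi-orphan-audit.sh` above; `audit_candidates` is a hypothetical name. Its output is only a local shortlist, not a deletion list.

```bash
# Hypothetical helper: audit_candidates <audit-file>
# Prints "volume (size)" for rows with referenced=no and open=no.
audit_candidates() {
    awk -F'|' '
        # Only rows inside the reference-analysis section are considered.
        /^===== REFERENCE ANALYSIS/ { in_table = 1; next }
        /^===== / { in_table = 0 }
        in_table && NF >= 6 {
            # Trim the padding that printf added around each column.
            for (i = 1; i <= NF; i++) gsub(/^[ \t]+|[ \t]+$/, "", $i)
            if ($1 != "volume" && $3 == "no" && $4 == "no") print $1 " (" $5 ")"
        }
    ' "$1"
}

# Usage:
# audit_candidates /root/pve-iscsi-orphan-audit-<node>-<timestamp>.txt
```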
!!! warning "Local reference analysis is not enough"
    If a VM runs on another cluster node, its config may not appear on the node where you ran the audit. This can make valid disks look orphaned.

    Continue to Phase 3 before deleting anything.
---
# Phase 3: Run Cluster-Wide Confirmation
Run the following script on **every Proxmox node** in the cluster.
This script is read-only.
```bash
cat > /root/pve-cluster-vm-confirm.sh <<'EOF'
#!/usr/bin/env bash
set -u
STORAGE="iscsi-cluster-lvm"
VG="vg_proxmox_iscsi"
OUT="/root/pve-cluster-vm-confirm-$(hostname)-$(date +%Y%m%d-%H%M%S).txt"
{
echo "===== NODE ====="
hostname
date
echo
echo "===== CLUSTER RESOURCES - VMS ====="
pvesh get /cluster/resources --type vm 2>&1 || true
echo
echo "===== LOCAL QM LIST ====="
qm list 2>&1 || true
echo
echo "===== QEMU CONFIG FILES PRESENT ====="
ls -la /etc/pve/qemu-server/ 2>&1 || true
echo
echo "===== QEMU CONFIG FILE CONTENTS ====="
for conf in /etc/pve/qemu-server/*.conf; do
[ -e "$conf" ] || continue
echo
echo "----- $conf -----"
cat "$conf"
done
echo
echo "===== ALL STORAGE VOLUMES ====="
pvesm list "${STORAGE}" 2>&1 || true
echo
echo "===== ALL LVs ====="
lvs -a -o lv_name,lv_path,lv_size,lv_attr,devices "${VG}" 2>&1 || true
} | tee "$OUT"
echo
echo "Saved to: $OUT"
EOF
chmod +x /root/pve-cluster-vm-confirm.sh
/root/pve-cluster-vm-confirm.sh
```
Collect the output file from each node.
Example for a three-node cluster:
```text
/root/pve-cluster-vm-confirm-cluster-node-01-YYYYMMDD-HHMMSS.txt
/root/pve-cluster-vm-confirm-cluster-node-02-YYYYMMDD-HHMMSS.txt
/root/pve-cluster-vm-confirm-cluster-node-03-YYYYMMDD-HHMMSS.txt
```
---
## How to Read the Cluster Confirmation
For each suspicious volume, search the output files from every node.
Example candidate:
```text
vm-107-disk-1.qcow2
```
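Check whether it appears in any VM config on any node. Search `/etc/pve/nodes/*/qemu-server/` rather than `/etc/pve/qemu-server/`, which only covers the local node:
```bash
grep -R -F "vm-107-disk-1.qcow2" /etc/pve/nodes/*/qemu-server/ || true
```
If reviewing output files manually, look for config lines such as:
```text
scsi0: iscsi-cluster-lvm:vm-107-disk-1.qcow2
sata0: iscsi-cluster-lvm:vm-107-disk-1.qcow2
virtio0: iscsi-cluster-lvm:vm-107-disk-1.qcow2
efidisk0: iscsi-cluster-lvm:vm-107-disk-1.qcow2
tpmstate0: iscsi-cluster-lvm:vm-107-disk-1.qcow2
unused0: iscsi-cluster-lvm:vm-107-disk-1.qcow2
```
If a volume appears in any of those lines, it belongs to a VM and must not be deleted. Detached disks remain in the VM config as `unused<n>` entries, so they are not orphans either.

Also check the `ALL LVs` section of each node's output. If the sixth character of a volume's `Attr` value is `o` (for example, `-wi-ao----`), the LV is open on that node and must not be deleted.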
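With an output file collected from every node, both checks (config references and the `lv_attr` open flag) can be run against all files at once. This is a sketch, assuming the section headers written by `/root/pve-cluster-vm-confirm.sh`; `check_candidate` is a hypothetical helper, and empty output only means that none of the collected files showed a reference or an open LV.

```bash
# Hypothetical helper: check_candidate <volume> <confirm-file>...
# For each saved confirmation file, print config lines that mention the volume
# and flag the LV if that node's lvs output shows it open.
check_candidate() {
    vol="$1"
    shift
    for f in "$@"; do
        awk -v vol="$vol" '
            # Track which "===== ... =====" section of the file we are in.
            /^===== / { section = $0; next }
            # The first line of the NODE section is the hostname.
            section == "===== NODE =====" && node == "" { node = $0 }
            # Any config line mentioning the volume counts as a reference.
            section == "===== QEMU CONFIG FILE CONTENTS =====" && index($0, vol) {
                print node ": referenced: " $0
            }
            # lvs columns: LV Path LSize Attr Devices; Attr char 6 "o" = open.
            section == "===== ALL LVs =====" && $1 == vol && substr($4, 6, 1) == "o" {
                print node ": LV is OPEN on this node (lv_attr " $4 ")"
            }
        ' "$f"
    done
}
```

For example, after copying every node's confirmation file into one directory: `check_candidate vm-107-disk-1.qcow2 /root/pve-cluster-vm-confirm-*.txt`. It reads only the saved outputs, so it reflects the state at the time each node ran the script.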
---
# Phase 4: Classify Candidate Volumes
Use the following decision table.
| Condition | Classification | Action |
| --------------------------------------------------------------------- | ------------------- | ---------------------------------------------- |
| Volume appears in any VM config on any node | In use | Do not delete |
| Volume is opened by QEMU or another process | In use or unsafe | Do not delete |
| Volume is a `snap_vm-*` snapshot volume | Snapshot-chain item | Inspect snapshot/backing chain before deletion |
| Volume does not appear in any VM config and is not open | Orphan candidate | Eligible for final verification |
| VMID no longer exists in cluster resources and disk is not referenced | Strong orphan | Eligible for cleanup |
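The decision table can also be expressed as a small function so that every candidate is classified the same way. This sketch assumes you have already gathered the three facts (referenced on any node, open on any node, VMID still listed by `pvesh get /cluster/resources --type vm`); `classify_volume` is hypothetical, and anything other than an explicit `no` for "referenced" or "open" is treated as `yes`, so a typo errs toward keeping the disk.

```bash
# Hypothetical helper encoding the decision table, checked top to bottom:
#   classify_volume <volume> <referenced> <open> <vmid_exists>
# Anything other than an explicit "no" for referenced/open is treated as "yes".
classify_volume() {
    vol="$1" referenced="$2" open="$3" vmid_exists="$4"
    if [ "$referenced" != "no" ]; then
        echo "In use: do not delete"
    elif [ "$open" != "no" ]; then
        echo "In use or unsafe: do not delete"
    else
        case "$vol" in
            snap_vm-*)
                echo "Snapshot-chain item: inspect snapshot/backing chain before deletion" ;;
            *)
                if [ "$vmid_exists" = "no" ]; then
                    echo "Strong orphan: eligible for cleanup"
                else
                    echo "Orphan candidate: eligible for final verification"
                fi ;;
        esac
    fi
}
```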
---
## Example: Valid VM Disks
If VM `107` has this config:
```text
efidisk0: iscsi-cluster-lvm:vm-107-disk-2.qcow2
sata0: iscsi-cluster-lvm:vm-107-disk-3.qcow2
```
Then these disks are valid and must not be deleted:
```text
vm-107-disk-2.qcow2
vm-107-disk-3.qcow2
```
If storage also contains:
```text
vm-107-disk-0.qcow2
vm-107-disk-1.qcow2
```
and neither appears in any config file on any node, those are orphan candidates.
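The same comparison can be done as a set difference. The sketch below assumes two hand-prepared files: one volume name per line (for example from `pvesm list`), and every config line gathered from all nodes. `orphan_candidates` is a hypothetical name; it matches volume IDs including the storage prefix, so a same-named disk on another storage (such as `local-lvm:vm-107-disk-0`) is not mistaken for a reference.

```bash
# Hypothetical sketch: orphan_candidates <storage-id> <volume-list> <config-lines>
#   <volume-list>   one volume name per line, e.g. vm-107-disk-1.qcow2
#   <config-lines>  config lines gathered from every node
orphan_candidates() {
    storage="$1" vols="$2" confs="$3"
    refs="$(mktemp)"
    # Keep only "<storage>:<volume>" references, cut at the first comma or space.
    grep -oE "${storage}:[^,[:space:]]+" "$confs" | sed "s#^${storage}:##" | sort -u > "$refs"
    # comm -23 keeps lines found only in the first input: on storage, never referenced.
    sort -u "$vols" | comm -23 - "$refs"
    rm -f "$refs"
}
```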
---
# Phase 5: Final Verification Before Deletion
For each candidate volume, run the following checks on a node that can see the shared storage.
Replace the volume name as appropriate.
```bash
VOL="vm-107-disk-1.qcow2"
STORAGE="iscsi-cluster-lvm"
VG="vg_proxmox_iscsi"
echo "===== Check all cluster config references ====="
# /etc/pve/qemu-server only shows the local node; /etc/pve/nodes/*/ covers every node.
grep -R -F "$VOL" /etc/pve/nodes/*/qemu-server/ /etc/pve/nodes/*/lxc/ || true
echo
echo "===== Check Proxmox storage listing ====="
pvesm list "$STORAGE" | grep -F "$VOL" || true
echo
echo "===== Check LVM volume ====="
lvs -a -o lv_name,lv_path,lv_size,lv_attr,devices "$VG" | grep -F "$VOL" || true
echo
echo "===== Check whether the LV is open on this node ====="
# lsof lists LV block devices as /dev/dm-N, so grepping lsof output for the LV
# name can miss an open disk. The 6th lv_attr character is "o" when the LV is
# open on this node. Repeat this check on every node.
lvs --noheadings -o lv_name,lv_attr "${VG}/${VOL}" 2>/dev/null || true
echo
echo "===== Check qemu-img metadata if device path exists ====="
LVPATH="$(lvs --noheadings -o lv_path "${VG}/${VOL}" 2>/dev/null | awk '{$1=$1;print}')"
if [ -n "$LVPATH" ] && [ -e "$LVPATH" ]; then
qemu-img info --backing-chain "$LVPATH"
else
echo "LV path missing or inactive: $LVPATH"
fi
```
Safe deletion pattern:
```text
grep -R ... no output
pvesm list ... shows the volume
lvs ... shows the volume
lv_attr ... 6th character is not "o" on any node
qemu-img info ... no unexpected backing file dependency
```
!!! danger "Stop if grep finds a reference"
    If the candidate volume appears in any config file under `/etc/pve/nodes/*/qemu-server/` or `/etc/pve/nodes/*/lxc/`, do not delete it.

!!! danger "Stop if the LV is open"
    If `lv_attr` shows the LV as open (`o` in the 6th position) on any node, do not delete it.
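A candidate's own `qemu-img info --backing-chain` output shows what it depends on, but not whether another image uses it as a backing file. The sketch below reads `qemu-img info --backing-chain` output for one or more images on stdin and prints each `image -> backing file` pair. `backing_deps` is hypothetical and assumes qemu-img's human-readable output format; the exact form of the backing file name (relative name or full path) may vary.

```bash
# Hypothetical helper: backing_deps < qemu-img-output
# Prints "image -> backing file" for every backing-file line in
# `qemu-img info --backing-chain` output (human-readable format).
backing_deps() {
    awk '
        /^image: / { image = substr($0, 8) }
        /^backing file: / {
            backing = substr($0, 15)
            # qemu-img may append " (actual path: ...)" to the backing file name.
            sub(/ \(actual path: .*\)$/, "", backing)
            print image " -> " backing
        }
    '
}
```

For example, on a node where the VG's other qcow2 LVs are active, `for lv in /dev/vg_proxmox_iscsi/*.qcow2; do qemu-img info -U --backing-chain "$lv"; done | backing_deps | grep -F "$VOL"` lists any image that still names the candidate as a backing file. `-U` (`--force-share`) lets `qemu-img` read images that a running VM has open. Only LVs active on that node have device nodes, so repeat the loop on each node.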
---
# Phase 6: Cleanup Commands
## Preferred Method: Proxmox Storage Layer
Use `pvesm free` first.
Example:
```bash
pvesm free iscsi-cluster-lvm:vm-107-disk-0.qcow2
pvesm free iscsi-cluster-lvm:vm-107-disk-1.qcow2
```
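When several volumes are being freed, a dry-run wrapper makes the final command list reviewable before anything is destroyed. This is a sketch: `free_volumes` and the `APPLY` variable are hypothetical. It only prints the `pvesm free` commands unless `APPLY=1` is set, and it skips anything that does not look like a `vm-<vmid>-disk-<n>` volume, including `snap_vm-*` volumes, which need the snapshot procedure instead.

```bash
# Hypothetical wrapper: free_volumes <storage-id> <volume>...
# Prints the pvesm free commands by default; runs them only when APPLY=1.
free_volumes() {
    storage="$1"
    shift
    for vol in "$@"; do
        # Only plain VM disk volumes; snapshot volumes are handled separately.
        case "$vol" in
            vm-*-disk-*) ;;
            *) echo "skipping unexpected name: $vol" >&2; continue ;;
        esac
        if [ "${APPLY:-0}" = "1" ]; then
            pvesm free "${storage}:${vol}" || echo "pvesm free failed for ${vol}" >&2
        else
            echo "pvesm free ${storage}:${vol}"
        fi
    done
}
```

For example, `free_volumes iscsi-cluster-lvm vm-107-disk-0.qcow2 vm-107-disk-1.qcow2` prints the two commands shown above; prefixing the same call with `APPLY=1` would execute them.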
Then verify:
```bash
pvesm list iscsi-cluster-lvm | grep "vm-107-disk" || true
lvs -a -o lv_name,lv_size,lv_attr,devices vg_proxmox_iscsi | grep "vm-107-disk" || true
vgs vg_proxmox_iscsi
pvesm status | grep -E '^(Name|iscsi-cluster-lvm)'
```
Expected result:
```text
vm-107-disk-0.qcow2 gone
vm-107-disk-1.qcow2 gone
vm-107-disk-2.qcow2 still present
vm-107-disk-3.qcow2 still present
```
---
## Fallback Method: Direct LVM Removal
Only use this if `pvesm free` refuses and the final verification confirms the volume is not referenced and not open.
```bash
lvremove /dev/vg_proxmox_iscsi/vm-107-disk-0.qcow2
lvremove /dev/vg_proxmox_iscsi/vm-107-disk-1.qcow2
```
Then refresh device nodes and verify:
```bash
vgscan --mknodes
udevadm settle
pvesm list iscsi-cluster-lvm | grep "vm-107-disk" || true
lvs -a -o lv_name,lv_size,lv_attr,devices vg_proxmox_iscsi | grep "vm-107-disk" || true
vgs vg_proxmox_iscsi
```
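If you do fall back to `lvremove`, a guard that re-checks the open flag immediately before removal adds one last local safety check. This is a minimal sketch; `guarded_lvremove` is hypothetical, and it only sees the node it runs on, so it does not replace the cross-node checks from Phase 3 and Phase 5.

```bash
# Hypothetical guard: guarded_lvremove <vg> <lv>
# Refuses when the LV is missing or when this node's lv_attr shows it open.
guarded_lvremove() {
    vg="$1" lv="$2"
    attr="$(lvs --noheadings -o lv_attr "${vg}/${lv}" 2>/dev/null | tr -d ' ')"
    if [ -z "$attr" ]; then
        echo "LV ${vg}/${lv} not found; nothing removed" >&2
        return 1
    fi
    # The 6th lv_attr character is "o" when the LV is open on this node.
    if [ "$(printf '%s' "$attr" | cut -c6)" = "o" ]; then
        echo "LV ${vg}/${lv} is open on this node (lv_attr ${attr}); refusing" >&2
        return 1
    fi
    lvremove "/dev/${vg}/${lv}"
}
```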
!!! warning "Prefer pvesm free over lvremove"
    `pvesm free` lets Proxmox remove the volume through its storage abstraction. Use direct `lvremove` only when Proxmox refuses and the orphan status is already proven.
---
# Phase 7: Post-Cleanup Validation
After deleting orphan volumes, validate storage and VM health.
```bash
pvesm status
vgs vg_proxmox_iscsi
lvs -a -o lv_name,lv_size,lv_attr,devices vg_proxmox_iscsi
```
Check the affected VM's config:
```bash
qm config 107
```
Confirm the VM is still in its expected state (for example, `running`):
```bash
qm status 107
```
If the VM is running, confirm its active QEMU process only references expected disks:
```bash
# The trailing space keeps "kvm -id 107" from also matching VMIDs such as 1070.
ps auxww | grep "kvm -id 107 " | grep -o "/dev/vg_proxmox_iscsi/[^ ,\"]*" | sort -u
```
Expected example:
```text
/dev/vg_proxmox_iscsi/vm-107-disk-2.qcow2
/dev/vg_proxmox_iscsi/vm-107-disk-3.qcow2
```
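The `grep -o` pipeline above can be wrapped as a reusable filter. The sketch below is hypothetical (`vm_disk_paths`): it reads a QEMU command line on stdin and stops each path at a space, comma, or quote, so it handles both `file=` style `-drive` arguments and JSON-style `-blockdev` arguments. The PID file path in the usage comment, `/var/run/qemu-server/<vmid>.pid`, is an assumption about where qemu-server keeps the running VM's PID.

```bash
# Hypothetical helper: vm_disk_paths <vg> < command-line
# Lists unique /dev/<vg>/... paths found in a QEMU command line on stdin.
# Each match stops at a space, comma, or quote, which covers both
# "-drive file=...," and JSON-style "-blockdev" arguments.
vm_disk_paths() {
    vg="$1"
    grep -oE "/dev/${vg}/[^ ,\"']+" | sort -u
}

# Usage (assumes qemu-server keeps the PID in /var/run/qemu-server/<vmid>.pid):
# tr '\0' '\n' < "/proc/$(cat /var/run/qemu-server/107.pid)/cmdline" | vm_disk_paths vg_proxmox_iscsi
```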
---
# Snapshot Volume Handling
Snapshot volumes require additional review.
Examples:
```text
snap_vm-105-disk-0_Fresh_Install.qcow2
snap_vm-106-disk-0_Fresh_Install_FullyUpdated.qcow2
```
Before deleting a snapshot volume, check:
```bash
qm config <vmid>
qm listsnapshot <vmid>
grep -R -F "snap_vm-<vmid>" /etc/pve/nodes/*/qemu-server/ || true
qemu-img info --backing-chain /dev/vg_proxmox_iscsi/vm-<vmid>-disk-<n>.qcow2
```
If the VM still has a `parent:` line or `qm listsnapshot` shows the snapshot, remove it through Proxmox first:
```bash
qm delsnapshot <vmid> <snapshot-name>
```
Only consider manual removal if:
* the VM no longer references the snapshot,
* no backing chain references the snapshot volume,
* no QEMU process has it open,
* and Proxmox cannot delete it normally.
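The first condition, whether the VM still knows the snapshot, can be checked from the volume name. This sketch assumes snapshot volumes follow the `snap_vm-<vmid>-disk-<n>_<snapshot-name>.qcow2` pattern shown above, and relies on Proxmox storing each snapshot as a `[<snapshot-name>]` section in the VM config; `snapshot_status` is hypothetical. It covers only that one condition, so the backing-chain and open checks still apply.

```bash
# Hypothetical helper: snapshot_status <snapshot-volume> <vm-config-file>
# Returns 0 if the config still defines the snapshot, 1 if it does not,
# and 2 if the volume name does not look like a snapshot volume.
snapshot_status() {
    vol="$1" conf="$2"
    # snap_vm-105-disk-0_Fresh_Install.qcow2 -> Fresh_Install
    snap="$(printf '%s\n' "$vol" | sed -nE 's/^snap_vm-[0-9]+-disk-[0-9]+_(.+)\.qcow2$/\1/p')"
    if [ -z "$snap" ]; then
        echo "not a snapshot volume name: $vol"
        return 2
    fi
    # Each Proxmox snapshot is a "[<name>]" section in the VM config.
    if [ -f "$conf" ] && grep -qxF "[$snap]" "$conf"; then
        echo "snapshot '$snap' is still defined in $conf; remove it with qm delsnapshot"
        return 0
    fi
    echo "snapshot '$snap' is not defined in $conf"
    return 1
}
```

For example: `snapshot_status snap_vm-105-disk-0_Fresh_Install.qcow2 /etc/pve/nodes/*/qemu-server/105.conf`. If the VM config no longer exists at all, the function also reports the snapshot as not defined.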
!!! danger "Do not manually delete active snapshot-chain volumes"
    Deleting an active snapshot backing volume can corrupt the VM disk chain.
---
# Example Cleanup Walkthrough
## Scenario
VM `107` has this config:
```text
efidisk0: iscsi-cluster-lvm:vm-107-disk-2.qcow2
sata0: iscsi-cluster-lvm:vm-107-disk-3.qcow2
```
Storage contains:
```text
vm-107-disk-0.qcow2
vm-107-disk-1.qcow2
vm-107-disk-2.qcow2
vm-107-disk-3.qcow2
```
`disk-0` and `disk-1` do not appear in any config and are not open by any process.
## Verify
```bash
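grep -R -F "vm-107-disk-0.qcow2" /etc/pve/nodes/*/qemu-server/ || true
grep -R -F "vm-107-disk-1.qcow2" /etc/pve/nodes/*/qemu-server/ || true
# Prints lv_attr only if its 6th character is "o" (open on this node); run on every node.
lvs --noheadings -o lv_attr vg_proxmox_iscsi/vm-107-disk-0.qcow2 | awk 'substr($1, 6, 1) == "o"'
lvs --noheadings -o lv_attr vg_proxmox_iscsi/vm-107-disk-1.qcow2 | awk 'substr($1, 6, 1) == "o"'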
```
Expected output:
```text
no output
```
## Delete
```bash
pvesm free iscsi-cluster-lvm:vm-107-disk-0.qcow2
pvesm free iscsi-cluster-lvm:vm-107-disk-1.qcow2
```
## Validate
```bash
pvesm list iscsi-cluster-lvm | grep "vm-107-disk"
lvs -a -o lv_name,lv_size,lv_attr,devices vg_proxmox_iscsi | grep "vm-107-disk"
vgs vg_proxmox_iscsi
```
Expected remaining volumes:
```text
vm-107-disk-2.qcow2
vm-107-disk-3.qcow2
```
---
# Technician Checklist
Use this checklist before removing any orphan disk.
* [ ] I ran the storage orphan audit.
* [ ] I ran the cluster confirmation script on every Proxmox node.
* [ ] I confirmed the candidate volume is not referenced in any VM config.
* [ ] I confirmed the candidate volume is not open by any process.
* [ ] I confirmed the candidate volume is not part of an active snapshot chain.
* [ ] I checked whether the volume's VMID still exists in the cluster and which VM, if any, owns it.
* [ ] I used `pvesm free` first.
* [ ] I used `lvremove` only if Proxmox refused and the volume was proven orphaned.
* [ ] I validated storage state after cleanup.
* [ ] I validated the affected VM still references only expected disks.
---
# Quick Reference Commands
## List shared storage volumes
```bash
pvesm list iscsi-cluster-lvm
```
## List LVs
```bash
lvs -a -o lv_name,lv_path,lv_size,lv_attr,devices vg_proxmox_iscsi
```
## Search VM configs
```bash
grep -R -F "vm-<vmid>-disk-<n>" /etc/pve/nodes/*/qemu-server/ || true
```
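## Check whether the LV is open
```bash
# A 6th lv_attr character of "o" means the LV is open on this node; run on every node.
lvs --noheadings -o lv_name,lv_attr vg_proxmox_iscsi/vm-<vmid>-disk-<n>.qcow2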
```
## Check image metadata
```bash
qemu-img info --backing-chain /dev/vg_proxmox_iscsi/vm-<vmid>-disk-<n>.qcow2
```
## Delete via Proxmox
```bash
pvesm free iscsi-cluster-lvm:vm-<vmid>-disk-<n>.qcow2
```
## Delete via LVM fallback
```bash
lvremove /dev/vg_proxmox_iscsi/vm-<vmid>-disk-<n>.qcow2
```
## Verify storage usage
```bash
pvesm status
vgs vg_proxmox_iscsi
```