diff --git a/deployments/platforms/virtualization/proxmox/Detecting and Removing Orphaned VM Disks.md b/deployments/platforms/virtualization/proxmox/Detecting and Removing Orphaned VM Disks.md
new file mode 100644
index 0000000..64fc070
--- /dev/null
+++ b/deployments/platforms/virtualization/proxmox/Detecting and Removing Orphaned VM Disks.md
@@ -0,0 +1,703 @@
+# Proxmox VE Shared iSCSI/LVM Orphan Disk Audit and Cleanup Procedure
+
+## Purpose
+
+This procedure describes how to identify and safely remove orphaned Proxmox VE virtual machine disks from shared iSCSI-backed LVM storage.
+
+It is intended for environments where:
+
+* Proxmox VE is clustered.
+* Multiple Proxmox nodes access the same shared iSCSI LUN.
+* The shared storage is exposed to Proxmox as LVM storage.
+* VM disks are stored as LVM logical volumes.
+* Some volumes may remain after VM disk deletion, failed migrations, failed resizes, storage UI inconsistencies, or manual recovery work.
+
+The goal is to reclaim storage space without accidentally deleting disks that are still attached to running or stopped VMs.
+
+---
+
+## Scope
+
+This document focuses on the following storage type:
+
+```text
+Proxmox storage type: lvm
+Backing storage: shared iSCSI
+Volume group: vg_proxmox_iscsi
+Storage ID example: iscsi-cluster-lvm
+```
+
+Adjust the storage ID and volume group names as needed for your environment.
+
+---
+
+## Safety Requirements
+
+!!! danger "Never delete based on the Storage UI alone"
+    The Proxmox storage UI may show a volume as belonging to a VM because its name follows the pattern `vm-<vmid>-disk-<n>`. That does not prove the disk is currently attached to the VM.
+
+    Always verify against VM configuration files and active QEMU processes before deleting.
+
+!!! warning "Run the audit before running any cleanup commands"
+    The audit scripts in this document are read-only. The cleanup commands are destructive. Do not run cleanup commands until the audit output has been reviewed.
+
+!!! warning "Snapshot volumes require extra caution"
+    Volumes named like the following may be part of a snapshot chain:
+
+    ```text
+    snap_vm-<vmid>-disk-<n>_<snapshot-name>
+    ```
+
+    Do not remove snapshot volumes manually unless you have verified that the VM and snapshot are no longer known to Proxmox, no backing chain references them, and no QEMU process has them open.
+
+!!! note "Shared storage does not mean shared config"
+    `/etc/pve/qemu-server/` contains only the config files of VMs assigned to the local node. An audit that reads only that directory on one node can therefore falsely report disks from other nodes as orphaned.
+
+    Run the confirmation script on every node in the cluster.
+
+---
+
+## Terms
+
+| Term | Meaning |
+| --- | --- |
+| Attached disk | A disk volume referenced in a VM config, such as `scsi0`, `sata0`, `virtio0`, `efidisk0`, or `tpmstate0`. |
+| Orphan disk | A storage volume that exists on shared storage but is not referenced by any VM config on any node and is not opened by any active process. |
+| Volume ID | Proxmox storage identifier, such as `iscsi-cluster-lvm:vm-107-disk-1.qcow2`. |
+| LV | LVM logical volume, such as `/dev/vg_proxmox_iscsi/vm-107-disk-1.qcow2`. |
+| Snapshot chain | A chain of qcow2 backing files or Proxmox snapshot volumes. |
+
+---
+
+# Phase 1: Identify Storage Names
+
+Run this on any Proxmox node:
+
+```bash
+pvesm status
+cat /etc/pve/storage.cfg
+vgs
+```
+
+Identify the shared iSCSI/LVM storage.
+
+Example:
+
+```text
+Storage ID: iscsi-cluster-lvm
+VG name: vg_proxmox_iscsi
+```
+
+For the rest of this document, replace these values if your environment differs:
+
+```bash
+STORAGE="iscsi-cluster-lvm"
+VG="vg_proxmox_iscsi"
+```
+
+---
+
+# Phase 2: Run the Storage Orphan Audit
+
+Run the following script on one node that can see the shared LVM storage.
+
+This script does not delete anything.
+
+```bash
+cat > /root/pve-iscsi-orphan-audit.sh <<'EOF'
+#!/usr/bin/env bash
+set -u
+
+STORAGE="iscsi-cluster-lvm"
+VG="vg_proxmox_iscsi"
+OUT="/root/pve-iscsi-orphan-audit-$(hostname)-$(date +%Y%m%d-%H%M%S).txt"
+
+{
+  echo "===== PVE ISCSI ORPHAN AUDIT ====="
+  echo "Host: $(hostname)"
+  echo "Date: $(date)"
+  echo "Storage: ${STORAGE}"
+  echo "VG: ${VG}"
+
+  echo
+  echo "===== STORAGE STATUS ====="
+  pvesm status 2>&1 | grep -E "^(Name|${STORAGE})" || true
+  vgs "${VG}" 2>&1 || true
+
+  echo
+  echo "===== ALL VOLUMES IN ${STORAGE} ====="
+  pvesm list "${STORAGE}" 2>&1 || true
+
+  echo
+  echo "===== ALL LVs IN ${VG} ====="
+  lvs -a -o lv_name,lv_path,lv_size,lv_attr,devices "${VG}" 2>&1 || true
+
+  echo
+  echo "===== LOCAL VM CONFIG FILES ====="
+  for conf in /etc/pve/qemu-server/*.conf; do
+    [ -e "$conf" ] || continue
+    echo
+    echo "----- $conf -----"
+    cat "$conf"
+  done
+
+  echo
+  echo "===== REFERENCE ANALYSIS - LOCAL CONFIG FILES ONLY ====="
+  printf '%-55s | %-8s | %-10s | %-8s | %-8s | %s\n' \
+    "volume" "vmid" "referenced" "open" "size" "path"
+
+  # Merge the Proxmox view and the LVM view so volumes known to only one of them still show up.
+  {
+    pvesm list "${STORAGE}" 2>/dev/null | awk 'NR>1 {print $1}' | sed "s#^${STORAGE}:##"
+    lvs --noheadings -o lv_name "${VG}" 2>/dev/null | awk '{print $1}'
+  } | sort -u | while read -r vol; do
+    [ -n "$vol" ] || continue
+
+    # Only VM disk and snapshot volumes are of interest.
+    case "$vol" in
+      vm-*-disk-*|snap_vm-*-disk-*) ;;
+      *) continue ;;
+    esac
+
+    vmid="unknown"
+    if [[ "$vol" =~ ^vm-([0-9]+)-disk- ]]; then
+      vmid="${BASH_REMATCH[1]}"
+    elif [[ "$vol" =~ ^snap_vm-([0-9]+)-disk- ]]; then
+      vmid="${BASH_REMATCH[1]}"
+    fi
+
+    ref="no"
+    if grep -R -Fq "$vol" /etc/pve/qemu-server 2>/dev/null; then
+      ref="yes"
+    fi
+
+    # lsof usually reports an open LV as /dev/dm-N rather than by LV name,
+    # so also check the LVM "device open" flag (6th character of lv_attr).
+    attr="$(lvs --noheadings -o lv_attr "${VG}/${vol}" 2>/dev/null | awk '{print $1}')"
+    open="no"
+    if [ "${attr:5:1}" = "o" ] || lsof 2>/dev/null | grep -Fq "$vol"; then
+      open="yes"
+    fi
+
+    size="$(lvs --noheadings -o lv_size "${VG}/${vol}" 2>/dev/null | awk '{$1=$1;print}')"
+    path="$(lvs --noheadings -o lv_path "${VG}/${vol}" 2>/dev/null | awk '{$1=$1;print}')"
+
+    printf '%-55s | %-8s | %-10s | %-8s | %-8s | %s\n' \
+      "$vol" "$vmid" "$ref" "$open" "$size" "$path"
+  done
+
+  echo
+  echo "===== DONE ====="
+} | tee "$OUT"
+
+echo
+echo "Saved audit to: $OUT"
+EOF
+
+chmod +x /root/pve-iscsi-orphan-audit.sh
+/root/pve-iscsi-orphan-audit.sh
+```
+
+The script writes a file similar to:
+
+```text
+/root/pve-iscsi-orphan-audit-<hostname>-<timestamp>.txt
+```
+
+---
+
+## How to Read the First Audit
+
+The most important section is:
+
+```text
+REFERENCE ANALYSIS - LOCAL CONFIG FILES ONLY
+```
+
+Example:
+
+```text
+volume              | vmid | referenced | open | size
+vm-107-disk-0.qcow2 | 107  | no         | no   | 4.00m
+vm-107-disk-1.qcow2 | 107  | no         | no   | 256.04g
+vm-107-disk-2.qcow2 | 107  | yes        | no   | 4.00m
+vm-107-disk-3.qcow2 | 107  | yes        | no   | 256.04g
+```
+
+Interpretation:
+
+| Field | Meaning |
+| --- | --- |
+| `referenced=yes` | The volume appears in a local VM config file. Do not delete. |
+| `referenced=no` | The volume does not appear in local VM configs. It may be orphaned, but confirm across all nodes first. |
+| `open=yes` | The volume is open on this node (LVM open flag set, or a process holds it). Do not delete. |
+| `open=no` | Nothing on this node has the volume open. Still confirm across all nodes. |
+
+!!! warning "Local reference analysis is not enough"
+    If a VM runs on another cluster node, its config does not appear under `/etc/pve/qemu-server/` on the node where you ran the audit. This can make valid disks look orphaned.
+
+    Continue to Phase 3 before deleting anything.
+
+---
+
+# Phase 3: Run Cluster-Wide Confirmation
+
+Run the following script on **every Proxmox node** in the cluster.
+
+This script is read-only.
+
+```bash
+cat > /root/pve-cluster-vm-confirm.sh <<'EOF'
+#!/usr/bin/env bash
+set -u
+
+STORAGE="iscsi-cluster-lvm"
+VG="vg_proxmox_iscsi"
+OUT="/root/pve-cluster-vm-confirm-$(hostname)-$(date +%Y%m%d-%H%M%S).txt"
+
+{
+  echo "===== NODE ====="
+  hostname
+  date
+
+  echo
+  echo "===== CLUSTER RESOURCES - VMS ====="
+  pvesh get /cluster/resources --type vm 2>&1 || true
+
+  echo
+  echo "===== LOCAL QM LIST ====="
+  qm list 2>&1 || true
+
+  echo
+  echo "===== QEMU CONFIG FILES PRESENT ====="
+  ls -la /etc/pve/qemu-server/ 2>&1 || true
+
+  echo
+  echo "===== QEMU CONFIG FILE CONTENTS ====="
+  for conf in /etc/pve/qemu-server/*.conf; do
+    [ -e "$conf" ] || continue
+    echo
+    echo "----- $conf -----"
+    cat "$conf"
+  done
+
+  echo
+  echo "===== ALL STORAGE VOLUMES ====="
+  pvesm list "${STORAGE}" 2>&1 || true
+
+  echo
+  echo "===== ALL LVs ====="
+  lvs -a -o lv_name,lv_path,lv_size,lv_attr,devices "${VG}" 2>&1 || true
+
+} | tee "$OUT"
+
+echo
+echo "Saved to: $OUT"
+EOF
+
+chmod +x /root/pve-cluster-vm-confirm.sh
+/root/pve-cluster-vm-confirm.sh
+```
+
+Collect the output file from each node.
+
+Example for a three-node cluster:
+
+```text
+/root/pve-cluster-vm-confirm-cluster-node-01-YYYYMMDD-HHMMSS.txt
+/root/pve-cluster-vm-confirm-cluster-node-02-YYYYMMDD-HHMMSS.txt
+/root/pve-cluster-vm-confirm-cluster-node-03-YYYYMMDD-HHMMSS.txt
+```
+
+---
+
+## How to Read the Cluster Confirmation
+
+For each suspicious volume, search the output collected from every node.
+
+Example candidate:
+
+```text
+vm-107-disk-1.qcow2
+```
+
+Check whether it appears in any VM config. On a cluster node, `/etc/pve/nodes/*/qemu-server/` covers the configs of every node, whereas `/etc/pve/qemu-server/` covers only the local node:
+
+```bash
+grep -R -F "vm-107-disk-1.qcow2" /etc/pve/nodes/*/qemu-server/ || true
+```
+
+If reviewing output files manually, look for config lines such as:
+
+```text
+scsi0: iscsi-cluster-lvm:vm-107-disk-1.qcow2
+sata0: iscsi-cluster-lvm:vm-107-disk-1.qcow2
+virtio0: iscsi-cluster-lvm:vm-107-disk-1.qcow2
+efidisk0: iscsi-cluster-lvm:vm-107-disk-1.qcow2
+tpmstate0: iscsi-cluster-lvm:vm-107-disk-1.qcow2
+```
+
+If a volume appears in any of those lines, it is attached to a VM and must not be deleted.
+
+---
+
+# Phase 4: Classify Candidate Volumes
+
+Use the following decision table.
+
+| Condition | Classification | Action |
+| --- | --- | --- |
+| Volume appears in any VM config on any node | In use | Do not delete |
+| Volume is opened by QEMU or another process | In use or unsafe | Do not delete |
+| Volume is a `snap_vm-*` snapshot volume | Snapshot-chain item | Inspect snapshot/backing chain before deletion |
+| Volume does not appear in any VM config and is not open | Orphan candidate | Eligible for final verification |
+| VMID no longer exists in cluster resources and disk is not referenced | Strong orphan | Eligible for cleanup |
+
+---
+
+## Example: Valid VM Disks
+
+If VM `107` has this config:
+
+```text
+efidisk0: iscsi-cluster-lvm:vm-107-disk-2.qcow2
+sata0: iscsi-cluster-lvm:vm-107-disk-3.qcow2
+```
+
+Then these disks are valid and must not be deleted:
+
+```text
+vm-107-disk-2.qcow2
+vm-107-disk-3.qcow2
+```
+
+If storage also contains:
+
+```text
+vm-107-disk-0.qcow2
+vm-107-disk-1.qcow2
+```
+
+and neither appears in any config file on any node, those are orphan candidates.
+
+---
+
+# Phase 5: Final Verification Before Deletion
+
+For each candidate volume, run the following checks on a node that can see the shared storage.
+
+Replace the volume name as appropriate.
+
+```bash
+VOL="vm-107-disk-1.qcow2"
+STORAGE="iscsi-cluster-lvm"
+VG="vg_proxmox_iscsi"
+
+echo "===== Check all cluster config references ====="
+# /etc/pve/nodes/*/qemu-server/ holds the VM configs of every cluster node;
+# /etc/pve/qemu-server/ alone would cover only the local node.
+grep -R -F "$VOL" /etc/pve/nodes/*/qemu-server/ || true
+
+echo
+echo "===== Check Proxmox storage listing ====="
+pvesm list "$STORAGE" | grep -F "$VOL" || true
+
+echo
+echo "===== Check LVM volume ====="
+lvs -a -o lv_name,lv_path,lv_size,lv_attr,devices "$VG" | grep -F "$VOL" || true
+
+echo
+echo "===== Check whether open by any process ====="
+# lsof usually lists an open LV as /dev/dm-N, so an empty result alone is not proof.
+lsof | grep -F "$VOL" || true
+# The 6th lv_attr character is "o" when the LV is open on this node.
+lvs --noheadings -o lv_name,lv_attr "${VG}/${VOL}" || true
+
+echo
+echo "===== Check qemu-img metadata if device path exists ====="
+LVPATH="$(lvs --noheadings -o lv_path "${VG}/${VOL}" 2>/dev/null | awk '{$1=$1;print}')"
+if [ -n "$LVPATH" ] && [ -e "$LVPATH" ]; then
+  qemu-img info --backing-chain "$LVPATH"
+else
+  echo "LV path missing or inactive: $LVPATH"
+fi
+```
+
+Safe deletion pattern:
+
+```text
+grep -R ...       no output
+pvesm list ...    shows the volume
+lvs ...           shows the volume
+lsof ...          no output
+lvs lv_attr ...   6th character is not "o"
+qemu-img info ... no unexpected backing file dependency
+```
+
+!!! danger "Stop if grep finds a reference"
+    If the candidate volume appears in any VM config file on any node, do not delete it.
+
+!!! danger "Stop if the volume is open"
+    If `lsof` shows the volume is open, or the sixth `lv_attr` character is `o`, do not delete it. Both checks reflect only the node where they run, so repeat them on every node.
+
+---
+
+# Phase 6: Cleanup Commands
+
+## Preferred Method: Proxmox Storage Layer
+
+Use `pvesm free` first.
+
+Example:
+
+```bash
+pvesm free iscsi-cluster-lvm:vm-107-disk-0.qcow2
+pvesm free iscsi-cluster-lvm:vm-107-disk-1.qcow2
+```
+
+Then verify:
+
+```bash
+pvesm list iscsi-cluster-lvm | grep "vm-107-disk" || true
+lvs -a -o lv_name,lv_size,lv_attr,devices vg_proxmox_iscsi | grep "vm-107-disk" || true
+vgs vg_proxmox_iscsi
+pvesm status | grep -E '^(Name|iscsi-cluster-lvm)'
+```
+
+Expected result:
+
+```text
+vm-107-disk-0.qcow2 gone
+vm-107-disk-1.qcow2 gone
+vm-107-disk-2.qcow2 still present
+vm-107-disk-3.qcow2 still present
+```
+
+---
+
+## Fallback Method: Direct LVM Removal
+
+Only use this if `pvesm free` refuses and the final verification confirms the volume is not referenced and not open.
+
+```bash
+lvremove /dev/vg_proxmox_iscsi/vm-107-disk-0.qcow2
+lvremove /dev/vg_proxmox_iscsi/vm-107-disk-1.qcow2
+```
+
+Then refresh device nodes and verify:
+
+```bash
+vgscan --mknodes
+udevadm settle
+
+pvesm list iscsi-cluster-lvm | grep "vm-107-disk" || true
+lvs -a -o lv_name,lv_size,lv_attr,devices vg_proxmox_iscsi | grep "vm-107-disk" || true
+vgs vg_proxmox_iscsi
+```
+
+!!! warning "Prefer pvesm free over lvremove"
+    `pvesm free` lets Proxmox remove the volume through its storage abstraction. Use direct `lvremove` only when Proxmox refuses and the orphan status is already proven.
+
+---
+
+# Phase 7: Post-Cleanup Validation
+
+After deleting orphan volumes, validate storage and VM health.
+
+```bash
+pvesm status
+vgs vg_proxmox_iscsi
+lvs -a -o lv_name,lv_size,lv_attr,devices vg_proxmox_iscsi
+```
+
+Check the affected VM's config:
+
+```bash
+qm config 107
+```
+
+Confirm the VM still starts or remains healthy:
+
+```bash
+qm status 107
+```
+
+If the VM is running, confirm its active QEMU process only references expected disks:
+
+```bash
+ps auxww | grep "kvm -id 107" | grep -o "/dev/vg_proxmox_iscsi/[^ ,\"]*" | sort -u
+```
+
+Expected example:
+
+```text
+/dev/vg_proxmox_iscsi/vm-107-disk-2.qcow2
+/dev/vg_proxmox_iscsi/vm-107-disk-3.qcow2
+```
+
+---
+
+# Snapshot Volume Handling
+
+Snapshot volumes require additional review.
+
+Examples:
+
+```text
+snap_vm-105-disk-0_Fresh_Install.qcow2
+snap_vm-106-disk-0_Fresh_Install_FullyUpdated.qcow2
+```
+
+Before deleting a snapshot volume, check:
+
+```bash
+qm config <vmid>
+qm listsnapshot <vmid>
+grep -R "snap_vm-" /etc/pve/nodes/*/qemu-server/ || true
+qemu-img info --backing-chain /dev/vg_proxmox_iscsi/vm-<vmid>-disk-<n>.qcow2
+```
+
+If the VM still has a `parent:` line or `qm listsnapshot` shows the snapshot, remove it through Proxmox first:
+
+```bash
+qm delsnapshot <vmid> <snapshot-name>
+```
+
+Only consider manual removal if:
+
+* the VM no longer references the snapshot,
+* no backing chain references the snapshot volume,
+* no QEMU process has it open,
+* and Proxmox cannot delete it normally.
+
+!!! danger "Do not manually delete active snapshot-chain volumes"
+    Deleting an active snapshot backing volume can corrupt the VM disk chain.
+
+---
+
+# Example Cleanup Walkthrough
+
+## Scenario
+
+VM `107` has this config:
+
+```text
+efidisk0: iscsi-cluster-lvm:vm-107-disk-2.qcow2
+sata0: iscsi-cluster-lvm:vm-107-disk-3.qcow2
+```
+
+Storage contains:
+
+```text
+vm-107-disk-0.qcow2
+vm-107-disk-1.qcow2
+vm-107-disk-2.qcow2
+vm-107-disk-3.qcow2
+```
+
+`disk-0` and `disk-1` do not appear in any config and are not open by any process.
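The scenario above amounts to a set difference: volumes present on storage minus volumes referenced by a VM config. The sketch below shows that comparison with the walkthrough's names as inline sample data. It is only an illustration: the scratch file names are placeholders invented here, and on a real node the two lists would come from `pvesm list` and the VM config files rather than from `printf`.

```bash
#!/usr/bin/env bash
# Sketch only: the data mirrors the scenario above; nothing touches real storage.
set -u
export LC_ALL=C   # keep sort and comm on the same collation order

workdir="$(mktemp -d)"   # scratch directory; file names inside are placeholders

# Volumes present on storage (sample data standing in for `pvesm list` output).
printf '%s\n' \
  vm-107-disk-0.qcow2 vm-107-disk-1.qcow2 \
  vm-107-disk-2.qcow2 vm-107-disk-3.qcow2 \
  | sort -u > "$workdir/on-storage.txt"

# Volumes referenced by VM configs: drop trailing ",option" parts,
# then strip the "key: storage:" prefix so only the volume name remains.
printf '%s\n' \
  'efidisk0: iscsi-cluster-lvm:vm-107-disk-2.qcow2' \
  'sata0: iscsi-cluster-lvm:vm-107-disk-3.qcow2' \
  | sed -e 's/,.*$//' -e 's/^.*://' | sort -u > "$workdir/referenced.txt"

# comm -23 keeps lines that appear only in the first file: the orphan candidates.
orphans="$(comm -23 "$workdir/on-storage.txt" "$workdir/referenced.txt")"
rm -rf "$workdir"

printf '%s\n' "$orphans"
```

With this sample data the script prints `vm-107-disk-0.qcow2` and `vm-107-disk-1.qcow2`. The result is only a candidate list; the open-device and snapshot-chain checks still apply before anything is removed.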
+
+## Verify
+
+```bash
+# Search the configs of every cluster node, not only the local one.
+grep -R -F "vm-107-disk-0.qcow2" /etc/pve/nodes/*/qemu-server/ || true
+grep -R -F "vm-107-disk-1.qcow2" /etc/pve/nodes/*/qemu-server/ || true
+
+lsof | grep -F "vm-107-disk-0.qcow2" || true
+lsof | grep -F "vm-107-disk-1.qcow2" || true
+```
+
+Expected output:
+
+```text
+no output
+```
+
+## Delete
+
+```bash
+pvesm free iscsi-cluster-lvm:vm-107-disk-0.qcow2
+pvesm free iscsi-cluster-lvm:vm-107-disk-1.qcow2
+```
+
+## Validate
+
+```bash
+pvesm list iscsi-cluster-lvm | grep "vm-107-disk"
+lvs -a -o lv_name,lv_size,lv_attr,devices vg_proxmox_iscsi | grep "vm-107-disk"
+vgs vg_proxmox_iscsi
+```
+
+Expected remaining volumes:
+
+```text
+vm-107-disk-2.qcow2
+vm-107-disk-3.qcow2
+```
+
+---
+
+# Technician Checklist
+
+Use this checklist before removing any orphan disk.
+
+* [ ] I ran the storage orphan audit.
+* [ ] I ran the cluster confirmation script on every Proxmox node.
+* [ ] I confirmed the candidate volume is not referenced in any VM config.
+* [ ] I confirmed the candidate volume is not open by any process.
+* [ ] I confirmed the candidate volume is not part of an active snapshot chain.
+* [ ] I confirmed the VMID relationship is understood.
+* [ ] I used `pvesm free` first.
+* [ ] I used `lvremove` only if Proxmox refused and the volume was proven orphaned.
+* [ ] I validated storage state after cleanup.
+* [ ] I validated the affected VM still references only expected disks.
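The first few checklist items reduce to a small decision rule, which could be sketched as a shell helper. `classify_volume` is a hypothetical name used only for this illustration and is not a Proxmox command. It assumes the "referenced" and "open" answers were already gathered across all nodes by the earlier phases and are passed in as `yes` or `no`.

```bash
#!/usr/bin/env bash
# Sketch only: classify_volume is a hypothetical helper, not part of Proxmox.
# Arguments: <volume name> <referenced in any config: yes|no> <open on any node: yes|no>
classify_volume() {
  # Any config reference or open handle means the volume is in use.
  if [ "$2" = "yes" ] || [ "$3" = "yes" ]; then
    echo "in-use"
    return 0
  fi
  # Unreferenced snapshot volumes still need a backing-chain review.
  case "$1" in
    snap_vm-*) echo "snapshot-review" ;;
    *)         echo "orphan-candidate" ;;
  esac
}

classify_volume vm-107-disk-1.qcow2 no no
classify_volume vm-107-disk-3.qcow2 yes no
classify_volume snap_vm-105-disk-0_Fresh_Install.qcow2 no no
```

The three calls print `orphan-candidate`, `in-use`, and `snapshot-review` in that order. A helper like this only records the outcome of the checks; it does not replace running them.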
+
+---
+
+# Quick Reference Commands
+
+## List shared storage volumes
+
+```bash
+pvesm list iscsi-cluster-lvm
+```
+
+## List LVs
+
+```bash
+lvs -a -o lv_name,lv_path,lv_size,lv_attr,devices vg_proxmox_iscsi
+```
+
+## Search VM configs
+
+```bash
+grep -R "vm-<vmid>-disk-" /etc/pve/nodes/*/qemu-server/ || true
+```
+
+## Check open files
+
+```bash
+lsof | grep "vm-<vmid>-disk-" || true
+```
+
+## Check image metadata
+
+```bash
+qemu-img info --backing-chain /dev/vg_proxmox_iscsi/vm-<vmid>-disk-<n>.qcow2
+```
+
+## Delete via Proxmox
+
+```bash
+pvesm free iscsi-cluster-lvm:vm-<vmid>-disk-<n>.qcow2
+```
+
+## Delete via LVM fallback
+
+```bash
+lvremove /dev/vg_proxmox_iscsi/vm-<vmid>-disk-<n>.qcow2
+```
+
+## Verify storage usage
+
+```bash
+pvesm status
+vgs vg_proxmox_iscsi
+```